PDFsharp & MigraDoc Foundation
https://forum.pdfsharp.net/

Problem for getting text from PDF file
https://forum.pdfsharp.net/viewtopic.php?f=2&t=2138

Author:  robatjazi [ Wed Sep 12, 2012 10:26 pm ]
Post subject:  Problem for getting text from PDF file

I want to extract data from a PDF file.

The following is my code:

Code:
PdfDocument document = PdfReader.Open(sTempFile, PdfDocumentOpenMode.ReadOnly);

int pageNo = -1;
string strStreamValue;
foreach (PdfPage page in document.Pages)
{
   pageNo++;
   var pageContent = new System.Text.StringBuilder();

   // Collect the raw content stream of every element on the page.
   for (int i = 0; i < page.Contents.Elements.Count; i++)
   {
      PdfDictionary.PdfStream stream = page.Contents.Elements.GetDictionary(i).Stream;
      // Map each byte to the char with the same code; no font encoding is applied here.
      foreach (byte b in stream.Value)
      {
         pageContent.Append((char)b);
      }
   }

   strStreamValue = pageContent.ToString();
}


But what I get is unreadable text like this:

0.05 0 0 -0.05 0 792 cm
0 0 0 rg
0 0 0 RG
2 J
30 w
357 662 m
.
.
.
(!"#$%&'\('\)'*%+,\)-.) Tj
0 -240 Td
(/012) Tj
0 -240 Td
(,\)-%345%$6%748'549') Tj
3840 480 Td
(&'\('\)'*) Tj
0 -1020 Td
(:;'5) Tj
0 -270 Td
(:5<\)') Tj


Does the PDF file use some kind of encoding?

Does anyone know how I can fix this problem?

Thank you in advance.

Author:  Thomas Hoevel [ Thu Sep 13, 2012 8:19 am ]
Post subject:  Re: Problem for getting text from PDF file

I clicked on "Search" and found these topics:
https://forum.pdfsharp.net/viewtopic.php?p=1603#p1603
https://forum.pdfsharp.net/viewtopic.php?p=5556#p5556

Author:  robatjazi [ Thu Sep 13, 2012 1:56 pm ]
Post subject:  Re: Problem for getting text from PDF file

Hi Thomas,

Thank you for your reply.

I looked at those links and tried one of them. The result was the same.

In this line of my code:
Code:
PdfDictionary.PdfStream stream = page.Contents.Elements.GetDictionary(i).Stream;

I get unreadable text like (!"#$%&'\('\)'*%+,\)-.) Tj
As far as I know, the Tj operator carries text data, but that text is not readable.
My question is: does PDFsharp handle the encoding of the PDF file? And how can I get the real text from a PDF file using PDFsharp?

Regards,
Majid

Author:  Thomas Hoevel [ Thu Sep 13, 2012 2:12 pm ]
Post subject:  Re: Problem for getting text from PDF file

Hi, Majid,

PDFsharp does not decode the text for you.

BTW: some PDF files are not properly encoded. Try to select text in your file using Adobe Reader, copy it to the clipboard and paste it to e.g. Notepad. Do you get the correct text here?
If not: malformed PDF file.
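
As a rough illustration of what decoding by hand would involve (a sketch only, not PDFsharp API): the snippet below pulls the operands of Tj operators out of a content stream string like the one posted above and maps each character code through a lookup table. ContentStreamText, ExtractTjOperands and Decode are hypothetical names, and the lookup table is an assumption: the real one has to be built from the font's /Encoding or /ToUnicode entry in the page resources. Only the \(, \) and \\ escapes are handled; TJ arrays, hex strings and octal escapes are ignored.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of PDFsharp.
public static class ContentStreamText
{
    // A literal string "( ... )" followed by the Tj operator.
    // Escaped characters (\x) are allowed inside; nested unescaped parentheses are not.
    private static readonly Regex TjOperand =
        new Regex(@"\(((?:\\.|[^\\()])*)\)\s*Tj", RegexOptions.Singleline);

    // Returns the still-encoded operand of every Tj operator, in stream order.
    public static List<string> ExtractTjOperands(string content)
    {
        var result = new List<string>();
        foreach (Match m in TjOperand.Matches(content))
        {
            result.Add(Unescape(m.Groups[1].Value));
        }
        return result;
    }

    // Simplification: drops the backslash and keeps the next character as-is.
    // This is correct for \( \) and \\ only.
    private static string Unescape(string s)
    {
        var sb = new StringBuilder(s.Length);
        for (int i = 0; i < s.Length; i++)
        {
            if (s[i] == '\\' && i + 1 < s.Length)
            {
                i++;
            }
            sb.Append(s[i]);
        }
        return sb.ToString();
    }

    // Maps each character code through a table that the caller supplies.
    // Assumption: one byte per character; codes missing from the table become '?'.
    public static string Decode(string codes, IReadOnlyDictionary<char, char> codeToUnicode)
    {
        var sb = new StringBuilder(codes.Length);
        foreach (char c in codes)
        {
            sb.Append(codeToUnicode.TryGetValue(c, out char u) ? u : '?');
        }
        return sb.ToString();
    }
}
```

For example, ExtractTjOperands("(/012) Tj") yields the single string /012, which is still encoded. It only becomes readable after Decode is given a table taken from the font used on that page.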

Author:  robatjazi [ Thu Sep 13, 2012 2:33 pm ]
Post subject:  Re: Problem for getting text from PDF file

Hi Thomas,

I can select the text in Adobe Reader and copy it into Notepad, and I see the correct text.
I also tried iTextSharp, and that tool returns the correct text data from the PDF file.
I do not know what the problem is :(

Thanks,
majid

Author:  robatjazi [ Fri Sep 14, 2012 1:53 pm ]
Post subject:  Re: Problem for getting text from PDF file

Hi Thomas,

I found that there are different PDF specification versions, like PDF-1.2, PDF-1.3, ...

PDFsharp works fine with PDF-1.3, but when I want to get text data from a PDF file with the PDF-1.2 specification, it returns bad text data.
Do you have any comment or suggestion on how to handle this situation?

It would be great if you could give me a clue about what I should do.

Thanks,
majid

Author:  Thomas Hoevel [ Mon Sep 17, 2012 8:26 am ]
Post subject:  Re: Problem for getting text from PDF file

robatjazi wrote:
PDFsharp works fine with PDF-1.3, but when I want to get text data from a PDF file with the PDF-1.2 specification, it returns bad text data.
Do you have any comment or suggestion on how to handle this situation?

I presume the 1.2 files come from a different application than the 1.3 files.
The files that work will still work when you change the header from 1.3 to 1.2. The files that do not work will still fail with PDFsharp if you change the header from 1.2 to 1.3. For Adobe Reader it should make no difference either.

All times are UTC