PDFsharp & MigraDoc Foundation • View topic - Problem for getting text from PDF file


All times are UTC


Problem for getting text from PDF file

Moderator: Stefan Lange


robatjazi

Post subject: Problem for getting text from PDF file

Posted: Wed Sep 12, 2012 10:26 pm

Joined: Wed Sep 12, 2012 10:07 pm
Posts: 4

I want to extract text from a PDF file.

The following is my code:

Code:

using System.Text;
using PdfSharp.Pdf;
using PdfSharp.Pdf.IO;

PdfDocument document = PdfReader.Open(sTempFile, PdfDocumentOpenMode.ReadOnly);

int pageNo = -1;
foreach (PdfPage page in document.Pages)
{
   pageNo++;
   StringBuilder streamText = new StringBuilder();

   // Append the raw bytes of every content stream on the page, one char per byte.
   for (int i = 0; i < page.Contents.Elements.Count; i++)
   {
      PdfDictionary.PdfStream stream = page.Contents.Elements.GetDictionary(i).Stream;
      foreach (byte b in stream.Value)
      {
         streamText.Append((char)b);
      }
   }

   // Raw content stream of the current page (PDF operators, not decoded text).
   string strStreamValue = streamText.ToString();
}

But what I get is unreadable text like this:

0.05 0 0 -0.05 0 792 cm
0 0 0 rg
0 0 0 RG
2 J
30 w
357 662 m
.
.
.
(!"#$%&'$'$'*%+,\)-.) Tj
0 -240 Td
(/012) Tj
0 -240 Td
(,\)-%345%$6%748'549') Tj
3840 480 Td
(&'$'$'*) Tj
0 -1020 Td
(:;'5) Tj
0 -270 Td
(:5<\)') Tj

Is there some encoding applied to the text in a PDF file?

Does anyone know how I can fix this problem?

Thank you in advance.
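
The dump above is a PDF content stream: a sequence of operands followed by operators, where each `(...) Tj` shows one string. As a minimal sketch (not part of PDFsharp; the class and method names are made up for illustration), the string operands can be pulled out of the stream text like this. It only handles literal strings followed by `Tj`, and it returns character codes, which are not necessarily readable text.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper, not a PDFsharp API.
static class ContentStreamText
{
    // Returns the literal-string operand of every "Tj" operator with PDF
    // backslash escapes resolved. Hex strings (<...>) and TJ arrays are ignored.
    public static List<string> ExtractTjOperands(string content)
    {
        var results = new List<string>();
        int i = 0;
        while (i < content.Length)
        {
            if (content[i] != '(') { i++; continue; }

            var sb = new StringBuilder();
            int depth = 1;   // literal strings may contain balanced parentheses
            i++;             // skip the opening parenthesis
            while (i < content.Length && depth > 0)
            {
                char c = content[i++];
                if (c == '\\' && i < content.Length)
                {
                    char e = content[i++];
                    switch (e)
                    {
                        case 'n': sb.Append('\n'); break;
                        case 'r': sb.Append('\r'); break;
                        case 't': sb.Append('\t'); break;
                        case 'b': sb.Append('\b'); break;
                        case 'f': sb.Append('\f'); break;
                        case '\n': break;   // backslash + newline: line continuation
                        case '\r':
                            if (i < content.Length && content[i] == '\n') i++;
                            break;
                        default:
                            if (e >= '0' && e <= '7')
                            {
                                // Octal escape with up to three digits.
                                int code = e - '0';
                                int digits = 1;
                                while (digits < 3 && i < content.Length
                                       && content[i] >= '0' && content[i] <= '7')
                                {
                                    code = code * 8 + (content[i++] - '0');
                                    digits++;
                                }
                                sb.Append((char)(code & 0xFF));
                            }
                            else
                            {
                                sb.Append(e);   // covers \( \) and \\
                            }
                            break;
                    }
                }
                else if (c == '(') { depth++; sb.Append(c); }
                else if (c == ')') { depth--; if (depth > 0) sb.Append(c); }
                else sb.Append(c);
            }

            // Keep the string only if the next token is the Tj operator.
            int j = i;
            while (j < content.Length && char.IsWhiteSpace(content[j])) j++;
            bool isTj = j + 1 < content.Length
                        && content[j] == 'T' && content[j + 1] == 'j'
                        && (j + 2 == content.Length || char.IsWhiteSpace(content[j + 2]));
            if (isTj) results.Add(sb.ToString());
        }
        return results;
    }
}
```

Called on a fragment of the dump, for example `ContentStreamText.ExtractTjOperands("(!\"#$%&'$'$'*%+,\\)-.) Tj\n0 -240 Td\n(/012) Tj")`, it returns two strings, with the `\)` escape reduced to a plain `)`. The result is still the same unreadable character codes; turning them into text requires the encoding of the font that is active when `Tj` runs.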


Thomas Hoevel

Post subject: Re: Problem for getting text from PDF file

Posted: Thu Sep 13, 2012 8:19 am

PDFsharp Guru

Joined: Mon Oct 16, 2006 8:16 am
Posts: 3110
Location: Cologne, Germany

I clicked on "Search" and found these topics:
viewtopic.php?p=1603#p1603
viewtopic.php?p=5556#p5556

_________________
Regards
Thomas Hoevel
PDFsharp Team


robatjazi

Post subject: Re: Problem for getting text from PDF file

Posted: Thu Sep 13, 2012 1:56 pm

Joined: Wed Sep 12, 2012 10:07 pm
Posts: 4

Hi Thomas,

Thank you for the reply.

I looked at those links and tried one of them. The result was the same.

From this line of my code:

Code:

PdfDictionary.PdfStream stream = page.Contents.Elements.GetDictionary(i).Stream;

I get unreadable text like (!"#$%&'$'$'*%+,\)-.) Tj
As far as I know, the Tj command carries text data, but that text data is not readable.
My question is: does PDFsharp handle the encoding of a PDF file? And how can I get the real text from a PDF file using PDFsharp?

Regards,
Majid


Thomas Hoevel

Post subject: Re: Problem for getting text from PDF file

Posted: Thu Sep 13, 2012 2:12 pm

PDFsharp Guru

Joined: Mon Oct 16, 2006 8:16 am
Posts: 3110
Location: Cologne, Germany

Hi, Majid,

PDFsharp does not decode the text for you.

BTW: some PDF files are not properly encoded. Try to select text in your file using Adobe Reader, copy it to the clipboard and paste it to e.g. Notepad. Do you get the correct text here?
If not: malformed PDF file.

_________________
Regards
Thomas Hoevel
PDFsharp Team
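
To illustrate what "decoding" means here: the bytes inside a `Tj` string are character codes that are interpreted through the current font, and a text extractor has to map them to Unicode itself, typically with a table derived from the font's encoding or its ToUnicode CMap. The sketch below shows only that last mapping step. The class name and the table contents are invented for the example; they are not PDFsharp APIs and not the real mapping of any file discussed in this topic.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper, not a PDFsharp API.
static class GlyphCodeDecoder
{
    // Maps each single-byte character code to Unicode text through the table.
    // Codes missing from the table become U+FFFD (replacement character).
    public static string Decode(string codes, IDictionary<byte, string> toUnicode)
    {
        var sb = new StringBuilder();
        foreach (char c in codes)
        {
            string mapped;
            if (c <= 0xFF && toUnicode.TryGetValue((byte)c, out mapped))
                sb.Append(mapped);
            else
                sb.Append('\uFFFD');
        }
        return sb.ToString();
    }
}

static class GlyphCodeDecoderDemo
{
    public static string Run()
    {
        // Invented table: a subset font that numbers its glyphs from 0x21 upward.
        var table = new Dictionary<byte, string>
        {
            { 0x21, "H" }, { 0x22, "e" }, { 0x23, "l" }, { 0x24, "o" }
        };
        return GlyphCodeDecoder.Decode("!\"##$", table);
    }
}
```

With this invented table, the codes `!"##$` come out as `Hello`. Building the real table means reading the font dictionaries referenced by the page resources, which is the part PDFsharp leaves to the caller.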


robatjazi

Post subject: Re: Problem for getting text from PDF file

Posted: Thu Sep 13, 2012 2:33 pm

Joined: Wed Sep 12, 2012 10:07 pm
Posts: 4

Hi Thomas,

I can select text in Adobe Reader and copy it into Notepad, and I see the correct text.
I also tried iTextSharp, and that tool returns the correct text data from the PDF file.
I do not know what the problem is.

Thanks,
majid


robatjazi

Post subject: Re: Problem for getting text from PDF file

Posted: Fri Sep 14, 2012 1:53 pm

Joined: Wed Sep 12, 2012 10:07 pm
Posts: 4

Hi Thomas,

I found that there are different PDF specification versions, such as PDF-1.2, PDF-1.3, ...

PDFsharp works fine with PDF-1.3 files, but when I try to get text data from a PDF-1.2 file, it returns bad text data.
Do you have any comments or suggestions on how to handle this situation?

It would be great if you could give me a clue about what I should do.

Thanks,
majid


Thomas Hoevel

Post subject: Re: Problem for getting text from PDF file

Posted: Mon Sep 17, 2012 8:26 am

PDFsharp Guru

Joined: Mon Oct 16, 2006 8:16 am
Posts: 3110
Location: Cologne, Germany

robatjazi wrote:

PDFsharp works fine with PDF-1.3 files, but when I try to get text data from a PDF-1.2 file, it returns bad text data.
Do you have any comments or suggestions on how to handle this situation?

I presume the 1.2 files come from a different application than the 1.3 files.
The files that work will still work when you change the header from 1.3 to 1.2. The files that do not work will still fail with PDFsharp if you change the header from 1.2 to 1.3. For Adobe Reader it should make no difference either.

_________________
Regards
Thomas Hoevel
PDFsharp Team
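
The header Thomas refers to is the `%PDF-x.y` marker at the very start of the file. A small sketch for reading it, useful for checking which version label a file carries, is shown below. The class and method names are assumptions for this example, not PDFsharp APIs, and the sketch assumes the marker sits in the first bytes of the stream.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper, not a PDFsharp API.
static class PdfHeader
{
    // Returns e.g. "1.2" for a stream that starts with "%PDF-1.2",
    // or null if the stream does not start with a PDF header.
    public static string ReadVersion(Stream input)
    {
        byte[] buffer = new byte[16];
        int read = input.Read(buffer, 0, buffer.Length);
        string head = Encoding.ASCII.GetString(buffer, 0, read);

        const string marker = "%PDF-";
        if (!head.StartsWith(marker, StringComparison.Ordinal))
            return null;

        // Collect the digits and dots that follow the marker.
        int end = marker.Length;
        while (end < head.Length && (char.IsDigit(head[end]) || head[end] == '.'))
            end++;

        return end > marker.Length
            ? head.Substring(marker.Length, end - marker.Length)
            : null;
    }
}
```

For a file on disk this could be called as `PdfHeader.ReadVersion(File.OpenRead(path))`, where `path` is whatever file is being inspected. The value only tells you what the producing application wrote into the header; as the reply above explains, it does not determine how the text inside the content streams is encoded.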
