PDFsharp & MigraDoc Foundation

PDFsharp - A .NET library for processing PDF & MigraDoc Foundation - Creating documents on the fly
All times are UTC


Posted: Wed Sep 12, 2012 10:26 pm

Joined: Wed Sep 12, 2012 10:07 pm
Posts: 4
I want to extract text from a PDF file.

This is my code:

Code:
using System.Text;
using PdfSharp.Pdf;
using PdfSharp.Pdf.IO;

// sTempFile is the path of the input PDF (defined elsewhere).
PdfDocument document = PdfReader.Open(sTempFile, PdfDocumentOpenMode.ReadOnly);

int pageNo = -1;
foreach (PdfPage page in document.Pages)
{
    pageNo++;
    // Collect the raw content stream bytes of this page, one char per byte.
    var pageContent = new StringBuilder();

    for (int i = 0; i < page.Contents.Elements.Count; i++)
    {
        PdfDictionary.PdfStream stream = page.Contents.Elements.GetDictionary(i).Stream;
        // stream.Value holds the content stream bytes (PDF operators such as Tj), not decoded text.
        foreach (byte b in stream.Value)
        {
            pageContent.Append((char)b);
        }
    }

    string strStreamValue = pageContent.ToString();
}


But what I get is unreadable text like this:

0.05 0 0 -0.05 0 792 cm
0 0 0 rg
0 0 0 RG
2 J
30 w
357 662 m
.
.
.
(!"#$%&'\('\)'*%+,\)-.) Tj
0 -240 Td
(/012) Tj
0 -240 Td
(,\)-%345%$6%748'549') Tj
3840 480 Td
(&'\('\)'*) Tj
0 -1020 Td
(:;'5) Tj
0 -270 Td
(:5<\)') Tj


Is there an encoding for PDF files?

Does anyone know how I can fix this problem?

Thank you in advance.


Posted: Thu Sep 13, 2012 8:19 am
PDFsharp Guru

Joined: Mon Oct 16, 2006 8:16 am
Posts: 3100
Location: Cologne, Germany
I clicked on "Search" and found these topics:
viewtopic.php?p=1603#p1603
viewtopic.php?p=5556#p5556

_________________
Regards
Thomas Hoevel
PDFsharp Team


Posted: Thu Sep 13, 2012 1:56 pm

Joined: Wed Sep 12, 2012 10:07 pm
Posts: 4
Hi Thomas,

Thank you for your reply.

I looked at those links and tried one of them. The result was the same.

On this line of my code:
Code:
PdfDictionary.PdfStream stream = page.Contents.Elements.GetDictionary(i).Stream;

I get unreadable text like (!"#$%&'\('\)'*%+,\)-.) Tj
As far as I know, the Tj command carries text data, but that text data is not readable.
My question is: does PDFsharp handle the encoding of a PDF file, and how can I get the real text from a PDF file using PDFsharp?

Regards,
Majid


Posted: Thu Sep 13, 2012 2:12 pm
PDFsharp Guru

Joined: Mon Oct 16, 2006 8:16 am
Posts: 3100
Location: Cologne, Germany
Hi, Majid,

PDFsharp does not decode the text for you.

BTW: some PDF files are not properly encoded. Try selecting text in your file with Adobe Reader, copy it to the clipboard, and paste it into, e.g., Notepad. Do you get the correct text there?
If not, the PDF file is malformed.
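
A note on why the Tj strings look scrambled: the operand of Tj is a sequence of character codes that are interpreted through the font selected at that point, not plain text. The codes in the sample above run "!", "\"", "#", "$", ... in order of first use, which looks like a subsetted font with its own code assignment, although that is only a guess from the output. The sketch below is plain C# without PDFsharp calls. It pulls the literal string operands of Tj out of a content stream and translates them with a code-to-character table. The names TjSketch, ReadTjOperands and Decode are made up for illustration, and the table has to be built by the caller from the font's encoding information; nothing here derives it.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class TjSketch
{
    // Returns the raw character codes of every literal string "(...)" that is
    // directly followed by the Tj operator. Hex strings "<...>" and TJ arrays
    // are ignored, and a backslash before a line break is not treated as a
    // line continuation (simplifications for this sketch).
    public static List<byte[]> ReadTjOperands(string content)
    {
        var result = new List<byte[]>();
        int i = 0;
        while (i < content.Length)
        {
            if (content[i] != '(') { i++; continue; }
            i++; // skip the opening parenthesis
            var codes = new List<byte>();
            int depth = 1;
            while (i < content.Length && depth > 0)
            {
                char c = content[i++];
                if (c == '\\' && i < content.Length)
                {
                    char e = content[i++];
                    if (e >= '0' && e <= '7')
                    {
                        // Octal escape with one to three digits.
                        int value = e - '0';
                        int digits = 1;
                        while (digits < 3 && i < content.Length &&
                               content[i] >= '0' && content[i] <= '7')
                        {
                            value = value * 8 + (content[i++] - '0');
                            digits++;
                        }
                        codes.Add((byte)(value & 0xFF));
                    }
                    else
                    {
                        switch (e)
                        {
                            case 'n': codes.Add(0x0A); break;
                            case 'r': codes.Add(0x0D); break;
                            case 't': codes.Add(0x09); break;
                            case 'b': codes.Add(0x08); break;
                            case 'f': codes.Add(0x0C); break;
                            default: codes.Add((byte)e); break; // \( \) \\ and others
                        }
                    }
                }
                else if (c == '(') { depth++; codes.Add((byte)c); }
                else if (c == ')') { depth--; if (depth > 0) codes.Add((byte)c); }
                else codes.Add((byte)c);
            }
            // Keep the string only if the next token starts with "Tj".
            int j = i;
            while (j < content.Length && char.IsWhiteSpace(content[j])) j++;
            if (j + 1 < content.Length && content[j] == 'T' && content[j + 1] == 'j')
                result.Add(codes.ToArray());
        }
        return result;
    }

    // Translates character codes with a caller-supplied table; codes that are
    // not in the table become '?'.
    public static string Decode(byte[] codes, IDictionary<byte, char> map)
    {
        var sb = new StringBuilder();
        foreach (byte b in codes)
            sb.Append(map.TryGetValue(b, out char ch) ? ch : '?');
        return sb.ToString();
    }
}
```

The hard part is the table: for a real file it would come from the font dictionary (its /Encoding entry or a /ToUnicode CMap), and a table is needed per font. Tools that return readable text, such as Adobe Reader's copy and paste, do that lookup for you.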

_________________
Regards
Thomas Hoevel
PDFsharp Team


Posted: Thu Sep 13, 2012 2:33 pm

Joined: Wed Sep 12, 2012 10:07 pm
Posts: 4
Hi Thomas,

I can select text in Adobe Reader and copy it into Notepad, and I see the correct text.
I also tried iTextSharp, and that tool returns the correct text data from the PDF file.
I do not know what the problem is :(

Thanks,
majid


Posted: Fri Sep 14, 2012 1:53 pm

Joined: Wed Sep 12, 2012 10:07 pm
Posts: 4
Hi Thomas,

I found that there are different PDF specification versions, like PDF-1.2, PDF-1.3, ...

PDFsharp works fine with PDF-1.3, but when I try to get text data from a PDF file that uses the PDF-1.2 specification, it returns bad text data.
Do you have any comments or suggestions on how to handle this situation?

It would be great if you could give me a clue about what I should do.

Thanks,
majid


Posted: Mon Sep 17, 2012 8:26 am
PDFsharp Guru

Joined: Mon Oct 16, 2006 8:16 am
Posts: 3100
Location: Cologne, Germany
robatjazi wrote:
PDFsharp works fine with PDF-1.3, but when I try to get text data from a PDF file that uses the PDF-1.2 specification, it returns bad text data.
Do you have any comments or suggestions on how to handle this situation?

I presume the 1.2 files come from a different application than the 1.3 files.
The files that work will still work when you change the header from 1.3 to 1.2. The files that do not work will still fail with PDFsharp if you change the header from 1.2 to 1.3. For Adobe Reader it should make no difference either.
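
To see what "the header" refers to: a PDF file starts with a line such as %PDF-1.2, which only declares a version. The sketch below is plain C# without PDFsharp calls and reads that declared version from the start of a stream. The names PdfHeaderSketch and ReadDeclaredVersion are made up for illustration, and the code assumes the header sits within the first 1024 bytes and that a single Read call returns them.

```csharp
using System;
using System.IO;
using System.Text;

public static class PdfHeaderSketch
{
    // Returns the version declared in the "%PDF-x.y" header,
    // or an empty string if no header is found in the first 1024 bytes.
    public static string ReadDeclaredVersion(Stream stream)
    {
        var buffer = new byte[1024];
        int read = stream.Read(buffer, 0, buffer.Length);
        string head = Encoding.ASCII.GetString(buffer, 0, read);

        int pos = head.IndexOf("%PDF-", StringComparison.Ordinal);
        if (pos < 0) return string.Empty;

        // Collect the digits and the dot that follow "%PDF-".
        int start = pos + 5;
        int end = start;
        while (end < head.Length && (char.IsDigit(head[end]) || head[end] == '.'))
            end++;
        return head.Substring(start, end - start);
    }
}
```

Comparing this value for a file that works and a file that fails can confirm which producer wrote which version, but as stated above, editing the header alone should not change how the text strings are encoded.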

_________________
Regards
Thomas Hoevel
PDFsharp Team

