PDFsharp & MigraDoc Foundation

PDFsharp - A .NET library for processing PDF & MigraDoc Foundation - Creating documents on the fly

All times are UTC


 Post subject: PDFSharp encoded still
Posted: Thu Mar 10, 2016 4:23 pm

Joined: Thu Mar 10, 2016 4:16 pm
Posts: 5
I'm using PDFsharp 1.50 in my solution to open a PDF document. I've done this before, but now for some reason the command

Code:
outDoc = PdfReader.Open(sourceStream, PdfDocumentOpenMode.Import);


throws a PdfReaderException that lands in my catch block. From then on, the contents still appear to be encoded and are completely unworkable. I'm not sure what happened, since I use this same method to split millions of PDF pages a year based on conditions defined in the logic. That process no longer seems to work with the same files either...

Why would the code hit that exception while attempting to open a PDF's stream?

Attached is an example of the contents of the PDF after opening:
Attachment: pdfexception_encoded.gif [ 12.5 KiB ]


Posted: Thu Mar 10, 2016 4:34 pm
PDFsharp Guru

Joined: Mon Oct 16, 2006 8:16 am
Posts: 3096
Location: Cologne, Germany
Hi!
rcast wrote:
Attached is example of the contents of the PDF after opening
That looks like a PDF file. Maybe they use their own character encoding.
I don't see how I could help you - I don't have the PDF, I don't have the code that leads to the exception.

_________________
Regards
Thomas Hoevel
PDFsharp Team


Posted: Fri Mar 11, 2016 2:26 pm

Joined: Thu Mar 10, 2016 4:16 pm
Posts: 5
The document cannot be posted publicly. I now see that the failing document is PDF version 1.7 (Acrobat 8.x), created with PDFlib+PDI 9.0.0 (.NET/Win64), while the document that works properly is also version 1.7 (Acrobat 8.x) but was created with Crystal Reports...

I'm using this code to take a Page and pull the contents as a string into a variable:

Code:
pageContents = page.Contents.CreateSingleContent().Stream.ToString();


pageContents contains the text shown in the screenshot from my previous post, and the job fails with:

"A first chance exception of type 'PdfSharp.Pdf.IO.PdfReaderException' occurred in PdfSharp.dll"


Posted: Fri Mar 11, 2016 2:54 pm

Joined: Thu Mar 10, 2016 4:16 pm
Posts: 5
I did notice that the file that fails to decode uses Identity-H fonts. Is this an issue with PDFsharp?


Posted: Wed Mar 23, 2016 2:05 am

Joined: Thu Mar 10, 2016 4:16 pm
Posts: 5
Can some PDFs not be decoded by PDFsharp?


Posted: Wed Mar 23, 2016 6:56 am
PDFsharp Expert

Joined: Sat Mar 14, 2015 10:15 am
Posts: 915
Location: CCAA
rcast wrote:
Can some PDFs not be decoded by PDFsharp?
Your screenshot in the first post shows the PostScript-like drawing instructions (the content stream) that draw a PDF page.
PDFsharp does not decode those drawing instructions - it never did.

If you want to extract text then it's up to you to do the decoding. PDF supports several methods of text encoding. If your code supports only a subset of those methods, then some files cannot be decoded with your code.


There are some PDF files that only produce gibberish when you mark a sentence and copy it to the clipboard. I assume you will have problems extracting text from such files as the character encoding is unusual or incomplete.
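
To make the point about drawing instructions concrete, here is a minimal sketch that pulls the string operands of the text-showing operators Tj and TJ out of content stream text that has already been read into a string. TextOperandSketch and Extract are hypothetical names for illustration, not part of PDFsharp. The regex approach is an assumption that only holds for simple streams: it ignores nested parentheses, inline images and literal strings containing "]". The operands it returns are still character codes in the font's encoding, not readable text.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of PDFsharp: lists the string operands of the
// text-showing operators Tj and TJ found in decoded content stream text.
public static class TextOperandSketch
{
    // A literal "(...)" string without nested parentheses, or a hex "<...>" string.
    const string PdfString = @"\((?:\\.|[^\\()])*\)|<[0-9A-Fa-f\s]*>";

    // Either one string followed by Tj, or an array followed by TJ.
    static readonly Regex ShowText = new Regex(
        @"(?<single>" + PdfString + @")\s*Tj\b|\[(?<array>[^\]]*)\]\s*TJ\b");

    public static List<string> Extract(string content)
    {
        var operands = new List<string>();
        foreach (Match m in ShowText.Matches(content))
        {
            if (m.Groups["single"].Success)
            {
                operands.Add(m.Groups["single"].Value);
            }
            else
            {
                // TJ arrays mix strings with kerning numbers; keep only the strings.
                foreach (Match s in Regex.Matches(m.Groups["array"].Value, PdfString))
                    operands.Add(s.Value);
            }
        }
        return operands;
    }
}
```

For example, Extract("BT /F1 12 Tf <00240025> Tj [(Hel) -20 (lo)] TJ ET") yields the three operands <00240025>, (Hel) and (lo). Turning those into text still requires the encoding information of the font selected by Tf.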

_________________
Best regards
Thomas
(Freelance Software Developer with several years of MigraDoc/PDFsharp experience)


Posted: Thu Mar 31, 2016 9:16 am

Joined: Thu Mar 31, 2016 8:04 am
Posts: 1
Hello,

I am also trying to extract text from a PDF with Identity-H encoding. From what I understand, I will need to use an embedded CMap to achieve this.

Could someone point me in the right direction on how to extract the embedded ToUnicode map with PDFsharp and use it to translate character codes?

Thanks,
Ian
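
As a rough illustration of the translation step asked about above, the following sketch parses the bfchar and bfrange sections of a ToUnicode CMap and uses the result to decode a hex string operand. It assumes the CMap text has already been obtained as a string from the font's /ToUnicode stream (that step is not shown) and that character codes are two bytes wide, as with Identity-H. ToUnicodeMapSketch, Parse and Decode are hypothetical names, not PDFsharp API, and codespace ranges and usecmap references are ignored.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of PDFsharp: maps character codes to Unicode
// text using the bfchar/bfrange sections of a ToUnicode CMap.
public static class ToUnicodeMapSketch
{
    static readonly Regex HexString = new Regex(@"<([0-9A-Fa-f]+)>");
    // "<src> <dst>" pairs inside a bfchar section.
    static readonly Regex BfChar = new Regex(@"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>");
    // "<lo> <hi> <dst>" or "<lo> <hi> [<dst1> <dst2> ...]" inside a bfrange section.
    static readonly Regex BfRange = new Regex(
        @"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>\s*(?:<([0-9A-Fa-f]+)>|\[([^\]]*)\])");

    public static Dictionary<int, string> Parse(string cmap)
    {
        var map = new Dictionary<int, string>();
        foreach (Match section in Regex.Matches(cmap, @"beginbfchar(.*?)endbfchar", RegexOptions.Singleline))
            foreach (Match m in BfChar.Matches(section.Groups[1].Value))
                map[Hex(m.Groups[1].Value)] = Utf16(m.Groups[2].Value);

        foreach (Match section in Regex.Matches(cmap, @"beginbfrange(.*?)endbfrange", RegexOptions.Singleline))
            foreach (Match m in BfRange.Matches(section.Groups[1].Value))
            {
                int lo = Hex(m.Groups[1].Value), hi = Hex(m.Groups[2].Value);
                if (m.Groups[3].Success)
                {
                    // Single destination: the last UTF-16 unit is incremented per code.
                    string first = Utf16(m.Groups[3].Value);
                    for (int code = lo; code <= hi; code++)
                        map[code] = first.Substring(0, first.Length - 1)
                                  + (char)(first[first.Length - 1] + (code - lo));
                }
                else
                {
                    // Array form: one destination string per code in the range.
                    MatchCollection items = HexString.Matches(m.Groups[4].Value);
                    for (int i = 0; i < items.Count && lo + i <= hi; i++)
                        map[lo + i] = Utf16(items[i].Groups[1].Value);
                }
            }
        return map;
    }

    // Decodes a hex operand such as "00240025", assuming 2-byte character codes.
    public static string Decode(string hex, Dictionary<int, string> map)
    {
        var sb = new StringBuilder();
        for (int i = 0; i + 4 <= hex.Length; i += 4)
        {
            int code = Hex(hex.Substring(i, 4));
            sb.Append(map.TryGetValue(code, out var text) ? text : "\uFFFD");
        }
        return sb.ToString();
    }

    static int Hex(string s) =>
        int.Parse(s, NumberStyles.HexNumber, CultureInfo.InvariantCulture);

    // Interprets hex digits as UTF-16BE text, padding an odd digit count with 0.
    static string Utf16(string hex)
    {
        if (hex.Length % 2 == 1) hex += "0";
        var bytes = new byte[hex.Length / 2];
        for (int i = 0; i < bytes.Length; i++)
            bytes[i] = byte.Parse(hex.Substring(2 * i, 2), NumberStyles.HexNumber, CultureInfo.InvariantCulture);
        return Encoding.BigEndianUnicode.GetString(bytes);
    }
}
```

With a CMap that maps <0003> to <0020> in a bfchar section and the range <0024> <0026> to <0041> in a bfrange section, Decode("0024000300250026", map) gives "A BC". Codes missing from the map come out as U+FFFD, which is one way such files end up producing gibberish when the embedded map is incomplete.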


Posted: Thu Apr 14, 2016 1:46 pm

Joined: Thu Mar 10, 2016 4:16 pm
Posts: 5
Let me ask: if the PDF includes any non-standard fonts that are not a subset of Windows fonts, will PDFsharp have issues parsing the content?


Posted: Thu Apr 14, 2016 1:51 pm
PDFsharp Guru

Joined: Mon Oct 16, 2006 8:16 am
Posts: 3096
Location: Cologne, Germany
rcast wrote:
Let me ask: if the PDF includes any non-standard fonts that are not a subset of Windows fonts, will PDFsharp have issues parsing the content?
No. Since PDFsharp does not parse the contents of the pages, there will be no issues parsing that.
If you want to extract text then do the parsing yourself or use a third-party library that does it.

_________________
Regards
Thomas Hoevel
PDFsharp Team

