PDFsharp & MigraDoc Foundation • View topic

All times are UTC

PDFSharp encoded still

Moderator: Stefan Lange

[ 9 posts ]

rcast

Post subject: PDFSharp encoded still

Posted: Thu Mar 10, 2016 4:23 pm

Joined: Thu Mar 10, 2016 4:16 pm
Posts: 5

I'm using PDFsharp 1.50 in my solution to open a PDF document. I've done this before, but now for some reason the command

Code:

outDoc = PdfReader.Open(sourceStream, PdfDocumentOpenMode.Import);

throws a PdfReaderException that lands in my catch block. From then on the contents still seem to be encoded and are completely unworkable. I'm not sure what happened, since I use this same method to split millions of PDF pages a year based on conditions defined in the logic. That process doesn't seem to be working now either with the same files...

Why would the code hit that exception while attempting to open a PDF's stream?

Attached is an example of the contents of the PDF after opening:

Attachment: pdfexception_encoded.gif

Thomas Hoevel

Post subject: Re: PDFSharp encoded still

Posted: Thu Mar 10, 2016 4:34 pm

PDFsharp Guru

Joined: Mon Oct 16, 2006 8:16 am
Posts: 3096
Location: Cologne, Germany

Hi!

rcast wrote:

Attached is an example of the contents of the PDF after opening

That looks like a PDF file. Maybe they use their own character encoding.
I don't see how I could help you - I don't have the PDF, I don't have the code that leads to the exception.

_________________
Regards
Thomas Hoevel
PDFsharp Team

rcast

Post subject: Re: PDFSharp encoded still

Posted: Fri Mar 11, 2016 2:26 pm

Joined: Thu Mar 10, 2016 4:16 pm
Posts: 5

The document cannot be posted publicly. I now see that the failing document is PDF version 1.7 (Acrobat 8.x), created with PDFlib+PDI 9.0.0 (.NET/Win64), while the document that works properly is version 1.7 (Acrobat 8.x), created with Crystal Reports...

I'm using this code to take a Page and pull the contents as a string into a variable:

Code:

pageContents = page.Contents.CreateSingleContent().Stream.ToString();

pageContents then contains what the screenshot in my previous post shows, and the job fails with:

"A first chance exception of type 'PdfSharp.Pdf.IO.PdfReaderException' occurred in PdfSharp.dll"

rcast

Post subject: Re: PDFSharp encoded still

Posted: Fri Mar 11, 2016 2:54 pm

Joined: Thu Mar 10, 2016 4:16 pm
Posts: 5

I did notice that the file I'm having trouble decoding uses fonts with Identity-H encoding. Is this an issue with PDFsharp?

rcast

Post subject: Re: PDFSharp encoded still

Posted: Wed Mar 23, 2016 2:05 am

Joined: Thu Mar 10, 2016 4:16 pm
Posts: 5

Can some PDFs not be decoded by PDFsharp?

TH-Soft

Post subject: Re: PDFSharp encoded still

Posted: Wed Mar 23, 2016 6:56 am

PDFsharp Expert

Joined: Sat Mar 14, 2015 10:15 am
Posts: 915
Location: CCAA

rcast wrote:

Can some PDFs not be decoded by PDFsharp?

Your screenshot in the first post shows the PostScript-like instructions that draw a PDF page.
PDFsharp does not decode those drawing instructions - it never did.

If you want to extract text then it's up to you to do the decoding. PDF supports several methods of text encoding. If your code supports only a subset of those methods, then some files cannot be decoded with your code.

There are some PDF files that only produce gibberish when you mark a sentence and copy it to the clipboard. I assume you will have problems extracting text from such files as the character encoding is unusual or incomplete.

_________________
Best regards
Thomas
(Freelance Software Developer with several years of MigraDoc/PDFsharp experience)
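
A note on what "doing the decoding" starts with: the page contents are a sequence of operands and operators, and the text sits in the string operands of the text-showing operators (Tj, TJ, ' and "). The following is only a minimal sketch, not PDFsharp code. It assumes the content stream is already available as an unfiltered string; ContentStreamText and ExtractShownStrings are hypothetical names, and inline images and comments are not handled.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration; not part of PDFsharp.
public static class ContentStreamText
{
    // Returns the raw string operands, "(...)" or "<...>", of Tj, TJ, ' and ".
    public static List<string> ExtractShownStrings(string content)
    {
        var shown = new List<string>();
        var pending = new List<string>();   // strings seen since the last operator
        int i = 0;
        while (i < content.Length)
        {
            char c = content[i];
            if (c == '(')
            {
                // Literal string: honor backslash escapes and balanced parentheses.
                int depth = 1, start = i++;
                while (i < content.Length && depth > 0)
                {
                    if (content[i] == '\\') i++;          // skip the escaped character
                    else if (content[i] == '(') depth++;
                    else if (content[i] == ')') depth--;
                    i++;
                }
                int end = Math.Min(i, content.Length);
                pending.Add(content.Substring(start, end - start));
            }
            else if (c == '<')
            {
                // "<<" opens a dictionary, a single "<" opens a hex string.
                if (i + 1 < content.Length && content[i + 1] == '<') { i += 2; continue; }
                int end = content.IndexOf('>', i);
                if (end < 0) break;
                pending.Add(content.Substring(i, end - i + 1));
                i = end + 1;
            }
            else if (c == '/')
            {
                // Name operand such as /F1: skip it so it is not mistaken for an operator.
                i++;
                while (i < content.Length && !char.IsWhiteSpace(content[i])
                       && "()<>[]/%".IndexOf(content[i]) < 0) i++;
            }
            else if (char.IsLetter(c) || c == '\'' || c == '"')
            {
                // Operator: keep the pending strings only for text-showing operators.
                int start = i;
                while (i < content.Length && (char.IsLetterOrDigit(content[i])
                       || content[i] == '*' || content[i] == '\'' || content[i] == '"')) i++;
                string op = content.Substring(start, i - start);
                if (op == "Tj" || op == "TJ" || op == "'" || op == "\"")
                    shown.AddRange(pending);
                pending.Clear();
            }
            else
            {
                i++;   // whitespace, numbers, brackets
            }
        }
        return shown;
    }
}
```

The returned operands are still in the font's own encoding. With Identity-H fonts the hex strings hold 2-byte character codes, not Unicode text, so a further mapping step is needed before anything readable comes out.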

irogers

Post subject: Re: PDFSharp encoded still

Posted: Thu Mar 31, 2016 9:16 am

Joined: Thu Mar 31, 2016 8:04 am
Posts: 1

Hello,

I am also trying to extract text from a PDF with Identity-H encoding. From what I understand, I will need to use an embedded CMap to achieve this.

Could someone point me in the right direction on how to extract the embedded ToUnicode map with PDFsharp and use it to translate character codes?

Thanks,
Ian
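
For reference, a ToUnicode CMap is a small text stream, referenced from the font dictionary's /ToUnicode entry, whose beginbfchar and beginbfrange sections map character codes to UTF-16BE text. The sketch below is not a PDFsharp API. It assumes the CMap stream has already been read and unfiltered into a string, and that the font uses 2-byte codes as with Identity-H; ToUnicodeMap, Parse, and DecodeHexString are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of PDFsharp.
public static class ToUnicodeMap
{
    const string Hex = @"<([0-9A-Fa-f]+)>";

    // Builds a code -> text table from the bfchar and bfrange sections of a CMap.
    public static Dictionary<int, string> Parse(string cmap)
    {
        var map = new Dictionary<int, string>();

        // bfchar entries: <code> <utf16be-text>
        foreach (Match section in Regex.Matches(cmap, @"beginbfchar(.*?)endbfchar", RegexOptions.Singleline))
            foreach (Match m in Regex.Matches(section.Groups[1].Value, Hex + @"\s*" + Hex))
                map[ParseHex(m.Groups[1].Value)] = HexToText(m.Groups[2].Value);

        // bfrange entries: <first> <last> <start>  or  <first> <last> [<t1> <t2> ...]
        string rangePattern = Hex + @"\s*" + Hex + @"\s*(?:" + Hex + @"|\[(.*?)\])";
        foreach (Match section in Regex.Matches(cmap, @"beginbfrange(.*?)endbfrange", RegexOptions.Singleline))
            foreach (Match m in Regex.Matches(section.Groups[1].Value, rangePattern, RegexOptions.Singleline))
            {
                int first = ParseHex(m.Groups[1].Value);
                int last = ParseHex(m.Groups[2].Value);
                if (m.Groups[3].Success)
                {
                    // Consecutive codes: increment the last UTF-16 unit of the start text.
                    string start = HexToText(m.Groups[3].Value);
                    for (int code = first; code <= last; code++)
                        map[code] = start.Substring(0, start.Length - 1)
                                  + (char)(start[start.Length - 1] + (code - first));
                }
                else
                {
                    // Array form: one target text per code in the range.
                    MatchCollection targets = Regex.Matches(m.Groups[4].Value, Hex);
                    for (int k = 0; k < targets.Count && first + k <= last; k++)
                        map[first + k] = HexToText(targets[k].Groups[1].Value);
                }
            }
        return map;
    }

    // Decodes a hex string operand such as "<00030004>" using 2-byte codes.
    public static string DecodeHexString(string operand, Dictionary<int, string> map)
    {
        string hex = Regex.Replace(operand, @"[<>\s]", "");
        var sb = new StringBuilder();
        for (int i = 0; i + 4 <= hex.Length; i += 4)
        {
            string text;
            // Codes without a mapping become U+FFFD (replacement character).
            sb.Append(map.TryGetValue(ParseHex(hex.Substring(i, 4)), out text) ? text : "\uFFFD");
        }
        return sb.ToString();
    }

    static int ParseHex(string hex)
    {
        return int.Parse(hex, NumberStyles.HexNumber, CultureInfo.InvariantCulture);
    }

    // Interprets hex digits as UTF-16BE, left-padding to whole 16-bit units.
    static string HexToText(string hex)
    {
        hex = hex.PadLeft((hex.Length + 3) / 4 * 4, '0');
        var bytes = new byte[hex.Length / 2];
        for (int i = 0; i < bytes.Length; i++)
            bytes[i] = Convert.ToByte(hex.Substring(2 * i, 2), 16);
        return Encoding.BigEndianUnicode.GetString(bytes);
    }
}
```

With a CMap whose bfchar section contains `<0003> <0048>` and `<0004> <0069>`, calling DecodeHexString("<00030004>", map) yields "Hi". Fonts that carry no ToUnicode entry cannot be handled this way.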

rcast

Post subject: Re: PDFSharp encoded still

Posted: Thu Apr 14, 2016 1:46 pm

Joined: Thu Mar 10, 2016 4:16 pm
Posts: 5

Let me ask: if the PDF includes any non-standard fonts that are not a subset of the Windows fonts, will PDFsharp have issues parsing the content?

Thomas Hoevel

Post subject: Re: PDFSharp encoded still

Posted: Thu Apr 14, 2016 1:51 pm

PDFsharp Guru

Joined: Mon Oct 16, 2006 8:16 am
Posts: 3096
Location: Cologne, Germany

rcast wrote:

Let me ask: if the PDF includes any non-standard fonts that are not a subset of the Windows fonts, will PDFsharp have issues parsing the content?

No. Since PDFsharp does not parse the contents of the pages, there will be no issues parsing that.
If you want to extract text then do the parsing yourself or use a third-party library that does it.

_________________
Regards
Thomas Hoevel
PDFsharp Team
