PDFsharp & MigraDoc Foundation

PDFsharp - A .NET library for processing PDF & MigraDoc Foundation - Creating documents on the fly
All times are UTC

Posted: Mon Feb 10, 2020 5:15 pm

Joined: Mon Feb 10, 2020 4:38 pm
Posts: 3
I am not sure if this is the intended behavior or a defect.

When a PDF being parsed has an entry in the xRef table after array[0] that is invalid (in my case, all zeros), the method ReadXRefTableAndTrailer in Parser.cs throws an "invalid entry" exception, which prevents my customers from processing their uploaded PDF.

Looking at the xRef table for the document, I see an object address of 0000000000 00000 n at array[11]. (I assume the PDF came from a corrupted conversion of a Word document.)

This would seem to be invalid if 0000000000 is reserved for the document header at array[0].

Parsing an address of 0000000000 at any index after array[0] yields position 0, so the parser tries to read the document header as an object, which fails.

I changed the line marked with the comment "//skip start entry" (line 1081). That line checks whether the first iterator is 0 (id == 0) and, if so, skips parsing the header object.

To that line I added || id > 0 && position == 0, i.e.:

if (id == 0 || (id > 0 && position == 0)) continue;

In this case I simply toss the invalid xRef entry and let the rest of the logic rebuild the xRef table.
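
For illustration, the skip rule described above can be sketched outside PDFsharp. This is a minimal Python sketch under stated assumptions, not PDFsharp's code: `XRefEntry`, `should_skip` and `build_offsets` are hypothetical names, and the entries are assumed to have been read into (id, position, generation) records already. The sample values for objects 11 to 13 are taken from the table posted in this thread, counting from object 0.

```python
from typing import Dict, List, NamedTuple


class XRefEntry(NamedTuple):
    """One parsed cross-reference entry (hypothetical structure, not PDFsharp's)."""
    id: int          # object number
    position: int    # byte offset of the object in the file
    generation: int


def should_skip(entry: XRefEntry) -> bool:
    # Same condition as the modified line: skip the start entry (id == 0),
    # and skip any later entry whose position is 0. In a well-formed file,
    # offset 0 holds the "%PDF-" header line, not an object.
    return entry.id == 0 or (entry.id > 0 and entry.position == 0)


def build_offsets(entries: List[XRefEntry]) -> Dict[int, int]:
    """Map object number -> byte offset, dropping the entries that are skipped."""
    return {e.id: e.position for e in entries if not should_skip(e)}


entries = [
    XRefEntry(0, 0, 65535),    # start entry
    XRefEntry(11, 42530, 0),
    XRefEntry(12, 0, 0),       # the all-zero entry
    XRefEntry(13, 50986, 0),
]
print(build_offsets(entries))  # {11: 42530, 13: 50986}
```

Whether silently dropping such an entry is the right policy is exactly the open question of this post; the sketch only shows what the condition does.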

I have limited knowledge of the PDF standard, so I am not sure what the desired behavior is here: toss the invalid reference and rebuild the xRef table, or throw a fatal exception and notify the consumer.

It seems to me, however, that an object address of 0000000000 at any index after array[0] points nowhere, so why not simply toss it?

Or is there a reason an xRef address of 0000000000 at an index after array[0] could legitimately reference the document header?

iTextSharp handles this situation gracefully.

PDFsharp throws a fatal exception because of the issue described above.

So I am not sure if this is a defect, an oversight, or desired behavior.

Please advise.

I cannot attach the offending file because it contains confidential information, but its xRef table is below:

0 39
0000000000 65535 f
0000055206 00000 n
0000008162 00000 n
0000045431 00000 n
0000000022 00000 n
0000008142 00000 n
0000008276 00000 n
0000008489 00000 n
0000030529 00000 n
0000045395 00000 n
0000030550 00000 n
0000042530 00000 n
0000000000 00000 n <---- This one. Why not just toss this in Parser.cs ReadXRefTableAndTrailer()?
0000050986 00000 n
0000000000 00000 n
0000045574 00000 n
0000042552 00000 n
0000042605 00000 n
0000042659 00000 n
0000045374 00000 n
0000045524 00000 n
0000046606 00000 n
0000045953 00000 n
0000046586 00000 n
0000046854 00000 n
0000050965 00000 n
0000051818 00000 n
0000051290 00000 n
0000051798 00000 n
0000052071 00000 n
0000054917 00000 n
0000054938 00000 n
0000054965 00000 n
0000055040 00000 n
0000055083 00000 n
0000055102 00000 n
0000055125 00000 n
0000055167 00000 n
0000055186 00000 n
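
To make it easier to see which objects are affected, here is a minimal Python sketch that scans raw xref entry lines and reports in-use entries with a zero offset. It assumes the classic text format (10-digit offset, 5-digit generation, then `n` or `f`) and ignores trailing annotations; `zero_offset_objects` is a hypothetical helper, not part of any library. The excerpt is the first 15 entries of the table above, so object numbers count from 0.

```python
import re
from typing import List

# Classic xref entry: 10-digit offset, 5-digit generation, 'n' (in use) or 'f' (free).
ENTRY_RE = re.compile(r"^(\d{10}) (\d{5}) ([nf])")


def zero_offset_objects(first_id: int, lines: List[str]) -> List[int]:
    """Return object numbers of in-use entries whose byte offset is 0."""
    suspicious = []
    for index, line in enumerate(lines):
        match = ENTRY_RE.match(line.strip())
        if match is None:
            raise ValueError(f"not an xref entry: {line!r}")
        offset, _generation, kind = match.groups()
        if kind == "n" and int(offset) == 0:
            suspicious.append(first_id + index)
    return suspicious


# First 15 entries of the posted table (objects 0 through 14).
excerpt = [
    "0000000000 65535 f",
    "0000055206 00000 n",
    "0000008162 00000 n",
    "0000045431 00000 n",
    "0000000022 00000 n",
    "0000008142 00000 n",
    "0000008276 00000 n",
    "0000008489 00000 n",
    "0000030529 00000 n",
    "0000045395 00000 n",
    "0000030550 00000 n",
    "0000042530 00000 n",
    "0000000000 00000 n",
    "0000050986 00000 n",
    "0000000000 00000 n",
]
print(zero_offset_objects(0, excerpt))  # [12, 14]
```

Note that the table contains two such entries, not one: the flagged line and the one two lines below it.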

Posted: Mon Feb 10, 2020 5:27 pm
PDFsharp Guru

Joined: Mon Oct 16, 2006 8:16 am
Posts: 3023
Location: Cologne, Germany
PDFsharp was not designed to repair corrupt or non-standard PDF files.

Adobe Reader does a good job at fixing PDF files.
Does Adobe Reader prompt to save the file when you open it?
If not, try "Save as…" in Adobe Reader and check the XRef table again.

Thomas Hoevel
PDFsharp Team

Posted: Mon Feb 10, 2020 5:34 pm

Joined: Mon Feb 10, 2020 4:38 pm
Posts: 3
Adobe Reader opens the file just fine.

I did not try a "Save as" and re-open. I can check that and perhaps wrap some logic around it for cases like these in an attempt to recover.

I understand this is a case of "you cannot expect us to handle every possible corrupt file".

I am focused on my own situation and this one particular file. I have not researched what other corrupt xRef tables may look like.

But in this case it seems a no-brainer to me: if we have an address of 0000000000 at any array index other than [0], why bother? Just toss it and move on.

So that, I suppose, is really my question.

But thank you for the suggestion. I am going to try the "Save as" and then parse the file again with PDFsharp.

Posted: Mon Feb 10, 2020 5:41 pm

Joined: Mon Feb 10, 2020 4:38 pm
Posts: 3
Indeed, a "Save as" rebuilds the xRef table and resolves this invalid entry.

So that is a possible solution for us if we can automate that process.
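
One way such automation could be wired up is a parse-then-repair fallback. The following is only a minimal Python sketch of that control flow: `parse_with_repair`, `fake_parse` and `fake_repair` are hypothetical names, and the repair step is a stand-in for whatever tool ends up re-saving the file. No real PDF library is called here.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def parse_with_repair(path: str,
                      parse: Callable[[str], T],
                      repair: Callable[[str], str]) -> T:
    """Try to parse `path`; on failure, repair it once and parse the repaired copy."""
    try:
        return parse(path)
    except Exception:
        # Real code should catch the parser's specific exception type instead.
        repaired_path = repair(path)   # stand-in for the "Save as" step
        return parse(repaired_path)    # a second failure propagates to the caller


# Stubs that imitate a parser rejecting the original file but accepting the repaired one.
def fake_parse(path: str) -> str:
    if not path.endswith(".repaired.pdf"):
        raise ValueError("invalid xref entry")
    return f"parsed {path}"


def fake_repair(path: str) -> str:
    return path + ".repaired.pdf"


print(parse_with_repair("upload.pdf", fake_parse, fake_repair))
# parsed upload.pdf.repaired.pdf
```

Repairing only after a failed parse keeps well-formed uploads on the fast path and leaves the original file untouched.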

For the PDFsharp developers my question remains: if a position comes back as 0, why not just toss it?

Perhaps your answer is the answer: because we do not consider any logic to resolve corrupt files?
