PDFsharp & MigraDoc Foundation

PDFsharp - A .NET library for processing PDF & MigraDoc Foundation - Creating documents on the fly
All times are UTC


 Post subject: PDF XRef indexing issue
Posted: Tue Mar 13, 2018 11:10 am
Hi. I haven't been able to find an answer to my question.

Note the following XRef and trailer of a PDF document I have:
xref
1 31
0000000000 65535 f
0000000009 00000 n
0000127162 00000 n
0000127259 00000 n
0003029559 00000 n
0000127446 00000 n
0000530308 00000 n
0000530405 00000 n
0000530592 00000 n
0001027399 00000 n
0001027496 00000 n
0001027684 00000 n
0001534178 00000 n
0001534276 00000 n
0001534466 00000 n
0002043553 00000 n
0002043651 00000 n
0002043841 00000 n
0002544539 00000 n
0002544637 00000 n
0002544827 00000 n
0002642816 00000 n
0002642914 00000 n
0002643104 00000 n
0002913427 00000 n
0002913525 00000 n
0002913715 00000 n
0003029271 00000 n
0003029369 00000 n
0003029672 00000 n
0003029724 00000 n
trailer
<<
/Size 31
/Root 29 0 R
/Info 30 0 R
>>
startxref
3029821
%%EOF

The byte offsets are correct, and the object IDs at those positions are also correct. However, after a lot of debugging, it seems that because the XRef is 1-indexed, the /Root and /Info references resolve to their object ID + 1: the info is coming from object 31 0 R and the root is coming from 30 0 R. As a result, the document returns no pages, and I get a null exception in the GetKids method. Unfortunately, I can't send the PDF document as it contains sensitive information, and I'm not sure how a PDF like this is created.

Are there any suggestions on how to fix this?


Posted: Tue Mar 13, 2018 12:20 pm
empira Employee
Hi!
Mrcloc wrote:
Are there any suggestions on how to fix this?
Open with Adobe Reader, use "Save as" and check for differences.
Is the file corrupt or does PDFsharp handle it incorrectly? Do we have to fix the file or PDFsharp?

The data for "/Info 30 0 R" should come from an object starting with "30 0 obj" in the PDF file.
Your XRef table is 1-based, so object 30 should be at offset "0003029672 00000 n" if I'm not mistaken.

Adobe PDF Reference wrote:
The cross-reference table (comprising the original cross-reference section and all
update sections) must contain one entry for each object number from 0 to the
maximum object number used in the file, even if one or more of the object numbers
in this range do not actually occur in the file.
Is there another XRef table that supplies an entry for object 0? Probably not ...

_________________
Regards
Thomas Hoevel
PDFsharp Team


Posted: Tue Mar 13, 2018 12:52 pm
Hi Thomas. Thank you for the reply.

30 0 obj is at offset "0003029724 00000 n"

The file is fine. I think the documents are scanned this way by a mopier, but I don't see anything wrong with the file.

I have done a lot of debugging of the PDFsharp code, and I came to the conclusion that it is using the xref offset directly rather than accounting for the 1-indexing. I have illustrated this below:

Current mappings (xref)
Code:
2 0 R  -> 0000000009 00000 n
3 0 R  -> 0000127162 00000 n
4 0 R  -> 0000127259 00000 n
5 0 R  -> 0003029559 00000 n
.
.
.
29 0 R -> 0003029369 00000 n
30 0 R -> 0003029672 00000 n
31 0 R -> 0003029724 00000 n


But it should be
Code:
1 0 R  -> 0000000009 00000 n
2 0 R  -> 0000127162 00000 n
3 0 R  -> 0000127259 00000 n
4 0 R  -> 0003029559 00000 n
.
.
.
28 0 R -> 0003029369 00000 n
29 0 R -> 0003029672 00000 n
30 0 R -> 0003029724 00000 n


Because the first entry (declared as object 1) is really the zero (free) object.

So perhaps there is a hard-coded 0 in the PDFsharp library, which assumes the xref range to be 0-x? Or is it a problem with the PDF document?

If I change the xref to 0 30 then I get an error "Unexpected token 'n' in PDF stream. The file may be corrupted." If I change the Root to 28 0 R and the Info to 29 0 R I get the correct information in those fields, but the incorrect object IDs.
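
A minimal sketch of the numbering rule being discussed here: the subsection header gives the first object number and the entry count, and each following entry belongs to the next object number. The class and method names are hypothetical and this is not PDFsharp code; it only illustrates how the same entries map differently under "1 31" and "0 31".

```csharp
using System;
using System.Collections.Generic;

public static class XrefSubsectionSketch
{
    // Maps each in-use entry of one xref subsection to the object number it describes.
    // "header" is the subsection header line, e.g. "1 31" (first object number, entry count).
    // "entries" are the entry lines that follow, e.g. "0000000009 00000 n".
    public static Dictionary<int, long> MapInUseObjects(string header, IReadOnlyList<string> entries)
    {
        string[] parts = header.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        int first = int.Parse(parts[0]);
        int count = int.Parse(parts[1]);
        if (count != entries.Count)
            throw new FormatException("Entry count does not match the subsection header.");

        var offsets = new Dictionary<int, long>();
        for (int i = 0; i < count; i++)
        {
            string[] fields = entries[i].Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
            if (fields[2] == "n") // "n" marks an in-use entry, "f" a free entry
                offsets[first + i] = long.Parse(fields[0]);
        }
        return offsets;
    }
}
```

With the first three entries of the table above, the header "1 3" assigns offset 9 to object 2, while "0 3" assigns it to object 1 and leaves the free entry as object 0.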


Posted: Tue Mar 13, 2018 1:04 pm
empira Employee
Mrcloc wrote:
If I change the xref to 0 30 then I get an error "Unexpected token 'n' in PDF stream. The file may be corrupted."
"0 30" is wrong, "0 31" would fit sizewise, but other entries could be wrong.

_________________
Regards
Thomas Hoevel
PDFsharp Team


Posted: Tue Mar 13, 2018 1:14 pm
Hi. Thank you. I was just going to write that if I change it to 0 31, everything is fine.

Is this something which needs to be changed in PDFsharp?


Posted: Tue Mar 13, 2018 1:31 pm
empira Employee
Mrcloc wrote:
Is this something which needs to be changed in PDFsharp?
How can I know without a file to test it with?
Maybe yes, maybe no.
If it has to be changed in PDFsharp then it most likely is a simple fix.

The files come from a scanner - so it should be possible to scan a non-confidential test page.
Maybe we already have test files in one of the similar threads. I cannot search today, but will check this eventually.
Thanks for your feedback.

_________________
Regards
Thomas Hoevel
PDFsharp Team


Posted: Tue Mar 13, 2018 1:46 pm
I don't have a test file yet, but I am currently arranging one. How can I send it?


Posted: Tue Mar 13, 2018 2:40 pm
empira Employee
Mrcloc wrote:
How can I send it?
Zip it and upload it here if it is smaller than 250 kiB.
I can PM you an e-mail address if the file is larger.

_________________
Regards
Thomas Hoevel
PDFsharp Team


Posted: Tue Mar 13, 2018 3:45 pm
empira Employee
Mrcloc wrote:
Current mappings (xref)
Code:
2 0 R  -> 0000000009 00000 n
3 0 R  -> 0000127162 00000 n
4 0 R  -> 0000127259 00000 n
5 0 R  -> 0003029559 00000 n
.
.
.
29 0 R -> 0003029369 00000 n
30 0 R -> 0003029672 00000 n
31 0 R -> 0003029724 00000 n
As I understand it, this is how the PDF file must be interpreted. So the XRef table is meant to be "0 31", but it is declared as "1 31", indicating that there is no object #1.

I'm pretty sure the PDF file is faulty.
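
One way to check whether the file or the reader is at fault is to look at the bytes an XRef entry points to and read the object number from the "N G obj" header found there. The following is a rough sketch under that assumption; the names are hypothetical, the 32-byte window is an arbitrary choice, and it is not part of PDFsharp.

```csharp
using System;
using System.Text;

public static class ObjectHeaderSketch
{
    // Returns the object number N of the "N G obj" header starting at "offset",
    // or -1 if the bytes there do not look like an object header.
    public static int ReadObjectNumberAt(byte[] file, long offset)
    {
        if (offset < 0 || offset >= file.Length)
            return -1;

        // An object header is short; 32 bytes is an arbitrary but generous window.
        int length = (int)Math.Min(32, file.Length - offset);
        string text = Encoding.ASCII.GetString(file, (int)offset, length);

        string[] tokens = text.Split(new[] { ' ', '\r', '\n', '\t' }, StringSplitOptions.RemoveEmptyEntries);
        if (tokens.Length >= 3
            && int.TryParse(tokens[0], out int number)
            && int.TryParse(tokens[1], out _)
            && tokens[2].StartsWith("obj", StringComparison.Ordinal))
        {
            return number;
        }
        return -1;
    }
}
```

If the number read at an entry's offset is always one less than the number the table assigns to that entry, the table itself is shifted rather than the reader.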

_________________
Regards
Thomas Hoevel
PDFsharp Team


Posted: Wed Mar 14, 2018 2:11 pm
empira Employee
A PDF file with such a faulty XRef table can be found in this thread:
viewtopic.php?p=9953#p9953

Back then I tried in vain to understand what was going wrong.
As a result of the offset of the XRef table, PDFsharp was not using the objects I thought it was using, and therefore it was not seeing the pages I was seeing.
One mystery solved.
Now I can try to make PDFsharp show a meaningful error message - or maybe even open the file despite the corrupt XRef table.

Mrcloc wrote:
I was just going to write that if I change it to 0 31, everything is fine.
You saw what I didn't see. Thanks for your help.

_________________
Regards
Thomas Hoevel
PDFsharp Team


Posted: Thu Mar 15, 2018 10:05 am
Hi Thomas, thank you for the replies. I have finally managed to obtain a content-safe test document. Thanks for the work you put into this.

I did a quick and dirty workaround for this, which should work for me since the source of the documents is known. I don't know if it's the wisest thing to do, but I don't believe my code to be any worse off like this. Basically, if the XRef starts with 1, I modify that byte to be 0.

Code:
using System;
using System.IO;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
using PdfSharp.Pdf.IO;

public static PdfSharp.Pdf.PdfDocument OpenPdfDocument(byte[] file, PdfDocumentOpenMode openMode = PdfDocumentOpenMode.Modify)
{
   try
   {
      return PdfReader.Open(new MemoryStream(file), openMode);
   }
   catch (Exception ex)
   {
      // Every failure path below rethrows the original exception from PdfReader.Open
      try
      {
         string fileText = Encoding.Default.GetString(file.Where(x => x != 0).ToArray());
         if (!fileText.Contains("startxref"))
         {
            throw ex;
         }
         // Use the last "startxref": it is the one a reader starts from
         string startxrefContainer = fileText.Substring(fileText.LastIndexOf("startxref")); // Need to read to EOF because the number of bytes from this index to the end is not predictable
         long xrefAddress = 0;
         if (!long.TryParse(Regex.Match(startxrefContainer, "[0-9]+").ToString(), out xrefAddress))
         {
            throw ex;
         }
         string xrefContainer = "";
         for (long i = xrefAddress; i < xrefAddress + 20; i++) // Read the next 20 bytes - random choice that only needs to be big enough (and that 20 should actually be a variable :/)
         {
            xrefContainer += (Convert.ToChar(file[i])).ToString();
         }
         if (Regex.Match(xrefContainer, "[0-9]+ ").ToString() == "0 ") // Already starts at 0, so the failure has another cause
         {
            throw ex;
         }
         // Replace only the first number (the first object number of the subsection) with 0
         Regex regex = new Regex("[0-9]+ ");
         xrefContainer = regex.Replace(xrefContainer, "0 ", 1);
         // Write the patched bytes back; this only works if the replaced number had a single digit
         for (long i = xrefAddress; i < xrefAddress + 20; i++)
         {
            file[i] = (byte)xrefContainer[(int)(i - xrefAddress)];
         }
         return PdfReader.Open(new MemoryStream(file), openMode);
      }
      catch
      {
         throw ex;
      }
   }
}


I should read more about the PDF format, and maybe I can find information on that 1-indexed XRef table, but it hasn't been a quick search to find anything on that. I will do some better searching when I have a chance. For now, I hope the attached document helps.


Attachments:
test doc.zip [106.56 KiB]

Posted: Thu Mar 15, 2018 10:09 am
And the xrefContainer should actually be a byte[], just to take care of nulls (0).


Posted: Thu Mar 15, 2018 10:34 am
Here is a much better way. This is from where xrefContainer is declared (no need for it anymore) to the return statement.

Code:
for (long i = xrefAddress; i < xrefAddress + 20; i++)
{
   if (file[i] == '0') // If it's 0, there's another problem
   {
      throw ex;
   }
   if (file[i] >= '1' && file[i] <= '9') // Find the first numeric
   {
      if (file[i + 1] != ' ') // Only handle single digits
      {
         throw ex;
      }
      file[i] = (byte)'0'; // Set the start of the XRef to 0
      break;
   }
}
return PdfReader.Open(new MemoryStream(file), openMode);


Posted: Thu Mar 15, 2018 11:08 am
empira Employee
Mrcloc wrote:
I did a quick and dirty workaround for this
I haven't tried your file yet, but the PDF from the other thread can now be opened with the PDFsharp version that was published yesterday evening.
So most likely the workaround is no longer needed (unless you encounter any regressions with the latest version).

The new implementation performs a plausibility check on the XRef table, corrects the known off-by-one error, and throws a meaningful exception for other anomalies.
viewtopic.php?p=11491&f=2#p11491
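
As a rough illustration of what such a plausibility check could look like (this is an assumed sketch with hypothetical names, not the actual PDFsharp implementation): compare the object number the XRef table assigns to each in-use entry with the object number actually found at that entry's offset, accept a uniform shift of one, and reject anything else.

```csharp
using System;
using System.Collections.Generic;

public static class XrefPlausibilitySketch
{
    // Each pair holds, for one in-use entry, the object number the XRef table assigns to it (Key)
    // and the object number found in the "N G obj" header at that entry's offset (Value).
    // Returns the amount to subtract from every declared number: 0 = table is fine, 1 = known off-by-one.
    public static int DetectShift(IEnumerable<KeyValuePair<int, int>> declaredToActual)
    {
        int? shift = null;
        foreach (var pair in declaredToActual)
        {
            int current = pair.Key - pair.Value;
            if (shift == null)
                shift = current;
            else if (shift != current)
                throw new InvalidOperationException("XRef entries are inconsistent; a uniform shift cannot repair the table.");
        }

        if (shift == null || shift == 0)
            return 0;
        if (shift == 1)
            return 1;
        throw new InvalidOperationException("Unexpected XRef shift of " + shift + ".");
    }
}
```

For the table in this thread, every in-use entry would report a declared number one higher than the actual one, so the sketch returns 1.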

_________________
Regards
Thomas Hoevel
PDFsharp Team


Posted: Thu Mar 15, 2018 11:11 am
Thank you - I will have a look. It's quite important that I can manipulate these documents, so my error handling is already fine, but I needed a way to be able to use the Modify or Import PdfDocumentOpenModes.

