PDFsharp & MigraDoc Foundation
https://forum.pdfsharp.net/

PDF XRef indexing issue
https://forum.pdfsharp.net/viewtopic.php?f=2&t=3742

Author:  Mrcloc [ Tue Mar 13, 2018 11:10 am ]
Post subject:  PDF XRef indexing issue

Hi. I haven't been able to find an answer to my question.

Note the following XRef and trailer of a PDF document I have:
xref
1 31
0000000000 65535 f
0000000009 00000 n
0000127162 00000 n
0000127259 00000 n
0003029559 00000 n
0000127446 00000 n
0000530308 00000 n
0000530405 00000 n
0000530592 00000 n
0001027399 00000 n
0001027496 00000 n
0001027684 00000 n
0001534178 00000 n
0001534276 00000 n
0001534466 00000 n
0002043553 00000 n
0002043651 00000 n
0002043841 00000 n
0002544539 00000 n
0002544637 00000 n
0002544827 00000 n
0002642816 00000 n
0002642914 00000 n
0002643104 00000 n
0002913427 00000 n
0002913525 00000 n
0002913715 00000 n
0003029271 00000 n
0003029369 00000 n
0003029672 00000 n
0003029724 00000 n
trailer
<<
/Size 31
/Root 29 0 R
/Info 30 0 R
>>
startxref
3029821
%%EOF

The byte offsets are correct, and the object IDs at those positions are also correct. After a lot of debugging, however, it seems that because the XRef is 1-indexed, the /Root and /Info references resolve to an object ID one higher than the one they name: the info comes from object 31 0 R and the root comes from 30 0 R. As a result, the document returns no pages, and I get a null exception in the GetKids method. Unfortunately I can't send the PDF document as it contains sensitive information, and I'm not sure how a PDF like this is created.

Are there any suggestions on how to fix this?

Author:  Thomas Hoevel [ Tue Mar 13, 2018 12:20 pm ]
Post subject:  Re: PDF XRef indexing issue

Hi!
Mrcloc wrote:
Are there any suggestions on how to fix this?
Open with Adobe Reader, use "Save as" and check for differences.
Is the file corrupt or does PDFsharp handle it incorrectly? Do we have to fix the file or PDFsharp?

The data for "/Info 30 0 R" should come from an object starting with "30 0 obj" in the PDF file.
Your XRef table is 1-based, so object 30 should be at offset "0003029672 00000 n" if I'm not mistaken.

Adobe PDF Reference wrote:
The cross-reference table (comprising the original cross-reference section and all
update sections) must contain one entry for each object number from 0 to the
maximum object number used in the file, even if one or more of the object numbers
in this range do not actually occur in the file.
Is there another XRef table that supplies an entry for object 0? Probably not ...

Author:  Mrcloc [ Tue Mar 13, 2018 12:52 pm ]
Post subject:  Re: PDF XRef indexing issue

Hi Thomas. Thank you for the reply.

30 0 obj is at offset "0003029724 00000 n"

The file is fine - I think there is a mopier which scans the documents like this, but I don't see anything wrong with the file.

I have done a lot of debugging of the PDFSharp code, and I came to the conclusion that it uses the xref start number directly rather than accounting for the 1-indexing. I have illustrated this below:

Current mappings (xref)
Code:
2 0 R  -> 0000000009 00000 n
3 0 R  -> 0000127162 00000 n
4 0 R  -> 0000127259 00000 n
5 0 R  -> 0003029559 00000 n
.
.
.
29 0 R -> 0003029369 00000 n
30 0 R -> 0003029672 00000 n
31 0 R -> 0003029724 00000 n


But it should be
Code:
1 0 R  -> 0000000009 00000 n
2 0 R  -> 0000127162 00000 n
3 0 R  -> 0000127259 00000 n
4 0 R  -> 0003029559 00000 n
.
.
.
28 0 R -> 0003029369 00000 n
29 0 R -> 0003029672 00000 n
30 0 R -> 0003029724 00000 n


Because the first entry in the table is the zero (free) object.

So perhaps there is a hard-coded 0 in the PDFSharp library, which assumes the xref range to be 0-x? Or is it a problem with the PDF document?

If I change the xref to 0 30 then I get an error "Unexpected token 'n' in PDF stream. The file may be corrupted." If I change the Root to 28 0 R and the Info to 29 0 R I get the correct information in those fields, but the incorrect object IDs.
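
To make the two readings above concrete, here is a minimal C# sketch. It is not PDFsharp code; `XrefMapping` and `Map` are names made up for this illustration. It only assumes the rule from the PDF cross-reference table format that entry i of a subsection starting at object number N describes object N + i, and it uses the first four entries of the table quoted in this thread.

```csharp
using System;
using System.Collections.Generic;

public static class XrefMapping
{
    // Maps each entry of one xref subsection to its object number.
    // Entry i of a subsection that starts at firstObjectNumber
    // describes object firstObjectNumber + i.
    public static Dictionary<int, string> Map(int firstObjectNumber, IReadOnlyList<string> entries)
    {
        var map = new Dictionary<int, string>();
        for (int i = 0; i < entries.Count; i++)
        {
            map[firstObjectNumber + i] = entries[i];
        }
        return map;
    }

    public static void Main()
    {
        // First four entries of the table quoted in this thread.
        var entries = new[]
        {
            "0000000000 65535 f",
            "0000000009 00000 n",
            "0000127162 00000 n",
            "0000127259 00000 n",
        };

        // Compare the header as declared ("1 ...") with the intended one ("0 ...").
        foreach (int start in new[] { 1, 0 })
        {
            Console.WriteLine($"Subsection starting at {start}:");
            var map = Map(start, entries);
            for (int n = start; n < start + entries.Length; n++)
            {
                Console.WriteLine($"  {n} 0 R -> {map[n]}");
            }
        }
    }
}
```

With a start of 1, the free entry lands on object 1 and every real object shifts up by one; with a start of 0, object 1 maps to offset 9 as expected.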

Author:  Thomas Hoevel [ Tue Mar 13, 2018 1:04 pm ]
Post subject:  Re: PDF XRef indexing issue

Mrcloc wrote:
If I change the xref to 0 30 then I get an error "Unexpected token 'n' in PDF stream. The file may be corrupted."
"0 30" is wrong, "0 31" would fit sizewise, but other entries could be wrong.

Author:  Mrcloc [ Tue Mar 13, 2018 1:14 pm ]
Post subject:  Re: PDF XRef indexing issue

Hi. Thank you. I was just going to write that if I change it to 0 31, everything is fine.

Is this something which needs to be changed in PDFSharp?

Author:  Thomas Hoevel [ Tue Mar 13, 2018 1:31 pm ]
Post subject:  Re: PDF XRef indexing issue

Mrcloc wrote:
Is this something which needs to be changed in PDFsharp?
How can I know without a file to test it with?
Maybe yes, maybe no.
If it has to be changed in PDFsharp then it most likely is a simple fix.

The files come from a scanner - so it should be possible to scan a non-confidential test page.
Maybe we already have test files in one of the similar threads. I cannot search today, but will check this eventually.
Thanks for your feedback.

Author:  Mrcloc [ Tue Mar 13, 2018 1:46 pm ]
Post subject:  Re: PDF XRef indexing issue

I don't have a test file yet, but I am currently arranging one. How can I send it?

Author:  Thomas Hoevel [ Tue Mar 13, 2018 2:40 pm ]
Post subject:  Re: PDF XRef indexing issue

Mrcloc wrote:
How can I send it?
Zip it and upload it here if it is smaller than 250 kiB.
I can PM you an e-mail address if the file is larger.

Author:  Thomas Hoevel [ Tue Mar 13, 2018 3:45 pm ]
Post subject:  Re: PDF XRef indexing issue

Mrcloc wrote:
Current mappings (xref)
Code:
2 0 R  -> 0000000009 00000 n
3 0 R  -> 0000127162 00000 n
4 0 R  -> 0000127259 00000 n
5 0 R  -> 0003029559 00000 n
.
.
.
29 0 R -> 0003029369 00000 n
30 0 R -> 0003029672 00000 n
31 0 R -> 0003029724 00000 n
As I understand it, this is how the PDF file must be interpreted. So the XRef table is meant to be "0 31", but it is declared as "1 31", indicating that the first entry belongs to object #1.

I'm pretty sure the PDF file is faulty.

Author:  Thomas Hoevel [ Wed Mar 14, 2018 2:11 pm ]
Post subject:  Re: PDF XRef indexing issue

A PDF file with such a faulty XRef table can be found in this thread:
viewtopic.php?p=9953#p9953

Back then I tried in vain to understand what was going wrong.
Because of the offset in the XRef table, PDFsharp was not using the objects I thought it was using, and therefore it was not seeing the pages I was seeing.
One mystery solved.
Now I can try to make PDFsharp show a meaningful error message - or maybe even open the file despite the corrupt XRef table.

Mrcloc wrote:
I was just going to write that if I change it to 0 31, everything is fine.
You saw what I didn't see. Thanks for your help.

Author:  Mrcloc [ Thu Mar 15, 2018 10:05 am ]
Post subject:  Re: PDF XRef indexing issue

Hi Thomas, thank you for the replies. I have finally managed to obtain a content-safe test document. Thanks for the work you put into this.

I did a quick and dirty workaround for this, which should work for me since the source of the documents is known. I don't know if it's the wisest thing to do, but I don't believe my code to be any worse off like this. Basically, if the XRef starts with 1, I change that byte to 0.

Code:
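using System;
using System.IO;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
using PdfSharp.Pdf.IO;

public static PdfSharp.Pdf.PdfDocument OpenPdfDocument(byte[] file, PdfDocumentOpenMode openMode = PdfDocumentOpenMode.Modify)
{
   try
   {
      return PdfReader.Open(new MemoryStream(file), openMode);
   }
   catch (Exception ex)
   {
      // Opening failed: try to patch the XRef start number, and rethrow the original exception if that is not possible.
      try
      {
         // Decode the file as text (dropping null bytes) only to locate and read the startxref value.
         string fileText = Encoding.Default.GetString(file.Where(x => x != 0).ToArray());
         // Use the last "startxref": it is the one that counts if the file has incremental updates.
         int startxrefIndex = fileText.LastIndexOf("startxref");
         if (startxrefIndex < 0)
         {
            throw ex;
         }
         string startxrefContainer = fileText.Substring(startxrefIndex); // Need to read to EOF because the number of bytes from this index to the end is not predictable
         long xrefAddress = 0;
         if (!long.TryParse(Regex.Match(startxrefContainer, "[0-9]+").ToString(), out xrefAddress))
         {
            throw ex;
         }
         string xrefContainer = "";
         for (long i = xrefAddress; i < xrefAddress + 20; i++) // Read the next 20 bytes - random choice that only needs to be big enough (and that 20 should actually be a variable :/)
         {
            xrefContainer += (Convert.ToChar(file[i])).ToString();
         }
         // If the subsection already starts at 0, the failure has another cause.
         if (Regex.Match(xrefContainer, "[0-9]+ ").ToString() == "0 ")
         {
            throw ex;
         }
         // Replace the first number (the start of the XRef subsection) with 0 and write the bytes back.
         // If the number has more than one digit, the string gets shorter and the loop below throws,
         // which ends up rethrowing the original exception.
         Regex regex = new Regex("[0-9]+ ");
         xrefContainer = regex.Replace(xrefContainer, "0 ", 1);
         for (long i = xrefAddress; i < xrefAddress + 20; i++)
         {
            file[i] = (byte)xrefContainer[(int)(i - xrefAddress)];
         }
         return PdfReader.Open(new MemoryStream(file), openMode);
      }
      catch
      {
         // Any failure in the workaround surfaces as the original exception.
         throw ex;
      }
   }
}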


I should read more about the PDF format, and maybe I can find information on that 1-indexed XRef table, but a quick search hasn't turned up anything. I will do some better searching when I have a chance. For now, I hope the attached document helps.

Attachments:
test doc.zip [106.56 KiB]

Author:  Mrcloc [ Thu Mar 15, 2018 10:09 am ]
Post subject:  Re: PDF XRef indexing issue

And the xrefContainer should actually be a byte[], just to take care of nulls (0).

Author:  Mrcloc [ Thu Mar 15, 2018 10:34 am ]
Post subject:  Re: PDF XRef indexing issue

Here is a much better way. This replaces everything from the declaration of xrefContainer (which is no longer needed) down to the return statement.

Code:
for (long i = xrefAddress; i < xrefAddress + 20; i++)
{
   if (file[i] == '0') // If it's 0, there's another problem
   {
      throw ex;
   }
   if (file[i] > '0' && file[i] <= '9') // Find the first numeric
   {
      if (file[i + 1] != ' ') // Only handle single digits
      {
         throw ex;
      }
      file[i] = (byte)'0'; // Set the start of the XRef to 0
      break;
   }
}
return PdfReader.Open(new MemoryStream(file), openMode);
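
The same steps can also be written as a helper that works on the bytes only and never decodes the file into a string, which sidesteps the null-byte concern mentioned earlier. This is a sketch under assumptions, not tested against real scanner output: `XrefPatch` and `TryPatchXrefStart` are hypothetical names, the helper looks only at the last `startxref`, and it patches only a subsection header that starts with the single digit 1.

```csharp
using System;
using System.Text;

public static class XrefPatch
{
    private static bool IsWhitespace(byte b) => b == ' ' || b == '\r' || b == '\n' || b == '\t';

    // Returns the index of the last occurrence of pattern in data, or -1.
    private static int LastIndexOf(byte[] data, byte[] pattern)
    {
        for (int i = data.Length - pattern.Length; i >= 0; i--)
        {
            int j = 0;
            while (j < pattern.Length && data[i + j] == pattern[j]) j++;
            if (j == pattern.Length) return i;
        }
        return -1;
    }

    // Rewrites an xref subsection header such as "1 31" to "0 31" in place.
    // Returns true only if the header started with "1 " and was patched.
    public static bool TryPatchXrefStart(byte[] file)
    {
        byte[] keyword = Encoding.ASCII.GetBytes("startxref");
        int pos = LastIndexOf(file, keyword);
        if (pos < 0) return false;
        pos += keyword.Length;

        // Skip whitespace, then read the decimal byte offset of the xref table.
        while (pos < file.Length && IsWhitespace(file[pos])) pos++;
        long offset = 0;
        int digits = 0;
        while (pos < file.Length && file[pos] >= '0' && file[pos] <= '9')
        {
            offset = offset * 10 + (file[pos] - '0');
            if (offset > file.Length) return false; // offset points past the end of the file
            pos++;
            digits++;
        }
        if (digits == 0) return false;

        // The offset must point at the keyword "xref".
        int p = (int)offset;
        if (p + 4 > file.Length || Encoding.ASCII.GetString(file, p, 4) != "xref") return false;
        p += 4;
        while (p < file.Length && IsWhitespace(file[p])) p++;

        // Only handle the known case: a start number of exactly "1".
        if (p + 1 >= file.Length) return false;
        if (file[p] != '1' || file[p + 1] != ' ') return false;

        file[p] = (byte)'0';
        return true;
    }
}
```

A caller would try `PdfReader.Open` first and, only if that fails and `TryPatchXrefStart` returns true, open the patched bytes again. Returning a bool instead of throwing keeps the decision about the original exception in the caller.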

Author:  Thomas Hoevel [ Thu Mar 15, 2018 11:08 am ]
Post subject:  Re: PDF XRef indexing issue

Mrcloc wrote:
I did a quick and dirty workaround for this
I haven't tried your file yet, but the PDF from the other thread can now be opened with the PDFsharp version that was published yesterday evening.
So most likely the workaround is no longer needed (unless you encounter any regressions with the latest version).

The new implementation runs a plausibility check on the XRef table, corrects the known off-by-one error, and throws a meaningful exception for other anomalies.
viewtopic.php?p=11491&f=2#p11491
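
For readers wondering what such a plausibility check could look like, here is one possible C# sketch. It is not the PDFsharp implementation, which is not shown in this thread; `XrefPlausibility` and `CorrectedStart` are hypothetical names. It relies on the rule from the PDF specification that object 0 is always free and has generation number 65535, so a subsection declared to start at 1 whose first entry looks like that is most likely meant to start at 0.

```csharp
using System;

public static class XrefPlausibility
{
    // Returns the object number the subsection most plausibly starts at.
    // firstEntry is the first 3-field entry of the subsection, e.g. "0000000000 65535 f".
    public static int CorrectedStart(int declaredStart, string firstEntry)
    {
        string[] parts = firstEntry.Trim().Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

        // Heuristic: a free entry with generation 65535 is taken to be object 0.
        bool looksLikeObjectZero =
            parts.Length == 3 &&
            parts[1] == "65535" &&
            parts[2] == "f";

        if (declaredStart == 0 || !looksLikeObjectZero)
        {
            return declaredStart; // nothing suspicious
        }
        if (declaredStart == 1)
        {
            return 0; // the known off-by-one error
        }
        // Any other mismatch is reported rather than guessed at.
        throw new FormatException(
            "XRef subsection declared to start at " + declaredStart +
            ", but its first entry looks like object 0.");
    }
}
```

For the table in this thread, `CorrectedStart(1, "0000000000 65535 f")` yields 0, which matches the manual change from "1 31" to "0 31".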

Author:  Mrcloc [ Thu Mar 15, 2018 11:11 am ]
Post subject:  Re: PDF XRef indexing issue

Thank you - I will have a look. It's quite important that I can manipulate these documents, so my error handling is already fine, but I needed a way to be able to use the Modify or Import PdfDocumentOpenModes.

All times are UTC