PDFsharp & MigraDoc Foundation
https://forum.pdfsharp.net/

Fixing PdfSharp to not load all objects on opening a PDF
https://forum.pdfsharp.net/viewtopic.php?f=3&t=3488

Author:  Gerben Vos [ Sun Nov 06, 2016 2:59 pm ]
Post subject:  Fixing PdfSharp to not load all objects on opening a PDF

In our application, we use PdfSharp to open and read PDFs from all kinds of different sources. One major problem that has cropped up with many of these PDFs is that PdfSharp tries to read all objects in a PDF immediately when it opens one. We have found many PDFs that have objects in the xref table that don't actually exist, and the xref table entry points to the middle of some other object's data. Just opening these with PdfSharp gives an error. But Acrobat and other PDF viewers such as mupdf and GSview can open them without any problem.

Some example PDFs are: http://www.stillhq.com/pdfdb/000083/data.pdf and http://www.stillhq.com/pdfdb/000087/data.pdf .
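
For illustration, here is a minimal Python sketch (not PDFsharp code) of the kind of check that exposes such entries. It relies only on the PDF rule that an indirect object begins with "<number> <generation> obj". The function name and the miniature file body are made up for the example.

```python
import re

# An indirect object starts with "<number> <generation> obj".
OBJ_HEADER = re.compile(rb"\s*(\d+)\s+(\d+)\s+obj\b")

def entry_points_to_object(data: bytes, offset: int, number: int, generation: int = 0) -> bool:
    """Return True if `offset` is the start of indirect object `number generation`."""
    if offset < 0 or offset >= len(data):
        return False
    match = OBJ_HEADER.match(data, offset)
    return (
        bool(match)
        and int(match.group(1)) == number
        and int(match.group(2)) == generation
    )

# Hypothetical miniature file body: one real object and one bogus xref entry.
data = b"1 0 obj\n<< /Type /Catalog >>\nendobj\n"
xref = {1: 0, 2: 12}  # object 2's offset lands inside object 1's dictionary

for number, offset in xref.items():
    print(number, entry_points_to_object(data, offset, number))
```

A reader that runs this kind of check only when an object is actually requested never trips over entries like object 2 as long as nothing references them.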

As I already mentioned in another bug report, it is not clear to me why PdfSharp does this. PDF is designed to be easy to lazily load, so why not implement PdfSharp like that?
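
To make "lazy load" concrete, here is a rough Python sketch, not based on PDFsharp's internals. The xref table is kept as a map from object number to byte offset, and an object is parsed only the first time it is requested. `LazyObjectTable` and `parse_at` are hypothetical names, and the parser is a stand-in.

```python
class LazyObjectTable:
    """Parse an object only when it is first requested, then cache it."""

    def __init__(self, xref, parse_at):
        self._xref = dict(xref)    # object number -> byte offset
        self._parse_at = parse_at  # hypothetical callback: offset -> parsed object
        self._cache = {}

    def get(self, number):
        if number not in self._cache:
            self._cache[number] = self._parse_at(self._xref[number])
        return self._cache[number]


def parse_at(offset):
    # Stand-in parser: pretend offset 12 is a bogus xref entry.
    if offset == 12:
        raise ValueError(f"no object at offset {offset}")
    return {"offset": offset}


table = LazyObjectTable({1: 0, 2: 12}, parse_at)  # opening parses nothing
catalog = table.get(1)                            # only object 1 is parsed
```

With this shape, the bogus entry for object 2 only matters if something actually asks for object 2.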

My questions here are:
1. Have the developers already fixed this in a newer version?
2. Do you know why it is implemented this way? Is there a technical reason why lazy load could not be implemented in PdfSharp? Which obstacles would you expect if we tried this?
3. If we would fix/implement this (which could mean a lot of changes), would you apply our patches to a new released version (if you think they are okay)?

Author:  Thomas Hoevel [ Mon Nov 07, 2016 4:51 pm ]
Post subject:  Re: Fixing PdfSharp to not load all objects on opening a PDF

Hi!
Gerben Vos wrote:
1. Have the developers already fixed this in a newer version?
It is a feature, not a bug.
PDFsharp was developed to deal with intact PDF files. And now we have problems reading corrupt PDF files.
It would be a major overhaul to make PDFsharp compatible with most corrupt PDF files.

Gerben Vos wrote:
2. Do you know why it is implemented this way? Is there a technical reason why lazy load could not be implemented in PdfSharp? Which obstacles would you expect if we tried this?
a) It was developed and tested with clean and intact PDF files.
b) I don't think so. PDFsharp followed a different approach.
c) Lazy loading will lead to lazy exceptions. Many new problems may occur.

Gerben Vos wrote:
3. If we would fix/implement this (which could mean a lot of changes), would you apply our patches to a new released version (if you think they are okay)?
The hurdle will be convincing Stefan that the changes are OK.
Programs using PDFsharp may require many changes that deal with lazy exceptions.
You're proposing a breaking change with benefits and risks.

Author:  Gerben Vos [ Mon Nov 07, 2016 5:14 pm ]
Post subject:  Re: Fixing PdfSharp to not load all objects on opening a PDF

Thomas Hoevel wrote:
1. PDFsharp was developed to deal with intact PDF files. And now we have problems reading corrupt PDF files.
I am explicitly limiting this (at least, for now) to PDFs that Adobe Acrobat opens without complaint. (For most of these PDFs, the non-existing objects are also not referenced anywhere, so they really cannot cause any problem.) Therefore, many of our users, and even the writers of the software that created those PDFs, will see these PDFs as non-corrupt and it will be hard to explain to our users why our software cannot open them.

Thomas Hoevel wrote:
2c) Lazy loading will lead to lazy exceptions. Many new problems may occur.
If we decide to implement this, we will of course run it over our own test set and make sure that everything we encounter and that is fixable within PDFsharp is fixed. This should shake out the most important of these lazy exceptions.

Thomas Hoevel wrote:
Programs using PDFsharp may require many changes that deal with lazy exceptions.
You're proposing a breaking change with benefits and risks.
Yes, this is why I thought it was wise to ask you first. :)

Indeed, this may cause problems for programs using PDFsharp. One possible idea is to add this as an option: by default, read all objects, but allow it to be turned off if you require it. Then it remains a matter of how many PDFsharp users actually need this (and how many maintenance problems it creates).
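
A minimal Python sketch of what such an option could look like (the `lazy` flag, `open_objects`, and `parse_at` are invented for illustration and are not PDFsharp API). The default keeps today's behaviour, where a bad entry fails at open time. With the flag set, the same failure is deferred until the object is requested, which is the "lazy exception" trade-off discussed above.

```python
def open_objects(xref, parse_at, lazy=False):
    """Return a lookup function; parse everything up front unless lazy=True."""
    cache = {}

    def get(number):
        if number not in cache:
            cache[number] = parse_at(xref[number])
        return cache[number]

    if not lazy:
        for number in xref:  # default: eager, so a bad entry fails here
            get(number)
    return get


def parse_at(offset):
    # Stand-in parser: pretend offset 12 is a bogus xref entry.
    if offset == 12:
        raise ValueError(f"no object at offset {offset}")
    return {"offset": offset}


xref = {1: 0, 2: 12}

get = open_objects(xref, parse_at, lazy=True)  # opens despite the bad entry
print(get(1))

try:
    open_objects(xref, parse_at)  # default eager mode fails at open time
except ValueError as err:
    print("open failed:", err)
```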

However, if properly implemented, I think this could greatly improve PDFsharp's quality.

All times are UTC