PDFsharp & MigraDoc Foundation

PDFsharp - A .NET library for processing PDF & MigraDoc Foundation - Creating documents on the fly

All times are UTC




Posted: Sun Nov 06, 2016 2:59 pm

Joined: Tue Aug 02, 2016 9:56 am
Posts: 36
Location: Amsterdam, The Netherlands
In our application, we use PdfSharp to open and read PDFs from all kinds of different sources. One major problem that has cropped up with many of these PDFs is that PdfSharp tries to read all objects in a PDF immediately when it opens one. We have found many PDFs that have objects in the xref table that don't actually exist, and the xref table entry points to the middle of some other object's data. Just opening these with PdfSharp gives an error. But Acrobat and other PDF viewers such as mupdf and GSview can open them without any problem.
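
To make the failure mode above concrete, here is a minimal Python sketch (not PDFsharp code). It assumes a classic cross-reference table that has already been parsed into a mapping from object number to byte offset; `entry_is_valid`, `OBJ_HEADER`, and the sample bytes are hypothetical names and data made up for illustration. The check simply asks whether the bytes at an offset begin with the `N G obj` header that the entry promises.

```python
import re

# Matches the "<num> <gen> obj" header that begins every indirect object.
OBJ_HEADER = re.compile(rb"\s*(\d+)\s+(\d+)\s+obj\b")

def entry_is_valid(data: bytes, obj_num: int, offset: int) -> bool:
    """Return True if an indirect object numbered obj_num starts at offset."""
    if offset < 0 or offset >= len(data):
        return False
    m = OBJ_HEADER.match(data, offset)
    return m is not None and int(m.group(1)) == obj_num

# Tiny hand-made fragment: object 1 is real, while the entry for object 2
# points into the middle of object 1's data.
data = b"1 0 obj\n<< /Type /Catalog >>\nendobj\n"
xref = {1: 0, 2: 12}

# Collect the object numbers whose xref entries do not lead to a real object.
dangling = [n for n, off in xref.items() if not entry_is_valid(data, n, off)]
```

A tolerant reader could treat the entries in `dangling` as free (nonexistent) objects instead of failing the whole open; whether that is appropriate for a given file is a judgment call.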

Some example PDFs are: http://www.stillhq.com/pdfdb/000083/data.pdf and http://www.stillhq.com/pdfdb/000087/data.pdf .

As I already mentioned in another bug report, it is not clear to me why PdfSharp does this. PDF is designed to be easy to lazily load, so why not implement PdfSharp like that?

My questions here are:
1. Have the developers already fixed this in a newer version?
2. Do you know why it is implemented this way? Is there a technical reason why lazy load could not be implemented in PdfSharp? Which obstacles would you expect if we tried this?
3. If we were to fix or implement this (which could mean a lot of changes), would you apply our patches to a newly released version (if you think they are okay)?

_________________
Gerben Vos
Developer, ZyLAB Technologies B.V.


Posted: Mon Nov 07, 2016 4:51 pm
empira Employee

Joined: Mon Oct 16, 2006 8:16 am
Posts: 2720
Location: Cologne, Germany
Hi!
Gerben Vos wrote:
1. Have the developers already fixed this in a newer version?
It is a feature, not a bug.
PDFsharp was developed to deal with intact PDF files. And now we have problems reading corrupt PDF files.
It would take a major overhaul to make PDFsharp compatible with most corrupt PDF files.

Gerben Vos wrote:
2. Do you know why it is implemented this way? Is there a technical reason why lazy load could not be implemented in PdfSharp? Which obstacles would you expect if we tried this?
a) It was developed and tested with clean and intact PDF files.
b) I don't think so. PDFsharp followed a different approach.
c) Lazy loading will lead to lazy exceptions. Many new problems may occur.
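
The "lazy exceptions" point in c) can be illustrated with a small Python sketch. This is not how PDFsharp is implemented; `LazyObjectTable`, `parse_object`, and `open_eager` are hypothetical stand-ins, and the "parser" only checks for `<<` and `>>` delimiters. The sketch shows where the same malformed object raises an error under each strategy.

```python
def parse_object(text):
    """Stand-in parser: accepts only text wrapped in '<<' and '>>'."""
    if not (text.startswith("<<") and text.endswith(">>")):
        raise ValueError(f"malformed object: {text!r}")
    return text[2:-2].strip()

def open_eager(raw_objects):
    """Parse everything up front, so a bad object fails at open time."""
    return {num: parse_object(text) for num, text in raw_objects.items()}

class LazyObjectTable:
    """Parses an object only on first access (hypothetical design)."""

    def __init__(self, raw_objects):
        self._raw = raw_objects   # object number -> unparsed text
        self._parsed = {}         # cache of objects parsed so far

    def get(self, num):
        if num not in self._parsed:
            self._parsed[num] = parse_object(self._raw[num])
        return self._parsed[num]

raw = {1: "<< /Type /Catalog >>", 2: "garbage"}

try:
    open_eager(raw)
    eager_failed_at_open = False
except ValueError:
    eager_failed_at_open = True   # the whole document is rejected

table = LazyObjectTable(raw)      # opening succeeds
catalog = table.get(1)            # fine: object 2 is never touched
try:
    table.get(2)                  # the error surfaces only here
    lazy_failed_on_access = False
except ValueError:
    lazy_failed_on_access = True
```

With the lazy table, any call site that reads an object can now raise, which is the extra burden on calling code that the reply warns about.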

Gerben Vos wrote:
3. If we were to fix or implement this (which could mean a lot of changes), would you apply our patches to a newly released version (if you think they are okay)?
The hurdle will be convincing Stefan that the changes are OK.
Programs using PDFsharp may require many changes that deal with lazy exceptions.
You're proposing a breaking change with benefits and risks.

_________________
Regards
Thomas Hoevel
PDFsharp Team


Posted: Mon Nov 07, 2016 5:14 pm

Joined: Tue Aug 02, 2016 9:56 am
Posts: 36
Location: Amsterdam, The Netherlands
Thomas Hoevel wrote:
1. PDFsharp was developed to deal with intact PDF files. And now we have problems reading corrupt PDF files.
I am explicitly limiting this (at least, for now) to PDFs that Adobe Acrobat opens without complaint. (For most of these PDFs, the non-existing objects are also not referenced anywhere, so they really cannot cause any problem.) Therefore, many of our users, and even the writers of the software that created those PDFs, will see these PDFs as non-corrupt and it will be hard to explain to our users why our software cannot open them.

Thomas Hoevel wrote:
2c) Lazy loading will lead to lazy exceptions. Many new problems may occur.
If we decide to implement this, we will of course run it over our own test set and make sure that everything we encounter that is fixable within PDFsharp gets fixed. This should shake out the most important of these problems.

Thomas Hoevel wrote:
Programs using PDFsharp may require many changes that deal with lazy exceptions.
You're proposing a breaking change with benefits and risks.
Yes, this is why I thought it was wise to ask you first. :)

Indeed, this may cause problems for programs using PDFsharp. One possible idea is to add this as an option: by default, read all objects, but allow it to be turned off if you require it. Then it remains a matter of how many PDFsharp users actually need this (and how many maintenance problems it creates).
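
As a rough Python sketch of the opt-in idea (again, not PDFsharp's API): `open_document`, its `lazy` flag, `Document`, and `parse_object` are all hypothetical names, and the parser is a trivial stand-in. The default keeps today's behaviour of reading every object at open time, so existing callers would see no change unless they ask for lazy loading.

```python
from dataclasses import dataclass, field

def parse_object(text):
    """Stand-in parser: rejects anything not wrapped in '<<' and '>>'."""
    if not (text.startswith("<<") and text.endswith(">>")):
        raise ValueError(f"malformed object: {text!r}")
    return text[2:-2].strip()

@dataclass
class Document:
    raw: dict                                   # object number -> unparsed text
    parsed: dict = field(default_factory=dict)  # objects parsed so far

    def get(self, num):
        # Parse on demand if the object was not already read at open time.
        if num not in self.parsed:
            self.parsed[num] = parse_object(self.raw[num])
        return self.parsed[num]

def open_document(raw, lazy=False):
    """Open a document; by default every object is read immediately."""
    doc = Document(raw)
    if not lazy:
        for num in raw:
            doc.get(num)    # may raise here, exactly as an eager reader would
    return doc

raw = {1: "<< /Type /Catalog >>", 2: "not an object"}

try:
    open_document(raw)                  # default: strict, fails at open
    default_open_ok = True
except ValueError:
    default_open_ok = False

doc = open_document(raw, lazy=True)     # opt-in: unused bad object is ignored
catalog = doc.get(1)
```

Keeping the strict path as the default limits the breaking-change risk to callers who explicitly opt in.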

However, if properly implemented, I think this could greatly improve PDFsharp's quality.

_________________
Gerben Vos
Developer, ZyLAB Technologies B.V.

