PDFsharp & MigraDoc Foundation

PDFsharp - A .NET library for processing PDF & MigraDoc Foundation - Creating documents on the fly

All times are UTC


Posted: Fri Jan 29, 2016 10:04 am

Joined: Fri Jan 29, 2016 9:59 am
Posts: 5
Hello

We are using PDFsharp to extract single pages out of a larger document. It seems like the single page retains some streams (or other objects) that were used in the larger document. The PDF containing the single page has approximately the same size as the original PDF (which contains ~1400 pages, each with some text and one image that is different on every page).

So for example the original PDF has 1400 pages and is 11 MB in size.
A single, extracted page is still 11 MB in size.

If we use an optimizer, we can bring down the file size to about 120 KB or so.

Is there any way to clean up an extracted page to prevent this?

edit: For the record, the problem exists with both release and debug builds.
edit2: Upon further research, the problem seems to be that all images (from all pages) are preserved. The outputdoc.Internals.AllObjects dictionary contains exactly as many entries that seem to be images as there are pages:
Code:
[38] = "dictionary(id=(39 0),[9])=key=9:(/Type /Subtype /Name /Width /Height /BitsPerComponent /ColorSpace /Length /Filter)"
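
A count like the one described above could be taken with a few lines of code. This is only a minimal sketch, assuming PDFsharp 1.x, where `Internals.GetAllObjects()` returns every indirect object of the document; the class name and the command-line argument are hypothetical and not taken from the poster's project.

```csharp
using System;
using System.Linq;
using PdfSharp.Pdf;
using PdfSharp.Pdf.IO;

class CountImages
{
    static void Main(string[] args)
    {
        // args[0]: path to the extracted single-page PDF (hypothetical input).
        using (PdfDocument doc = PdfReader.Open(args[0], PdfDocumentOpenMode.ReadOnly))
        {
            // All indirect objects PDFsharp holds for this document.
            PdfObject[] all = doc.Internals.GetAllObjects();

            // Image XObjects are dictionaries whose /Subtype entry is /Image.
            int images = all.OfType<PdfDictionary>()
                            .Count(d => d.Elements.GetName("/Subtype") == "/Image");

            Console.WriteLine("pages:  " + doc.PageCount);
            Console.WriteLine("images: " + images);
        }
    }
}
```

If the image count is far larger than the page count of the extracted file, the extra images are the likely cause of the file size.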


Posted: Mon Feb 01, 2016 4:33 pm
PDFsharp Guru

Joined: Mon Oct 16, 2006 8:16 am
Posts: 3109
Location: Cologne, Germany
Hi!
tcc wrote:
We are using PDFsharp to extract single pages out of a larger document.
You don't show any code. Doing the extraction in a different way may lead to smaller files.

AFAIK PDFsharp removes unreachable objects from the PDF file - at least when opening the file.
So it may lead to smaller files if you open the file with the extracted page for modification and save it again (maybe using a different name to see if there is a difference).

The code to remove unused objects is there, but maybe it is not invoked in all scenarios.
A VS solution that allows us to replicate the problem would help.
viewtopic.php?f=2&t=832
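
Since the thread does not show the extraction code, here is a minimal sketch of the usual way to copy one page into a new document with PDFsharp 1.x. It is an assumption about what the extraction might look like, not the original poster's code; the class name and the three command-line arguments are hypothetical.

```csharp
using PdfSharp.Pdf;
using PdfSharp.Pdf.IO;

class ExtractPage
{
    static void Main(string[] args)
    {
        string inputPath = args[0];          // source document
        int pageIndex = int.Parse(args[1]);  // zero-based index of the page to extract
        string outputPath = args[2];         // single-page result

        // Import mode is required when pages are copied into another document.
        using (PdfDocument input = PdfReader.Open(inputPath, PdfDocumentOpenMode.Import))
        using (PdfDocument output = new PdfDocument())
        {
            output.Version = input.Version;

            // AddPage copies the page together with the objects it references.
            output.AddPage(input.Pages[pageIndex]);
            output.Save(outputPath);
        }
    }
}
```

Comparing the size of the file written here with the size of the source file is a quick way to reproduce the behavior discussed in this thread.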

_________________
Regards
Thomas Hoevel
PDFsharp Team


Posted: Tue Feb 02, 2016 6:48 am

Joined: Fri Jan 29, 2016 9:59 am
Posts: 5
Hi Thomas

I've created a sample application that shows the problem. The original file "Lohnausweise.pdf" (don't worry, these are just test cases, no real data :wink: ) is about 1.2 MB in size. Each extracted page results in a file of roughly the same size (1.1 MB).

Resaving the file using the following code:

Code:
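// Requires: using PdfSharp.Pdf.IO;
// Reopen the extracted single-page file (tmp) and save it under a new path.
using (var pdfInputDoc = PdfReader.Open(tmp, PdfDocumentOpenMode.Modify))
{
    pdfInputDoc.Save(outputPath);
}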


does not help either.

Really appreciate your help, thanks.

The file is too large for the forums, so here is the link to my dropbox: https://www.dropbox.com/s/3vkpb84h4n8vr ... t.zip?dl=0


Posted: Wed Feb 24, 2016 1:17 pm

Joined: Fri Jan 29, 2016 9:59 am
Posts: 5
Hi Thomas

Did you have time to look at this issue?


Posted: Wed Feb 24, 2016 1:46 pm
PDFsharp Guru

Joined: Mon Oct 16, 2006 8:16 am
Posts: 3109
Location: Cologne, Germany
Hi!

Sorry for the late response, I must have missed your post.

There are many fonts in the PDF file. Each page only has a few hundred bytes for its individual text, while all pages share about 1 MB of embedded fonts.
Therefore each extracted page is about 1 MB in size.

Attachment: forum_16-02-24.png [ 27.23 KiB ]

_________________
Regards
Thomas Hoevel
PDFsharp Team


Posted: Wed Feb 24, 2016 2:23 pm

Joined: Fri Jan 29, 2016 9:59 am
Posts: 5
Are you sure about this?

We see the same problem with a much larger document (about 600 pages of tax accounting forms) that uses the same fonts. However, each extracted page is now about 10 MB rather than 1 MB (with the original file being close to 11 MB in size).

Sorry to nag, but I'm pretty sure it's not the fonts.


Posted: Wed Feb 24, 2016 3:45 pm
PDFsharp Guru

Joined: Mon Oct 16, 2006 8:16 am
Posts: 3109
Location: Cologne, Germany
Well, I was wrong.

The PDF with a single page contains all the fonts and all the barcode images. The fonts make up the majority of the file you gave me.

The problem is that the master PDF file contains a single list with all the images, and each page refers to the complete list. Therefore PDFsharp includes all images.

If the PDF file had a different structure, listing only the required images for each page, then PDFsharp would create smaller files with only the required images.
PDFsharp does not analyze which of the images listed as "/Resources" are actually used on that page. Such an analysis is not trivial.

So you will need a PDF compressor that makes this analysis and removes the orphaned images.
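
To illustrate what such an analysis involves, here is a rough sketch that scans the page's content stream for `Do` operators and removes every `/XObject` entry that is never drawn. It assumes the PDFsharp 1.x content API (`ContentReader`, `COperator`, `CName`) and an already extracted single-page file. The class name, the helper `CollectUsedNames`, and the command-line arguments are hypothetical. It does not look inside form XObjects, does not touch fonts, and has not been verified against the files from this thread, so it is no replacement for a real PDF optimizer.

```csharp
using System.Collections.Generic;
using System.Linq;
using PdfSharp.Pdf;
using PdfSharp.Pdf.Content;
using PdfSharp.Pdf.Content.Objects;
using PdfSharp.Pdf.IO;

class PruneUnusedXObjects
{
    // Hypothetical helper: collects the names passed to the "Do" operator (e.g. "/Im7").
    static void CollectUsedNames(CObject obj, HashSet<string> used)
    {
        var seq = obj as CSequence;
        if (seq != null)
        {
            foreach (CObject child in seq)
                CollectUsedNames(child, used);
            return;
        }

        var op = obj as COperator;
        if (op != null && op.OpCode.OpCodeName == OpCodeName.Do && op.Operands.Count > 0)
        {
            var name = op.Operands[0] as CName;
            if (name != null)
                used.Add(name.Name);
        }
    }

    static void Main(string[] args)
    {
        // args[0]: extracted single-page PDF, args[1]: output path (both hypothetical).
        using (PdfDocument doc = PdfReader.Open(args[0], PdfDocumentOpenMode.Modify))
        {
            PdfPage page = doc.Pages[0];

            // Names of all XObjects that the page content actually draws.
            var used = new HashSet<string>();
            CollectUsedNames(ContentReader.ReadContent(page), used);

            PdfDictionary resources = page.Elements.GetDictionary("/Resources");
            PdfDictionary xobjects = resources == null
                ? null
                : resources.Elements.GetDictionary("/XObject");

            if (xobjects != null)
            {
                // Copy the keys first because the dictionary is modified in the loop.
                foreach (string key in xobjects.Elements.Keys.ToList())
                {
                    if (!used.Contains(key))
                        xobjects.Elements.Remove(key);
                }
            }

            doc.Save(args[1]);
        }
    }
}
```

Whether the orphaned image streams are dropped on this first save is not confirmed here. If the file does not shrink, opening the result for modification and saving it once more may trigger the removal of unreachable objects mentioned earlier in the thread.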

_________________
Regards
Thomas Hoevel
PDFsharp Team


Posted: Wed Feb 24, 2016 3:54 pm

Joined: Fri Jan 29, 2016 9:59 am
Posts: 5
OK, so PDFsharp does not currently do that, and support for this (I guess not so widely used) feature is not something you are actively working on?

