PDFsharp & MigraDoc Foundation • View topic - PDF file size when extracting pages

PDF file size when extracting pages

Moderator: Stefan Lange

tcc

Post subject: PDF file size when extracting pages

Posted: Fri Jan 29, 2016 10:04 am

Joined: Fri Jan 29, 2016 9:59 am
Posts: 5

Hello

We are using PDFsharp to extract single pages out of a larger document. It seems that the single-page file retains some streams (or other objects) that were used in the larger document: the PDF containing the single page has approximately the same size as the original PDF. The original contains ~1400 pages, each with one image (a different image on each page) and some text.
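
The thread does not show the extraction code. A minimal sketch of one common way to extract a page with PDFsharp, assuming the 1.x API (`PdfReader.Open` with `PdfDocumentOpenMode.Import`, then `PdfDocument.AddPage`). The class name, method name, and parameters are hypothetical and not taken from the poster's application:

```csharp
using System;
using PdfSharp.Pdf;
using PdfSharp.Pdf.IO;

static class PageExtractor
{
    // Copies one page (zero-based index) from inputPath into a new one-page PDF.
    public static void ExtractPage(string inputPath, int pageIndex, string outputPath)
    {
        // Import mode is needed for pages that will be added to another document.
        using (PdfDocument input = PdfReader.Open(inputPath, PdfDocumentOpenMode.Import))
        using (PdfDocument output = new PdfDocument())
        {
            if (pageIndex < 0 || pageIndex >= input.PageCount)
                throw new ArgumentOutOfRangeException(nameof(pageIndex));

            // AddPage imports the page together with the objects it references,
            // which includes every entry of its /Resources dictionary.
            output.AddPage(input.Pages[pageIndex]);
            output.Save(outputPath);
        }
    }
}
```

With this pattern the output can only be as small as the set of objects the page refers to, whether or not the page content actually uses them.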

So for example the original PDF has 1400 pages and is 11 MB in size.
A single, extracted page is still 11 MB in size.

If we use an optimizer, we can bring down the file size to about 120 KB or so.

Is there any way to clean up an extracted page to prevent this?

edit: for the record, the problem exists both with release and debug builds
edit2: upon further research, the problem seems to be that all images (from all pages) are preserved: outputdoc.Internals.AllObjects contains exactly as many entries that look like images as there are pages. One such entry:

Code:

[38] = "dictionary(id=(39 0),[9])=key=9:(/Type /Subtype /Name /Width /Height /BitsPerComponent /ColorSpace /Length /Filter)"
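
A minimal sketch of how such a count could be obtained, assuming PDFsharp 1.x. It uses `Internals.GetAllObjects()`, which is assumed here to be the method counterpart of the `AllObjects` property mentioned above, and treats every dictionary with `/Subtype /Image` as an image. `ImageCounter` and `CountImageObjects` are hypothetical names:

```csharp
using System.Linq;
using PdfSharp.Pdf;
using PdfSharp.Pdf.IO;

static class ImageCounter
{
    // Counts the objects in a PDF file whose dictionary has /Subtype /Image.
    public static int CountImageObjects(string path)
    {
        using (PdfDocument doc = PdfReader.Open(path, PdfDocumentOpenMode.ReadOnly))
        {
            return doc.Internals.GetAllObjects()
                .OfType<PdfDictionary>()
                // GetString returns an empty string when the key is absent.
                .Count(d => d.Elements.GetString("/Subtype") == "/Image");
        }
    }
}
```

Comparing the count for the original file with the count for an extracted page shows whether the images of the other pages were carried along.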

Thomas Hoevel

Posted: Mon Feb 01, 2016 4:33 pm

PDFsharp Guru

Joined: Mon Oct 16, 2006 8:16 am
Posts: 3109
Location: Cologne, Germany

Hi!

tcc wrote:

We are using PDFsharp to extract single pages out of a larger document.

You don't show any code. Doing the extraction in a different way may lead to smaller files.

AFAIK PDFsharp removes unreachable objects from the PDF file - at least when opening the file.
So it may lead to smaller files if you open the file with the extracted page for modification and save it again (maybe using a different name to see if there is a difference).

The code to remove unused objects is there, but maybe it is not invoked in all scenarios.
A VS solution that allows us to replicate the problem would help.
viewtopic.php?f=2&t=832

_________________
Regards
Thomas Hoevel
PDFsharp Team

tcc

Posted: Tue Feb 02, 2016 6:48 am

Hi Thomas

I've created a sample application that shows the problem. The original file "Lohnausweise.pdf" (don't worry, these are just test cases, no real data) is about 1.2 MB in size. Each extracted page results in a file of roughly the same size (1.1 MB).

Resaving the file using the following code:

Code:

using PdfSharp.Pdf.IO;

// Reopen the file for modification and save it again under outputPath.
using (var pdfInputDoc = PdfReader.Open(tmp, PdfDocumentOpenMode.Modify))
{
    pdfInputDoc.Save(outputPath);
}

does not help either.

Really appreciate your help, thanks.

The file is too large for the forums, so here is the link to my dropbox: https://www.dropbox.com/s/3vkpb84h4n8vr ... t.zip?dl=0

tcc

Posted: Wed Feb 24, 2016 1:17 pm

Hi Thomas

Did you have time to look at this issue?

Thomas Hoevel

Posted: Wed Feb 24, 2016 1:46 pm

Hi!

Sorry for the late response, I must have missed your post.

There are many fonts in the PDF file. Each page only has a few hundred bytes for the individual text, while all pages share about 1 MB of embedded fonts.
Therefore each extracted page is about 1 MB in size.

Attachment: forum_16-02-24.png

_________________
Regards
Thomas Hoevel
PDFsharp Team

tcc

Posted: Wed Feb 24, 2016 2:23 pm

Are you sure about this?

We see the same problem with a much larger (read: about 600 pages of tax accounting forms) document, with the same fonts. However, each page is now not only 1 MB but about 10 MB (with the original file being close to 11 MB in size).

Sorry to nag, but I'm pretty sure it's not the fonts.

Thomas Hoevel

Posted: Wed Feb 24, 2016 3:45 pm

Well, I was wrong.

The PDF with a single page contains all the fonts and all the barcode images. The fonts make up the majority of the file you gave me.

The problem is that the master PDF file contains a single list with all the images, and each page refers to the complete list. Therefore PDFsharp includes all images.

If the PDF file had a different structure, listing only the required images for each page, then PDFsharp would create smaller files with only the required images.
PDFsharp does not analyze which of the images listed under "/Resources" are actually used on that page. Such an analysis is not trivial.

So you will need a PDF compressor that makes this analysis and removes the orphaned images.
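
For illustration only, a rough sketch of what such an analysis could look like for the simplest case, assuming PDFsharp 1.x and its `ContentReader` class. It collects the names passed to the `Do` operator in the page's own content stream and removes every other entry from the page's `/XObject` resource dictionary. `XObjectPruner` and `PruneUnusedXObjects` are hypothetical names and this is not something PDFsharp does itself. The sketch ignores form XObjects that rely on the page's resources, so it is not a general solution:

```csharp
using System.Collections.Generic;
using System.Linq;
using PdfSharp.Pdf;
using PdfSharp.Pdf.Content;
using PdfSharp.Pdf.Content.Objects;

static class XObjectPruner
{
    // Removes /XObject resource entries that no "Do" operator in the page's
    // own content stream refers to. Returns the number of removed entries.
    public static int PruneUnusedXObjects(PdfPage page)
    {
        PdfDictionary resources = page.Elements.GetDictionary("/Resources");
        PdfDictionary xObjects = resources?.Elements.GetDictionary("/XObject");
        if (xObjects == null)
            return 0;

        var used = new HashSet<string>();
        CollectDoNames(ContentReader.ReadContent(page), used);

        // Copy the unused keys first, because the dictionary is modified below.
        // Assumption: CName.Name and the dictionary keys both carry the leading '/'.
        List<string> unused = xObjects.Elements.Keys.Where(k => !used.Contains(k)).ToList();
        foreach (string key in unused)
            xObjects.Elements.Remove(key);
        return unused.Count;
    }

    // Walks the parsed content and records the name operand of every "Do".
    private static void CollectDoNames(CSequence sequence, HashSet<string> used)
    {
        foreach (CObject item in sequence)
        {
            if (item is CSequence nested)
                CollectDoNames(nested, used);
            else if (item is COperator op && op.OpCode.OpCodeName == OpCodeName.Do)
                foreach (CName name in op.Operands.OfType<CName>())
                    used.Add(name.Name);
        }
    }
}
```

The resource dictionary may be shared by several pages, so this should only be applied to the page of a single-page output document, before saving it. Whether the orphaned image objects then disappear from the saved file depends on PDFsharp dropping unreachable objects on save, which this thread suggests happens in at least some scenarios.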

_________________
Regards
Thomas Hoevel
PDFsharp Team

tcc

Posted: Wed Feb 24, 2016 3:54 pm

OK, so PDFsharp does not do that currently, and support for this (I guess not so widely used) feature is not something you are actively working on?
