PDFsharp & MigraDoc Foundation
https://forum.pdfsharp.net/

PDF file size when extracting pages
https://forum.pdfsharp.net/viewtopic.php?f=2&t=3281

Author:  tcc [ Fri Jan 29, 2016 10:04 am ]
Post subject:  PDF file size when extracting pages

Hello

We are using PDFsharp to extract single pages out of a larger document. It seems that the single-page file retains streams (or other objects) that were used in the larger document. The PDF containing the single page is approximately the same size as the original PDF, which contained ~1400 pages, each with some text and one image (a different image on each page).

So for example the original PDF has 1400 pages and is 11 MB in size.
A single, extracted page is still 11 MB in size.

If we use an optimizer, we can bring down the file size to about 120 KB or so.

Is there any way to clean up an extracted page to prevent this?

edit: For the record, the problem exists with both release and debug builds.
edit2: Upon further research, the problem seems to be that all images (from all pages) are preserved. The outputdoc.Internals.AllObjects dictionary contains exactly as many entries that look like images as there are pages in the original document, for example:
Code:
[38] = "dictionary(id=(39 0),[9])=key=9:(/Type /Subtype /Name /Width /Height /BitsPerComponent /ColorSpace /Length /Filter)"
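
The check described in edit2 can be sketched roughly as follows. This is a minimal sketch, not the poster's code: the class name and the command-line path are placeholders, and it assumes that an image is any dictionary whose /Subtype is /Image.

```csharp
using System;
using System.Linq;
using PdfSharp.Pdf;
using PdfSharp.Pdf.IO;

static class ImageObjectCounter
{
    static void Main(string[] args)
    {
        // args[0] is a placeholder: the path of the extracted single-page PDF.
        using (PdfDocument doc = PdfReader.Open(args[0], PdfDocumentOpenMode.ReadOnly))
        {
            // Walk every indirect object and keep the dictionaries
            // that declare themselves as image XObjects.
            int imageCount = doc.Internals.GetAllObjects()
                .OfType<PdfDictionary>()
                .Count(d => d.Elements.GetName("/Subtype") == "/Image");

            Console.WriteLine("Pages: " + doc.PageCount + ", image objects: " + imageCount);
        }
    }
}
```

If the image count of a one-page file matches the page count of the source document, the images of all pages were carried over.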

Author:  Thomas Hoevel [ Mon Feb 01, 2016 4:33 pm ]
Post subject:  Re: PDF file size when extracting pages

Hi!
tcc wrote:
We are using PDFsharp to extract single pages out of a larger document.
You don't show any code. Doing the extraction in a different way may lead to smaller files.

AFAIK PDFsharp removes unreachable objects from the PDF file - at least when opening the file.
So it may lead to smaller files if you open the file with the extracted page for modification and save it again (maybe using a different name to see if there is a difference).

The code to remove unused objects is there, but maybe it is not invoked in all scenarios.
A VS solution that allows us to replicate the problem would help.
https://forum.pdfsharp.net/viewtopic.php?f=2&t=832
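
For readers following along: the extraction code is not shown in the thread, so the following is only a sketch of a common way to copy one page into a new document with PDFsharp. The class name, method name, and parameters are hypothetical.

```csharp
using PdfSharp.Pdf;
using PdfSharp.Pdf.IO;

static class PageExtractor
{
    // Hypothetical helper: copies the page at pageIndex (0-based)
    // from inputPath into a new single-page file at outputPath.
    public static void ExtractPage(string inputPath, int pageIndex, string outputPath)
    {
        // Import mode is needed for pages that will be added to another document.
        using (PdfDocument input = PdfReader.Open(inputPath, PdfDocumentOpenMode.Import))
        using (var output = new PdfDocument())
        {
            // AddPage imports the page together with everything it references.
            output.AddPage(input.Pages[pageIndex]);
            output.Save(outputPath);
        }
    }
}
```

Everything reachable from the page, including its complete resource dictionary, is imported along with it.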

Author:  tcc [ Tue Feb 02, 2016 6:48 am ]
Post subject:  Re: PDF file size when extracting pages

Hi Thomas

I've created a sample application that shows the problem. The original file "Lohnausweise.pdf" (don't worry, these are just test cases, no real data) is about 1.2 MB in size. Each extracted page results in a file of roughly the same size (1.1 MB).

Resaving the file using the following code:

Code:
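// tmp: path of the file containing the extracted page; outputPath: new file name.
using (var pdfInputDoc = PdfReader.Open(tmp, PdfDocumentOpenMode.Modify))
{
    // Re-save without changes to see whether unused objects are dropped.
    pdfInputDoc.Save(outputPath);
}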


does not help either.

Really appreciate your help, thanks.

The file is too large for the forums, so here is the link to my dropbox: https://www.dropbox.com/s/3vkpb84h4n8vr ... t.zip?dl=0

Author:  tcc [ Wed Feb 24, 2016 1:17 pm ]
Post subject:  Re: PDF file size when extracting pages

Hi Thomas

Did you have time to look at this issue?

Author:  Thomas Hoevel [ Wed Feb 24, 2016 1:46 pm ]
Post subject:  Re: PDF file size when extracting pages

Hi!

Sorry for the late response, I must have missed your post.

There are many fonts in the PDF file. Each page has only a few hundred bytes of individual text, while all pages share about 1 MB of embedded fonts.
Therefore each extracted page is about 1 MB in size.

Attachment: forum_16-02-24.png

Author:  tcc [ Wed Feb 24, 2016 2:23 pm ]
Post subject:  Re: PDF file size when extracting pages

Are you sure about this?

We see the same problem with a much larger document (about 600 pages of tax accounting forms) that uses the same fonts. However, each extracted page is now about 10 MB rather than 1 MB (the original file is close to 11 MB in size).

Sorry to nag, but I'm pretty sure it's not the fonts.

Author:  Thomas Hoevel [ Wed Feb 24, 2016 3:45 pm ]
Post subject:  Re: PDF file size when extracting pages

Well, I was wrong.

The PDF with a single page contains all the fonts and all the barcode images. The fonts account for most of the size in the file you gave me.

The problem is that the master PDF file contains a single list with all the images, and each page refers to the complete list. Therefore PDFsharp includes all images.

If the PDF file had a different structure, listing only the required images for each page, then PDFsharp would create smaller files with only the required images.
PDFsharp does not analyze which of the images listed under "/Resources" are actually used on that page. Such an analysis is not trivial.

So you will need a PDF compressor that makes this analysis and removes the orphaned images.
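
To illustrate what such an analysis involves, here is a deliberately simplified sketch in plain C#. It is not part of PDFsharp. It assumes the page content stream is already decoded to text, that tokens are separated by whitespace, and that images are only painted with the "Do" operator at the top level. A real tool would also have to handle string literals, inline images, and form XObjects that reference further images, which is why the analysis is not trivial. All names below are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class XObjectUsage
{
    // Returns the resource names (e.g. "/Im7") that a decoded content
    // stream passes to the "Do" (paint XObject) operator.
    public static HashSet<string> UsedXObjectNames(string content)
    {
        var used = new HashSet<string>();
        // Splitting with a null separator array splits on any whitespace.
        string[] tokens = content.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
        for (int i = 1; i < tokens.Length; i++)
        {
            // In PDF content streams the operand precedes the operator.
            if (tokens[i] == "Do" && tokens[i - 1][0] == '/')
                used.Add(tokens[i - 1]);
        }
        return used;
    }

    // Keeps only the entries of a shared /XObject list that the page uses.
    public static Dictionary<string, T> Prune<T>(IDictionary<string, T> xobjects, ISet<string> used)
    {
        return xobjects
            .Where(kv => used.Contains(kv.Key))
            .ToDictionary(kv => kv.Key, kv => kv.Value);
    }
}
```

Given a shared list with the entries /Im1, /Im2 and /Im7 and the page content "q 100 0 0 50 10 10 cm /Im7 Do Q", only /Im7 would survive pruning. The other images would then no longer be reachable from that page.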

Author:  tcc [ Wed Feb 24, 2016 3:54 pm ]
Post subject:  Re: PDF file size when extracting pages

OK, so PDFsharp does not do that currently, and support for this (I guess not so widely used) feature is not something you are actively working on?

All times are UTC