PDFsharp & MigraDoc Foundation
https://forum.pdfsharp.net/

Compress images in an existing PDF?
https://forum.pdfsharp.net/viewtopic.php?f=2&t=3486

Author:  hattonjohn [ Thu Nov 03, 2016 9:08 pm ]
Post subject:  Compress images in an existing PDF?

Hi,
The open source Windows/Linux book-making software, bloomlibrary.org, uses an embedded Firefox to make PDFs. Unfortunately the resulting PDFs are huge because Firefox saves images using only FlateDecode (zipped). If I run the PDF through Ghostscript, the images get nicely recompressed and are marked as DCTDecode.

However, we want to keep our installer small, and we already ship with PDFsharp. We don't want to add Ghostscript.

Before we dive into this, should it be feasible to open the PDF with PDFsharp, walk through each image, compress it (in C#), and then put the compressed version back? Any advice on how to approach that?

thanks
jh

Author:  phirewind [ Wed May 10, 2017 3:00 pm ]
Post subject:  Re: Compress images in an existing PDF?

Has anyone ever answered this question? I need to do the same thing. I have working code already that converts any image into a JPEG at a certain quality (I'm using 50%), but I've seen this same question asked over and over and there is never a response. "This is not possible with PDFSharp" is a valid answer, if that is the answer, and is much more helpful than silence.

Author:  phirewind [ Thu May 11, 2017 1:31 pm ]
Post subject:  Re: Compress images in an existing PDF?

To ask a more specific question: here is a function adapted from the oft-repeated sample code that walks through all of the images in a PDF. Assume that I have already converted the images to JPG (the ExportToImage function and non-JPG images are a separate question), so all I have to do is read the replacement files from disk and insert them in the right places. The test case is a 20-page document that consists only of scanned paper pages.

The following code does that, and sets the xObject.Elements entries to match what a PDF created with JPG images contains. However, when I open the resulting PDF, I get the message "An error exists on this page. Acrobat may not display the page correctly", and all the pages are blank. The PDF file size suggests it has the right data (it is reduced from 22 MB to 3 MB, matching a copy converted through other desktop applications), but it will not display. I'm assuming there are other steps needed to correct either the xObject or a Resources entry. Any ideas?

Code:
private static void ProcessImagesPDFSharp()
{
   PdfDocument pdf = PdfReader.Open(@"test\test.pdf");

   int imageCount = 0;
   // Iterate pages
   foreach (PdfPage page in pdf.Pages)
   {
      // Get resources dictionary
      PdfDictionary resources = page.Elements.GetDictionary("/Resources");
      if (resources != null)
      {
         // Get external objects dictionary
         PdfDictionary xObjects = resources.Elements.GetDictionary("/XObject");
         if (xObjects != null)
         {
            ICollection<PdfItem> items = xObjects.Elements.Values;
            // Iterate references to external objects
            foreach (PdfItem item in items)
            {
               PdfReference reference = item as PdfReference;
               if (reference != null)
               {
                  PdfDictionary xObject = reference.Value as PdfDictionary;
                  // Is external object an image?
                  if (xObject != null && xObject.Elements.GetString("/Subtype") == "/Image")
                  {
                     // Replace this object with a JPG file
                     xObject.Stream.Value = File.ReadAllBytes($@"test\page {++imageCount}.jpg").ToArray();
                     xObject.Elements.SetValue("/Length", new PdfInteger(xObject.Stream.Value.Length));
                     xObject.Elements.SetValue("/ColorSpace", new PdfString("/DeviceRGB"));
                     xObject.Elements.SetValue("/Filter", new PdfString("/DCTDecode"));
                     xObject.Elements.SetValue("/Type", new PdfString("/XObject"));
                     xObject.Elements.Remove("/DecodeParams");
                  }
               }
            }
         }
      }
   }
   pdf.Save(@"test\out.pdf");
}

Author:  Thomas Hoevel [ Thu May 11, 2017 1:53 pm ]
Post subject:  Re: Compress images in an existing PDF?

Hi!

phirewind wrote:
"This is not possible with PDFSharp" is a valid answer, if that is the answer, and is much more helpful than silence.
PDFsharp is open source and "This is not possible with PDFsharp" is hardly ever a valid answer.
There could be a rather simple solution for files from scanner "Foo 9100" while a general solution will be much more complicated.

phirewind wrote:
However, when I open the resulting PDF, I get the message "An error exists on this page. Acrobat may not display the page correctly"
There is an error in a PDF file which we do not see. So you cannot expect more than speculation from us.
Maybe some other properties are incorrect ("/Width" or "/Height" or something else).
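
One way to rule out wrong "/Width" and "/Height" values is to read them from the replacement JPEG itself instead of keeping whatever the old image dictionary said. The following is only a minimal sketch and not part of PDFsharp: the names JpegHeader and Read are hypothetical, and it assumes a well-formed JPEG whose frame header (an SOFn marker) appears before the scan data.

```csharp
using System;

public static class JpegHeader
{
    // Returns width, height and component count from the first SOFn segment.
    public static (int Width, int Height, int Components) Read(byte[] jpeg)
    {
        if (jpeg == null || jpeg.Length < 4 || jpeg[0] != 0xFF || jpeg[1] != 0xD8)
            throw new ArgumentException("Not a JPEG stream (missing SOI marker).");

        int pos = 2;
        while (pos + 3 < jpeg.Length)
        {
            if (jpeg[pos] != 0xFF)
                throw new FormatException("Expected a marker at offset " + pos + ".");
            byte marker = jpeg[pos + 1];
            if (marker == 0xFF) { pos++; continue; }      // fill byte before a marker
            // TEM and RSTn markers stand alone and carry no length field.
            if (marker == 0x01 || (marker >= 0xD0 && marker <= 0xD7)) { pos += 2; continue; }
            if (marker == 0xD9 || marker == 0xDA) break;  // EOI or SOS: no frame header seen

            int length = (jpeg[pos + 2] << 8) | jpeg[pos + 3];
            // SOF0..SOF15, excluding DHT (C4), JPG (C8) and DAC (CC).
            bool isSof = marker >= 0xC0 && marker <= 0xCF
                         && marker != 0xC4 && marker != 0xC8 && marker != 0xCC;
            if (isSof)
            {
                if (pos + 9 >= jpeg.Length)
                    throw new FormatException("Truncated SOF segment.");
                // Layout: marker(2) length(2) precision(1) height(2) width(2) components(1)
                int height = (jpeg[pos + 5] << 8) | jpeg[pos + 6];
                int width = (jpeg[pos + 7] << 8) | jpeg[pos + 8];
                int components = jpeg[pos + 9];
                return (width, height, components);
            }
            pos += 2 + length;                            // skip marker plus segment
        }
        throw new FormatException("No SOF marker found.");
    }
}
```

The component count also suggests which color space name would be consistent with the data: 1 for /DeviceGray, 3 for /DeviceRGB, 4 for /DeviceCMYK.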

Author:  phirewind [ Thu May 11, 2017 2:07 pm ]
Post subject:  Re: Compress images in an existing PDF?

I will send the file via pm/email, as I had to redact certain proprietary information from the scanned document.

Author:  phirewind [ Thu May 11, 2017 3:16 pm ]
Post subject:  Re: Compress images in an existing PDF?

I was able to sufficiently redact the samples and create a single-page test. The "in.pdf" in this case is actually significantly smaller than the "out.pdf", but the first relevant issue is the ability to replace one image with another and maintain PDF integrity.

and btw thanks in advance for any assistance. I know it helps to have a very specific question to answer.

Attachments:
File comment: Contains in.pdf, out.pdf, and page 1.jpg
sample.zip [175.89 KiB]
Downloaded 399 times

Author:  hattonjohn [ Thu May 11, 2017 3:21 pm ]
Post subject:  Re: Compress images in an existing PDF?

www.pdf-online.com says:

File out.pdf
Compliance pdf1.4
Result Document does not conform to PDF/A.
Details
Validating file "out.pdf" for conformance level pdf1.4
The value of the key Type must not be of type string.
The value of the key Type is (null) but must be XObject.
The value of the key ColorSpace must not be of type string.
The image's sample stream's computed length 1053150 is different to the actual length 118121.
The color space is invalid.
The document does not conform to the requested standard.
The document doesn't conform to the PDF reference (missing required entries, wrong value types, etc.).
Done.
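
One plausible reading of the "computed length" line: because the filter entry was not recognized, the validator treated the stream as uncompressed samples and compared the size implied by the image dictionary with the 118121 bytes of JPEG data actually present. The sketch below shows that arithmetic; ImageLength and ExpectedSampleBytes are hypothetical names, and the dimensions in the usage note are only an illustration, since the thread does not state the real ones.

```csharp
using System;

public static class ImageLength
{
    // Size in bytes of raw (unfiltered) image samples.
    // Each row of samples is padded to a whole number of bytes.
    public static long ExpectedSampleBytes(int width, int height, int components, int bitsPerComponent)
    {
        long bitsPerRow = (long)width * components * bitsPerComponent;
        long bytesPerRow = (bitsPerRow + 7) / 8;
        return bytesPerRow * height;
    }
}
```

For example, ImageLength.ExpectedSampleBytes(595, 590, 3, 8) gives 1053150, which happens to match the reported number for an 8-bit RGB image of that (assumed) size. A stream whose filter is correctly declared as the name /DCTDecode is not expected to have that length.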

Author:  hattonjohn [ Thu May 11, 2017 3:34 pm ]
Post subject:  Re: Compress images in an existing PDF?

I don't know if any of the above PDF/A check is actually relevant... another product, pdfHarmony, said

page: 001 Could not find the XObject named 'Im1';

pdfHarmony reported no errors with your in.pdf.

Author:  phirewind [ Thu May 11, 2017 3:44 pm ]
Post subject:  Re: Compress images in an existing PDF?

The first check may be relevant. I also forgot to correct the image size on the rebuilt sample and had misspelled "/DecodeParms", but I corrected those and had the same issue remain.

When I'm using xObject.Elements.SetValue, it is storing values as objects with a Value property, but not the same type of object (it adds other properties). I checked the values as they were applied in debug step-through to verify. But the PdfItem class doesn't appear to have a public constructor, so I can't use "new PdfItem('value')". I may be chasing red herrings there, and it looks like other people use .SetValue in the same manner successfully, but it is a slightly curious thing. I can also see that it is storing the correct value in the Length key, but there must be some calculation Acrobat performs with the given info that comes up with the wrong estimate and chokes.
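
For context on the types involved: PdfItem is an abstract base class in PDFsharp, so a value is always created through a concrete subclass such as PdfName, PdfInteger or PdfString, and SetValue stores whichever one it is given. A PdfString is written to the file as a literal string, for example (/DCTDecode), while entries like /Filter, /Type and /ColorSpace expect a name. A minimal sketch of the difference, using an in-memory dictionary (the class and method names are hypothetical):

```csharp
using System;
using PdfSharp.Pdf;

public static class NameVersusString
{
    public static PdfDictionary BuildExample()
    {
        var image = new PdfDictionary();

        // Stored as a string object: this is the wrong type for /Filter.
        image.Elements.SetValue("/Filter", new PdfString("/DCTDecode"));

        // Overwrite it with a name object, either explicitly...
        image.Elements.SetValue("/Filter", new PdfName("/DCTDecode"));
        // ...or through the typed helpers on the Elements collection.
        image.Elements.SetName("/ColorSpace", "/DeviceRGB");
        image.Elements.SetInteger("/BitsPerComponent", 8);

        return image;
    }
}
```

Inspecting image.Elements["/Filter"] in the debugger after each SetValue call shows the runtime type changing from PdfString to PdfName.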

Author:  phirewind [ Fri May 26, 2017 2:05 pm ]
Post subject:  Re: Compress images in an existing PDF?

So, no ideas how to replace an image in PDFSharp without corrupting the PDF?

Author:  TH-Soft [ Fri May 26, 2017 6:32 pm ]
Post subject:  Re: Compress images in an existing PDF?

phirewind wrote:
So, no ideas how to replace an image in PDFSharp without corrupting the PDF?
As I understand it, your code corrupts the PDF file by inserting incorrect values and/or values in incorrect formats.
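
Putting the points raised in this thread together (names instead of strings, the "/DecodeParms" spelling, and dimensions that match the new data), a corrected version of the replacement loop might look like the sketch below. It is untested against the sample files and makes several assumptions: ImageReplacer, ApplyJpeg, ReplaceImages, jpegPathFor and readJpegInfo are hypothetical names, the caller supplies the function that reads width, height and component count from the JPEG, the JPEGs use 8 bits per component, each image is referenced from only one page, and entries such as /SMask, /Mask or /Decode are not handled.

```csharp
using System;
using System.IO;
using PdfSharp.Pdf;
using PdfSharp.Pdf.Advanced;
using PdfSharp.Pdf.IO;

public static class ImageReplacer
{
    // Rewrites one image XObject so that its dictionary describes the JPEG data.
    public static void ApplyJpeg(PdfDictionary xObject, byte[] jpeg, int width, int height, int components)
    {
        string colorSpace;
        switch (components)
        {
            case 1: colorSpace = "/DeviceGray"; break;
            case 3: colorSpace = "/DeviceRGB"; break;
            case 4: colorSpace = "/DeviceCMYK"; break;
            default: throw new ArgumentOutOfRangeException(nameof(components));
        }

        if (xObject.Stream == null) xObject.CreateStream(jpeg);
        else xObject.Stream.Value = jpeg;

        xObject.Elements.SetInteger("/Length", jpeg.Length);
        xObject.Elements.SetName("/Type", "/XObject");       // names, not strings
        xObject.Elements.SetName("/Subtype", "/Image");
        xObject.Elements.SetName("/Filter", "/DCTDecode");
        xObject.Elements.SetName("/ColorSpace", colorSpace);
        xObject.Elements.SetInteger("/Width", width);
        xObject.Elements.SetInteger("/Height", height);
        xObject.Elements.SetInteger("/BitsPerComponent", 8); // assumption: 8-bit JPEG
        xObject.Elements.Remove("/DecodeParms");             // belonged to the old filter
    }

    public static void ReplaceImages(string inputPath, string outputPath,
        Func<int, string> jpegPathFor,
        Func<byte[], (int Width, int Height, int Components)> readJpegInfo)
    {
        PdfDocument pdf = PdfReader.Open(inputPath, PdfDocumentOpenMode.Modify);
        int imageCount = 0;
        foreach (PdfPage page in pdf.Pages)
        {
            PdfDictionary resources = page.Elements.GetDictionary("/Resources");
            PdfDictionary xObjects = resources?.Elements.GetDictionary("/XObject");
            if (xObjects == null) continue;

            foreach (PdfItem item in xObjects.Elements.Values)
            {
                PdfDictionary xObject = (item as PdfReference)?.Value as PdfDictionary;
                if (xObject == null || xObject.Elements.GetName("/Subtype") != "/Image")
                    continue;

                byte[] jpeg = File.ReadAllBytes(jpegPathFor(++imageCount));
                var info = readJpegInfo(jpeg);
                ApplyJpeg(xObject, jpeg, info.Width, info.Height, info.Components);
            }
        }
        pdf.Save(outputPath);
    }
}
```

A call in the spirit of the original function would be ImageReplacer.ReplaceImages(@"test\test.pdf", @"test\out.pdf", n => $@"test\page {n}.jpg", readJpegInfo), where readJpegInfo is whatever JPEG header reader is available.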

All times are UTC