PDFsharp & MigraDoc Foundation

PDFsharp - A .NET library for processing PDF & MigraDoc Foundation - Creating documents on the fly

All times are UTC


Posted: Thu Nov 03, 2016 9:08 pm
Joined: Thu Sep 29, 2011 1:39 pm
Posts: 5
Hi,
The open-source Windows/Linux book-making software, bloomlibrary.org, uses an embedded Firefox to make PDFs. Unfortunately, the resulting PDFs are huge because Firefox saves images using only FlateDecode (zipped). If I run the PDF through Ghostscript, it gets nicely compressed again, with the images marked as DCTDecode.

However, we want to keep our installer small, and we already ship with PDFsharp. We don't want to add Ghostscript.

Before we dive into this: is it feasible to open the PDF with PDFsharp, walk through each image, compress it (in C#), and then put the compressed version back? Any advice on how to approach that?

Thanks
jh


Posted: Wed May 10, 2017 3:00 pm
Joined: Wed May 10, 2017 2:35 pm
Posts: 8
Has anyone ever answered this question? I need to do the same thing. I have working code already that converts any image into a JPEG at a certain quality (I'm using 50%), but I've seen this same question asked over and over and there is never a response. "This is not possible with PDFSharp" is a valid answer, if that is the answer, and is much more helpful than silence.


Posted: Thu May 11, 2017 1:31 pm
Joined: Wed May 10, 2017 2:35 pm
Posts: 8
To ask a more specific question: here is a function adapted from the oft-repeated sample code that iterates over all of the images in a PDF. Assume that I have already converted the images to JPG (that is a separate question regarding the ExportToImage function and non-JPG images), so all I have to do is read the replacement files from disk and insert them in the right places. An example would be a 20-page document that consists only of scanned paper pages.

The following code does that, and sets the xObject.Elements entries to match what the PDF would contain had it been created with JPG images. However, when I open the resulting PDF, I get the message "An error exists on this page. Acrobat may not display the page correctly", and all the pages are blank. The PDF file size suggests that it holds the right data (it is reduced from 22 MB to 3 MB, matching a file converted through other desktop applications), but it will not display. I'm assuming there are other steps needed to correct either the xObject or a Resources entry. Any ideas?

Code:
private static void ProcessImagesPDFSharp()
{
   PdfDocument pdf = PdfReader.Open(@"test\test.pdf");

   int imageCount = 0;
   // Iterate pages
   foreach (PdfPage page in pdf.Pages)
   {
      // Get resources dictionary
      PdfDictionary resources = page.Elements.GetDictionary("/Resources");
      if (resources != null)
      {
         // Get external objects dictionary
         PdfDictionary xObjects = resources.Elements.GetDictionary("/XObject");
         if (xObjects != null)
         {
            ICollection<PdfItem> items = xObjects.Elements.Values;
            // Iterate references to external objects
            foreach (PdfItem item in items)
            {
               PdfReference reference = item as PdfReference;
               if (reference != null)
               {
                  PdfDictionary xObject = reference.Value as PdfDictionary;
                  // Is external object an image?
                  if (xObject != null && xObject.Elements.GetString("/Subtype") == "/Image")
                  {
                     // Replace this object with a JPG file
                     xObject.Stream.Value = File.ReadAllBytes($@"test\page {++imageCount}.jpg").ToArray();
                     xObject.Elements.SetValue("/Length", new PdfInteger(xObject.Stream.Value.Length));
                     xObject.Elements.SetValue("/ColorSpace", new PdfString("/DeviceRGB"));
                     xObject.Elements.SetValue("/Filter", new PdfString("/DCTDecode"));
                     xObject.Elements.SetValue("/Type", new PdfString("/XObject"));
                     xObject.Elements.Remove("/DecodeParams");
                  }
               }
            }
         }
      }
   }
   pdf.Save(@"test\out.pdf");
}


Posted: Thu May 11, 2017 1:53 pm
empira Employee
Joined: Mon Oct 16, 2006 8:16 am
Posts: 2720
Location: Cologne, Germany
Hi!

phirewind wrote:
"This is not possible with PDFSharp" is a valid answer, if that is the answer, and is much more helpful than silence.
PDFsharp is open source and "This is not possible with PDFsharp" is hardly ever a valid answer.
There could be a rather simple solution for files from scanner "Foo 9100" while a general solution will be much more complicated.

phirewind wrote:
However, when I open the resulting PDF, I get the message "An error exists on this page. Acrobat may not display the page correctly"
The error is in a PDF file that we have not seen, so you cannot expect more than speculation from us.
Maybe some other properties are incorrect ("/Width" or "/Height" or something else).

_________________
Regards
Thomas Hoevel
PDFsharp Team
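
On the "/Width" or "/Height" point: those entries have to describe the replacement JPEG, not the image that was in the stream before. A minimal sketch of reading the values directly from the JPEG's start-of-frame segment, using only the .NET base library, could look like the following. The names JpegInfo and TryReadFrameHeader are made up for illustration and are not part of PDFsharp.

```csharp
using System;

public static class JpegInfo
{
    // Hypothetical helper: reads sample precision, height, width and component
    // count from the first start-of-frame (SOFn) segment of a JPEG byte array.
    public static bool TryReadFrameHeader(byte[] jpeg, out int width, out int height,
                                          out int components, out int bits)
    {
        width = height = components = bits = 0;

        // Every JPEG starts with the SOI marker FF D8.
        if (jpeg == null || jpeg.Length < 4 || jpeg[0] != 0xFF || jpeg[1] != 0xD8)
            return false;

        int pos = 2;
        while (pos + 4 <= jpeg.Length)
        {
            if (jpeg[pos] != 0xFF) return false;          // not at a marker
            byte marker = jpeg[pos + 1];
            if (marker == 0xFF) { pos++; continue; }      // fill byte before a marker

            // Markers that carry no length field: TEM, RST0-RST7, SOI.
            if (marker == 0x01 || (marker >= 0xD0 && marker <= 0xD8)) { pos += 2; continue; }
            if (marker == 0xD9) return false;             // EOI reached without a frame header

            int segmentLength = (jpeg[pos + 2] << 8) | jpeg[pos + 3];
            if (segmentLength < 2) return false;

            // SOF0-SOF15, except DHT (C4), JPG (C8) and DAC (CC).
            bool isSof = marker >= 0xC0 && marker <= 0xCF
                         && marker != 0xC4 && marker != 0xC8 && marker != 0xCC;
            if (isSof)
            {
                if (pos + 10 > jpeg.Length) return false;
                bits = jpeg[pos + 4];                          // sample precision
                height = (jpeg[pos + 5] << 8) | jpeg[pos + 6]; // number of lines
                width = (jpeg[pos + 7] << 8) | jpeg[pos + 8];  // samples per line
                components = jpeg[pos + 9];                    // 1 = gray, 3 = color
                return true;
            }

            pos += 2 + segmentLength;                     // skip marker plus segment
        }
        return false;
    }
}
```

The width and height would go into "/Width" and "/Height", and the component count hints at the color space (1 for gray, 3 for RGB in the common case).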


Posted: Thu May 11, 2017 2:07 pm
Joined: Wed May 10, 2017 2:35 pm
Posts: 8
I will send the file via pm/email, as I had to redact certain proprietary information from the scanned document.


Posted: Thu May 11, 2017 3:16 pm
Joined: Wed May 10, 2017 2:35 pm
Posts: 8
I was able to sufficiently redact the samples and create a single-page test. The "in.pdf" in this case is actually significantly smaller than the "out.pdf", but the first relevant issue is the ability to replace one image with another and maintain PDF integrity.

And BTW, thanks in advance for any assistance. I know it helps to have a very specific question to answer.


Attachments:
File comment: Contains in.pdf, out.pdf, and page 1.jpg
sample.zip [175.89 KiB]

Posted: Thu May 11, 2017 3:21 pm
Joined: Thu Sep 29, 2011 1:39 pm
Posts: 5
www.pdf-online.com says:

File out.pdf
Compliance pdf1.4
Result Document does not conform to PDF/A.
Details
Validating file "out.pdf" for conformance level pdf1.4
The value of the key Type must not be of type string.
The value of the key Type is (null) but must be XObject.
The value of the key ColorSpace must not be of type string.
The image's sample stream's computed length 1053150 is different to the actual length 118121.
The color space is invalid.
The document does not conform to the requested standard.
The document doesn't conform to the PDF reference (missing required entries, wrong value types, etc.).
Done.
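
One possible reading of the "computed length" line (an assumption, since the validator does not explain itself): it did not accept the filter entry and so compared the 118121 bytes of JPEG data against the size of an uncompressed image, which follows from width, height, number of color components and bits per component. A small sketch of that arithmetic, with a made-up helper name:

```csharp
using System;

public static class ImageStreamMath
{
    // Size in bytes of the decoded sample data of a PDF image.
    // Each row of samples starts on a byte boundary, so rows are padded up.
    public static long ExpectedDecodedLength(int width, int height,
                                             int components, int bitsPerComponent)
    {
        long bitsPerRow = (long)width * components * bitsPerComponent;
        long bytesPerRow = (bitsPerRow + 7) / 8;
        return bytesPerRow * height;
    }

    public static void Main()
    {
        // Hypothetical dimensions, chosen only because 590 * 595 * 3 = 1053150.
        // The real image size is not stated in the thread.
        Console.WriteLine(ExpectedDecodedLength(590, 595, 3, 8)); // prints 1053150
    }
}
```

If that reading is right, the length complaint would be a side effect of the wrong value types reported in the lines above it rather than a separate problem.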


Posted: Thu May 11, 2017 3:34 pm
Joined: Thu Sep 29, 2011 1:39 pm
Posts: 5
I don't know whether any of the above PDF/A check is actually relevant. Another product, pdfHarmony, said:

page: 001 Could not find the XObject named 'Im1';

pdfHarmony reported no errors with your in.pdf.


Posted: Thu May 11, 2017 3:44 pm
Joined: Wed May 10, 2017 2:35 pm
Posts: 8
The first check may be relevant. I had also forgotten to correct the image size on the rebuilt sample and had misspelled "/DecodeParms"; I corrected both, but the same issue remains.

When I use xObject.Elements.SetValue, it stores the values as objects with a Value property, but not as the same type of object as the existing entries (the new ones have other properties). I verified this by checking the values as they were applied while stepping through in the debugger. PdfItem doesn't appear to have a public constructor, though, so I can't use "new PdfItem('value')". I may be chasing red herrings there, since other people seem to use .SetValue in the same manner successfully, but it is slightly curious. I can also see that the correct value is stored under the Length key, so there must be some calculation Acrobat performs with the given info that comes up with the wrong estimate and chokes.
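
For what it's worth, PdfItem is abstract, so values are created through its concrete subclasses such as PdfName, PdfInteger and PdfString. The validator's "must not be of type string" complaints are consistent with PdfString being used where the PDF format expects a name. A hedged sketch of the assignments with names instead of strings follows. ReplaceWithJpeg and JpegXObject are hypothetical names, the caller is assumed to supply the real JPEG dimensions and component count, and this has not been tested against the sample file from this thread.

```csharp
using System;
using PdfSharp.Pdf;

public static class JpegXObject
{
    // Hypothetical helper, not part of PDFsharp: turns an existing image
    // XObject into a DCTDecode (JPEG) image using PDF names and integers.
    public static void ReplaceWithJpeg(PdfDictionary xObject, byte[] jpeg,
                                       int width, int height, int components)
    {
        if (xObject == null) throw new ArgumentNullException(nameof(xObject));
        if (jpeg == null) throw new ArgumentNullException(nameof(jpeg));
        if (xObject.Stream == null)
            throw new ArgumentException("XObject has no stream.", nameof(xObject));

        // Assumption: only gray and RGB JPEGs are handled in this sketch.
        string colorSpace;
        switch (components)
        {
            case 1: colorSpace = "/DeviceGray"; break;
            case 3: colorSpace = "/DeviceRGB"; break;
            default: throw new ArgumentOutOfRangeException(nameof(components));
        }

        // The raw JPEG file bytes become the stream content.
        xObject.Stream.Value = jpeg;
        xObject.Elements.SetInteger("/Length", jpeg.Length);

        // SetName stores a PdfName, which is written as /DCTDecode,
        // whereas a PdfString would be written as a string literal.
        xObject.Elements.SetName("/Type", "/XObject");
        xObject.Elements.SetName("/Subtype", "/Image");
        xObject.Elements.SetName("/Filter", "/DCTDecode");
        xObject.Elements.SetName("/ColorSpace", colorSpace);

        // These must describe the JPEG, not the image it replaces.
        xObject.Elements.SetInteger("/Width", width);
        xObject.Elements.SetInteger("/Height", height);
        xObject.Elements.SetInteger("/BitsPerComponent", 8);

        // Parameters of the old filter and the old decode array no longer apply.
        xObject.Elements.Remove("/DecodeParms");
        xObject.Elements.Remove("/Decode");
    }
}
```

An image that carries a soft mask ("/SMask") or that is shared between several pages may need extra care that this sketch does not attempt.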


Posted: Fri May 26, 2017 2:05 pm
Joined: Wed May 10, 2017 2:35 pm
Posts: 8
So, no ideas how to replace an image in PDFSharp without corrupting the PDF?


Posted: Fri May 26, 2017 6:32 pm
Joined: Sat Mar 14, 2015 10:15 am
Posts: 287
Location: CCAA
phirewind wrote:
So, no ideas how to replace an image in PDFSharp without corrupting the PDF?
As I understand it, your code corrupts the PDF file by inserting incorrect values and/or values in incorrect formats.

_________________
Best regards
Thomas
(Freelance Software Developer with several years of MigraDoc/PDFsharp experience)

