PDFsharp & MigraDoc Foundation

PDFsharp - A .NET library for processing PDF & MigraDoc Foundation - Creating documents on the fly
PostPosted: Mon Dec 07, 2009 8:44 am 
Joined: Mon Dec 07, 2009 8:33 am
Posts: 8
Hi, here is a small part of my batch PDF generator.
It is very specialized for my needs: cloning a directory structure and generating PDFs.
I have made it work with some TIFF files (multi-page ones too).
Code:
// Requires: using System.Drawing; using System.Drawing.Imaging;
//           using PdfSharp.Pdf; using PdfSharp.Drawing;
// 'source', 'destination' and 'chkCompress' come from the surrounding form code.

PdfDocument doc = new PdfDocument();          // initialize the PDF document

Image img2 = Image.FromFile(source);          // load the source image

// As TIFF files may contain several pages, check how many frames there are.
Guid objGuid = img2.FrameDimensionsList[0];
FrameDimension objDimension = new FrameDimension(objGuid);
// Subtract one because the frame index starts at 0, not 1.
int pageCount = img2.GetFrameCount(objDimension) - 1;

// Loop through the frames the file contains.
for (int i = 0; i <= pageCount; i++)
{
    PdfPage page = doc.AddPage();             // start by adding a page
    XGraphics xgr = XGraphics.FromPdfPage(page);

    img2.SelectActiveFrame(objDimension, i);  // activate the i-th frame in the file

    // If the user wants a "compressed" (scaled-down) image
    if (chkCompress.Checked)
    {
        XImage img = XImage.FromGdiPlusImage(img2);
        page.Width = img.PointWidth / 2;      // make the canvas 50%
        page.Height = img.PointHeight / 2;    // make the canvas 50%
        xgr.DrawImage(img, 0, 0, img.PointWidth / 2, img.PointHeight / 2); // draw at 50% size
        // NOTE:
        // While this does indeed shrink the image on the page, the file size stays the same.
        // Some claim that PDF itself cannot compress images, so I may have to convert to JPEG
        // first and build the PDF from that.
        // -- Olav Alexander Mjelde
    }
    else
    {
        XImage img = XImage.FromGdiPlusImage(img2); // fill the XImage with the GDI+ Image
        // Set the canvas and the image to the original size.
        page.Width = img.PointWidth;
        page.Height = img.PointHeight;
        xgr.DrawImage(img, 0, 0);
    }
}

// Dispose of the GDI+ image.
img2.Dispose();

// Save the file and close it.
doc.Save(destination);
doc.Close();

However, I have two problems:
1. Very large TIFF files make the program crash (it uses too much memory).
2. Many (compressed) files will not work.

I have version 1.30 of the GDI+ build.

Any tips on the compression code would also be nice. E.g. do I have to save the image as a temporary JPEG and load it back in, or can I scale it in the C# code before adding it to the PDF page?

The goal is to make PDF variants of scanned TIFFs.
For our usage we can live with the limitation to uncompressed TIFFs, as I found a setting on the scanner that works like a charm.

However, if I try TIFFs around 300 MB in size, the program crashes (out of memory),
even if I open the files in Photoshop and re-save them without compression.


PostPosted: Mon Dec 07, 2009 9:01 am 
Joined: Mon Dec 07, 2009 8:33 am
Posts: 8
I will research the libtiff library a bit; maybe I should use it for broader support of available formats, as people claim the TIFF handling in Windows is less than ideal(?).


PostPosted: Mon Dec 07, 2009 11:02 am 
PDFsharp Guru

Joined: Mon Oct 16, 2006 8:16 am
Posts: 3097
Location: Cologne, Germany
Hi!

Re 1:
This may help (especially if you are getting bitonal images from your scanner):
viewtopic.php?f=3&t=964
The correction is included in PDFsharp 1.31 (coming one of these days).
Switching to the WPF build may also help.

Re 2:
TIFF support is part of the operating system.
Different Windows versions implement different TIFF formats.

If you want to reduce the size of the PDF file, scaling must be done before you pass the image to PDFsharp (this kind of file size reduction is on our TODO list).
PDFsharp 1.31 (coming one of these days) also implements CCITT compression so bitonal images may be smaller in the final PDF file.
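For illustration, here is a minimal sketch of pre-scaling a GDI+ Image before handing it to PDFsharp (the helper name and the 50% factor are just assumptions for this sketch; XImage.FromGdiPlusImage is from the GDI+ build):

Code:
// Minimal sketch, assuming the GDI+ build: downscale the frame before creating the XImage.
// 'ScaleImage' is a hypothetical helper; factor and interpolation mode are assumptions.
static Image ScaleImage(Image original, double factor)
{
    int w = (int)(original.Width * factor);
    int h = (int)(original.Height * factor);
    Bitmap scaled = new Bitmap(w, h);
    using (Graphics g = Graphics.FromImage(scaled))
    {
        g.InterpolationMode = System.Drawing.Drawing2D.InterpolationMode.HighQualityBicubic;
        g.DrawImage(original, 0, 0, w, h);   // resample into the smaller bitmap
    }
    return scaled;
}

// Possible usage inside the frame loop:
// using (Image small = ScaleImage(img2, 0.5))
// {
//     XImage img = XImage.FromGdiPlusImage(small);
//     page.Width = img.PointWidth;
//     page.Height = img.PointHeight;
//     xgr.DrawImage(img, 0, 0);
// }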

_________________
Regards
Thomas Hoevel
PDFsharp Team


PostPosted: Thu Dec 17, 2009 9:55 am 
Joined: Mon Dec 07, 2009 8:33 am
Posts: 8
I am now testing the GDI+ build for its intended use:
6000+ files, in subdirectories.
Input files: TIFF, multi-page, no compression.
Input file size: 8-10 MB per file (300 DPI, grayscale)
Output files: PDF, LZW
Output file size: about 1/10th of the input file size (1:1 pixels)

Performance:
Peak performance: ~50 PDFs per 60 seconds
Avg. performance: 1892 files in 45 minutes = 42 files / 60 seconds
Avg. performance (after 1 hr): 2500 files / 60 minutes = 41 files / 60 seconds

An average of the averages gives about 0.7 files per second.
As the external drive is heavily I/O limited and the program is reading 8-10 MB files, writing 0.8-1 MB files, creating directories, etc., I believe it could reach a higher throughput with larger files and an internal hard drive. Right now it is limited by the hard drive and the file size (the files are too small to test properly), but I find it to be a good number.

Of course I could have built it with a thread pool, as I actually did at first.
But I went back to the BackgroundWorker, mostly because of the hassle of handling all the threads; I also found the thread pool slowed down the processing when there were very many small files (see the sketch below).

On the larger 300+ MB files, I guess a thread pool *might* be better, but then multithreading might hog too much memory too.
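For illustration, a minimal sketch of driving the conversion from a BackgroundWorker (ConvertTiffToPdf, lblStatus and sourceDir are hypothetical names, not from my actual code):

Code:
// Minimal sketch: run the batch conversion on a BackgroundWorker.
// Requires: using System.ComponentModel; using System.IO;
// 'ConvertTiffToPdf', 'lblStatus' and 'sourceDir' are hypothetical.
BackgroundWorker worker = new BackgroundWorker();
worker.WorkerReportsProgress = true;

worker.DoWork += (sender, e) =>
{
    string[] files = (string[])e.Argument;
    for (int i = 0; i < files.Length; i++)
    {
        ConvertTiffToPdf(files[i]);                       // one file at a time
        worker.ReportProgress(i * 100 / files.Length);
    }
};

worker.ProgressChanged += (sender, e) =>
    lblStatus.Text = e.ProgressPercentage + "%";          // runs on the UI thread

worker.RunWorkerAsync(Directory.GetFiles(sourceDir, "*.tif", SearchOption.AllDirectories));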

These are not final numbers; it is still processing.
Also, I had a lot of applications running at first (Photoshop, Visual Studio, Outlook, Opera, etc.), which stole some CPU until I started closing them down.

However, the main "bad" thing about my test is that both my source and destination folders are on an old external hard drive (USB 2.0), so for most people it would be much faster.

Also, the PC I am testing on is only an AMD 5200+ with 4 GB RAM.
My application peaks at ~100 MB of RAM usage;
average RAM usage is about 60 ± 10 MB.

Peak CPU usage is ~44%, average 22% ± 10%.
The low numbers are of course affected by the I/O capabilities of the external hard drive, so my test cannot be used as a "real" benchmark of PDFsharp's performance.

That was not my goal either; I wanted a real-life test for us.
The workflow would be over-complicated if we had to move the files off that hard drive again.

Anyhow, I plan on testing the 1.31 WPF build, hoping I can then run this on 300-500 MB TIFF files from A0 scanners and other equipment like book scanners.

Replicating the source path is an important step, and it is now handled with ease.
As I said to my co-workers a couple of minutes ago: "I shouldn't have told you about the application, I should have just taken the 6000+ files to my office and pretended to work manually for a year." It was of course said as a joke :-) but it makes a point: this would have been so much manual labor, as we have to mirror the directory structure for the PDFs. Now it's easy. I already love my simple program batchPDF :-) But I could not have done it without PDFsharp!

So thanks a MILLION! (And sorry for the overly long post, but I had a lot to say.)


PostPosted: Thu Apr 08, 2010 3:01 pm 
Joined: Mon Dec 07, 2009 8:33 am
Posts: 8
Again, this is not a support post. I wanted to make a "my application" post, but I can't find the user group you have to add yourself to (the trusted user group?).

Anyhow, I now have two PDF applications and they work very well.

This is "Drag & Drop PDF":
[screenshot of the Drag & Drop PDF main window]

It is multithreaded with a thread pool and now supports JPEG, BMP, PNG, TIFF and PDF.
You just drop files into the list box and it filters out the incompatible files (see the sketch below).

Then the user can sort the list both manually (many files at a time, if desired) and automatically.
Excluding files is also possible; they are then moved to a list in the "Ekskluderte" (excluded) tab.
From the excluded list one can also move files back into the "Arbeidsbok" (work book).
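A minimal sketch of that kind of extension filtering on drop (the extension list and control names are assumptions, not my actual code):

Code:
// Minimal sketch: keep only files with supported extensions when they are dropped.
// Requires: using System.IO; using System.Windows.Forms;
// 'lstFiles' and the extension list are hypothetical.
static readonly string[] Supported = { ".jpg", ".jpeg", ".bmp", ".png", ".tif", ".tiff", ".pdf" };

void OnDragDrop(object sender, DragEventArgs e)
{
    string[] dropped = (string[])e.Data.GetData(DataFormats.FileDrop);
    foreach (string path in dropped)
    {
        string ext = Path.GetExtension(path).ToLowerInvariant();
        if (Array.IndexOf(Supported, ext) >= 0)
            lstFiles.Items.Add(path);      // compatible: goes into the work book
        // incompatible files are simply ignored
    }
}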

"Quality" is actually just a dynamic resizer, e.g. 60% quality = 60% of the pixels.

This application is made for combining many files into one large PDF.
At first I did all the management in memory, with no swapping to the hard drive, but on some really large collections of files it would run out of memory and crash.

So I changed the code; it now first generates temp files in a temp folder on the drive.
It checks if the folder exists and, if it does, renames the tmp folder to the first free tmp[x].
Then it creates a new, fresh tmp/ folder.

If the user has chosen "Behold løse" (keep loose files), it afterwards looks for the subfolder pdf_1-1.
If it exists, it renames pdf_1-1 to the first free pdf_1-1[x] (found in a loop, of course).
Once that is renamed, it moves tmp to pdf_1-1.

The temp files are also named like this:
[x][y]oemfilename.ext.pdf

The x is the position the file has in the combined PDF; the y is the quality (% of the original file).
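A minimal sketch of the "rename to the first free name" idea described above (the helper and base name are just for illustration):

Code:
// Minimal sketch: rename an existing folder to the first free "name[x]" and recreate it.
// Requires: using System.IO; the base name "tmp" matches the description above.
static void ArchiveAndRecreate(string baseDir, string name)
{
    string folder = Path.Combine(baseDir, name);
    if (Directory.Exists(folder))
    {
        int x = 1;
        while (Directory.Exists(Path.Combine(baseDir, name + "[" + x + "]")))
            x++;                                            // find the first free suffix
        Directory.Move(folder, Path.Combine(baseDir, name + "[" + x + "]"));
    }
    Directory.CreateDirectory(folder);                      // fresh, empty folder
}

// e.g. ArchiveAndRecreate(workDir, "tmp");  // the same pattern could apply to "pdf_1-1"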

I know this must seem like an odd application, but in the government office where I work we often have to make many versions of the same files, and we also want to use the files later.

Now it's very easy to do: just drag & drop, sort, and hit "generate PDF".
It archives the old versions automatically and it's quite fast (it seems way faster than Adobe Acrobat).

I found it works faster with temp files than doing it all in memory.

PS: we primarily work with large TIFF files.

My other application has a much nicer GUI; it is called batchPDF.
[screenshot of the batchPDF GUI]

batchPDF is much more limited in functionality.
batchPDF only makes 1:1 (input:output) PDFs, but in one way it is very special:

batchPDF replicates the folder structure 1:1 too!
I made batchPDF to convert about 20,000 TIFF files, I think it was.
We needed the PDF files in the same directory structure, with the same names (except the extension), as the TIFF files.
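A minimal sketch of that kind of 1:1 structure replication (the conversion helper is hypothetical; only the mirroring idea is shown):

Code:
// Minimal sketch: mirror a directory tree, converting every TIFF to a PDF with the same
// relative path and name. Requires: using System.IO;
// 'ConvertTiffToPdf(src, dst)' is a hypothetical helper.
static void MirrorAndConvert(string srcRoot, string dstRoot)
{
    foreach (string srcFile in Directory.GetFiles(srcRoot, "*.tif", SearchOption.AllDirectories))
    {
        // Rebuild the relative path under the destination root.
        string relative = srcFile.Substring(srcRoot.Length).TrimStart(Path.DirectorySeparatorChar);
        string dstFile = Path.Combine(dstRoot, Path.ChangeExtension(relative, ".pdf"));

        Directory.CreateDirectory(Path.GetDirectoryName(dstFile)); // clone the folder structure
        ConvertTiffToPdf(srcFile, dstFile);                        // same name, new extension
    }
}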

It took me about 5 hours to make batchPDF.
I made the GUI graphics in Photoshop; as a progress indicator it uses a jumping "dot" next to the batchPDF text in the lower right corner.

It has a quality input like the other PDF generator, but this one also has a "JPEG Compression" check box (if unticked, it uses TIFF).

batchPDF does not support anything other than image files.
The large arrow button (also made in Photoshop) copies the text from the txtSRC field to the txtDST field.
One might think this is a silly button to have, but batchPDF is primarily used with identical SRC and DST values.

batchPDF is also multithreaded, though it does not use a thread pool as the other converter does.
There is a live (while typing!) validation check on txtSRC and txtDST, to see whether the text is a valid path.

So if you type just "C", it is not valid yet (and shown in red).
"c:" however is a valid path (danger, danger: it crawls subdirectories, lol).
[screenshot of the live path validation]
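A minimal sketch of such live path validation (the control names are hypothetical):

Code:
// Minimal sketch: validate the path while the user types and color the text box.
// Requires: using System.Drawing; using System.IO; using System.Windows.Forms;
// 'txtSRC' and 'btnGenerate' are hypothetical controls.
void txtSRC_TextChanged(object sender, EventArgs e)
{
    bool valid = Directory.Exists(txtSRC.Text);
    txtSRC.ForeColor = valid ? Color.Black : Color.Red;   // red until the path exists
    btnGenerate.Enabled = valid;                           // only allow generating for valid paths
}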

Anyhow, I just wanted to share my success and how much I enjoy working with this.
I have saved us a lot of manual labor, and I can also share these applications here without license issues.

Let the machines do your boring, repetitive work :-)

