PDFsharp & MigraDoc Foundation
https://forum.pdfsharp.net/

Merging multiple PDFs while minimizing memory use
https://forum.pdfsharp.net/viewtopic.php?f=2&t=4259

Author:  tm8747a [ Tue Jun 08, 2021 7:27 pm ]
Post subject:  Merging multiple PDFs while minimizing memory use

I'm using PDFsharp to merge many PDFs (stored on disk) into one PDF. The resulting PDF can sometimes be as large as 700 MB. I'm using the provided sample code, which basically creates an output PdfDocument, adds pages to it, and then calls outputDocument.Save(destinationPath), so the amount of memory used is about the same as the size of the document produced. Here's a link to the sample:

http://www.pdfsharp.net/wiki/concatenat ... ample.ashx

I tried passing a FileStream to the PdfDocument constructor when creating the output, but that did not seem to work. Somebody suggested that I write a certain number of files, close the PDF, re-open it using PdfReader.Open(), and continue. I'm not sure how that would help, since as far as I know PdfReader.Open() loads the whole document into memory, but I tried it anyway and, sure enough, memory consumption did not appear to decrease.

Below is the code for a simple console app that tries to merge 2000 files. It saves and closes the output document every 500 files, then re-opens it. I'm using PDFsharp-MigraDoc-gdi 1.50.5147, targeting .NET Framework 4.5.

If this cannot be done with PDFsharp, would MigraDoc be any help?

Code:
using System;
using System.Collections.Generic;
using System.IO;
using PdfSharp.Pdf;
using PdfSharp.Pdf.IO;

namespace PdfSharpMergeTest
{
    class Program
    {
        public static void Main(string[] args)
        {
            var files = new List<string>();
            var basePath = AppDomain.CurrentDomain.BaseDirectory;

            // Merge the same sample file 2000 times.
            for (var i = 0; i < 2000; i++)
            {
                files.Add(Path.Combine(basePath, "sample.pdf"));
            }
            DoMerge(files, Path.Combine(basePath, "output.pdf"));
        }

        private static void DoMerge(List<string> paths, string destinationFile)
        {

            var directory = Path.GetDirectoryName(destinationFile);

            if (!Directory.Exists(directory))
            {
                Directory.CreateDirectory(directory);
            }

            var outputDocument = new PdfDocument();
            var count = 0;

            // Iterate files
            foreach (string path in paths)
            {
                // Open the document to import pages from it.
                try
                {
                    var inputDocument = PdfReader.Open(path, PdfDocumentOpenMode.Import);

                    // Iterate pages
                    for (int idx = 0; idx < inputDocument.PageCount; idx++)
                    {
                        // Get the page from the external document...
                        PdfPage page = inputDocument.Pages[idx];
                        // ...and add it to the output document.
                        outputDocument.AddPage(page);
                    }

                    inputDocument.Dispose();
                   
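                    count++;
                    // Every 500 files, and after the last file, save and close the output.
                    if (count % 500 == 0 || count == paths.Count)
                    {
                        outputDocument.Save(destinationFile);
                        outputDocument.Dispose();

                        // If files remain, re-open the saved output to keep appending to it.
                        if (count < paths.Count)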
                        {
                            outputDocument = PdfReader.Open(destinationFile, PdfDocumentOpenMode.Import);
                        }
                    }
                }
                catch (Exception ex)
                {
                    Console.WriteLine(ex.Message);
                    Console.WriteLine(ex.StackTrace);
                }
            }
        }
    }
}
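
For comparison, here is a minimal sketch of the same save, close, and re-open loop that re-opens the output with PdfDocumentOpenMode.Modify instead of PdfDocumentOpenMode.Import. PDFsharp's documentation describes Modify as the mode for changing an existing document and Import as the mode for documents that only serve as a page source. BatchMerge, Merge, and the interval parameter are hypothetical names used for illustration, and whether this changes memory consumption at all is untested here.

```csharp
using System.Collections.Generic;
using PdfSharp.Pdf;
using PdfSharp.Pdf.IO;

static class BatchMerge
{
    // Hypothetical helper (not part of PDFsharp): merges the given files and
    // saves/closes the output after every `interval` input files.
    public static void Merge(IList<string> paths, string destinationFile, int interval)
    {
        var output = new PdfDocument();

        for (var i = 0; i < paths.Count; i++)
        {
            // Source documents are opened in Import mode so their pages can be copied.
            using (var input = PdfReader.Open(paths[i], PdfDocumentOpenMode.Import))
            {
                for (var idx = 0; idx < input.PageCount; idx++)
                {
                    output.AddPage(input.Pages[idx]);
                }
            }

            var done = i + 1;
            var isLast = done == paths.Count;
            if (done % interval == 0 || isLast)
            {
                output.Save(destinationFile);
                output.Dispose();

                // Assumption: Modify is the intended mode for a document
                // that will receive more pages and be saved again.
                if (!isLast)
                {
                    output = PdfReader.Open(destinationFile, PdfDocumentOpenMode.Modify);
                }
            }
        }
    }
}
```

A call such as BatchMerge.Merge(files, destinationFile, 500) would correspond to the 500-file interval used in the code above.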

Author:  TH-Soft [ Wed Jun 09, 2021 7:55 am ]

Discussed and answered on SO:
https://stackoverflow.com/a/67885787/162529

Author:  tm8747a [ Wed Jun 09, 2021 1:59 pm ]

TH-Soft wrote:
Discussed and answered on SO:
https://stackoverflow.com/a/67885787/162529


Perhaps I'm doing something wrong? Am I closing and re-opening the right way? I tried a variety of intervals; here are my results.

4-page, 140 KB sample file; produces a 273 MB output file

no interval: 21 sec, max memory 330 MB
1000 interval: 30 sec, max memory 490 MB
500 interval: 55 sec, max memory 710 MB
250 interval: 1 min 35 sec, max memory 780 MB
100 interval: 2 min 55 sec, max memory 850 MB

So not only is reducing the interval making memory use worse, it's also slowing the application down significantly. I expected the slowdown, since I assume closing and re-opening is a fairly expensive operation, but it's buying me nothing on the memory front; it's actually making things worse. I do see memory drop as things run, but it invariably climbs back higher and higher.
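
The figures above do not say how "max memory" was measured. As a rough sketch, one way to log comparable numbers from inside the console app is to print the managed heap size and the process's peak working set after each save. MemoryProbe and its methods are hypothetical names, not part of PDFsharp; only standard .NET calls are used.

```csharp
using System;
using System.Diagnostics;

static class MemoryProbe
{
    // Converts a byte count to megabytes (1 MB = 1024 * 1024 bytes) for display.
    public static double ToMegabytes(long bytes)
    {
        return bytes / (1024.0 * 1024.0);
    }

    // Prints the current managed heap size and the peak working set of this process.
    public static void Report(string label)
    {
        using (var process = Process.GetCurrentProcess())
        {
            var managed = GC.GetTotalMemory(false);  // managed heap only, no forced collection
            var peak = process.PeakWorkingSet64;     // highest physical memory use so far
            Console.WriteLine("{0}: managed {1:F0} MB, peak working set {2:F0} MB",
                label, ToMegabytes(managed), ToMegabytes(peak));
        }
    }
}
```

Calling MemoryProbe.Report("after save") right after each outputDocument.Save(...) would show whether memory actually drops at each interval or keeps climbing.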

Author:  tm8747a [ Thu Jun 10, 2021 1:15 pm ]

Is anybody able to confirm that the code above would be the correct way to close and re-open the file? I'm perfectly fine with the result being "what you are trying to accomplish is not possible", I just don't want to give up if there's something else I could be doing. I think I'll be able to live with the memory consumption, but if I can avoid any risk I will.

All times are UTC