PDFsharp & MigraDoc Foundation
https://forum.pdfsharp.net/

Find out page number from bookmark entries and split
https://forum.pdfsharp.net/viewtopic.php?f=2&t=3663

Author:  hamzas [ Thu Sep 14, 2017 12:43 pm ]
Post subject:  Find out page number from bookmark entries and split

Hi folks,

I am trying to split a PDF file at its bookmarks using PDFsharp. I can get a list of the bookmarks, but I cannot figure out which page number each bookmark points to.

Back story: One of the engineering software packages we use generates a single PDF file that actually consists of three separate documents. In the infinite wisdom of this enterprise software company, they don't let you split these and save them as separate PDFs. We already post-process a couple of other quirks, so I have a small utility that engineers run their output files through, and I'd like to add the functionality to split that combined PDF into separate documents.

The example PDF file I am working with has three top-level bookmarks, on pages 1, 5 and 6. I can list the bookmarks with the snippet below, but I couldn't figure out a way to map a bookmark to a page number.

Splitting the PDF seems to be fairly well documented. What I am stuck on is how to map bookmarks to page numbers.

Test Code:

Code:
using System;
using PdfSharp.Pdf;
using PdfSharp.Pdf.IO;

using (PdfDocument document = PdfReader.Open("test.pdf", PdfDocumentOpenMode.Import))
{
    Console.WriteLine("Page count: " + document.PageCount);

    // Dump each page dictionary. Any hierarchy info on the page itself? Doesn't seem to have any.
    foreach (var page in document.Pages)
    {
        Console.WriteLine(page.ToString());
    }

    // "/Outlines" is the root of the bookmark tree; it is missing if the file has no bookmarks.
    PdfDictionary outline = document.Internals.Catalog.Elements.GetDictionary("/Outlines");
    if (outline != null)
    {
        // Top-level bookmarks form a linked list: start at "/First", then follow "/Next".
        for (PdfDictionary child = outline.Elements.GetDictionary("/First"); child != null; child = child.Elements.GetDictionary("/Next"))
        {
            Console.WriteLine(child.Elements.GetString("/Title"));

            // FIXME: get page numbers?
        }
    }
}


Results in:

Code:
Page count: 9
<< /Contents [ 1019 0 R ] /Group << /CS /DeviceRGB /S /Transparency >> /MediaBox [ 0 0 3874 2667 ] /Parent 1 0 R /Resources 1018 0 R /Type /Page >>
<< /Contents [ 1022 0 R ] /Group << /CS /DeviceRGB /S /Transparency >> /MediaBox [ 0 0 3874 2667 ] /Parent 1 0 R /Resources 1021 0 R /Type /Page >>
<< /Contents [ 1025 0 R ] /Group << /CS /DeviceRGB /S /Transparency >> /MediaBox [ 0 0 3874 2667 ] /Parent 1 0 R /Resources 1024 0 R /Type /Page >>
<< /Contents [ 1028 0 R ] /Group << /CS /DeviceRGB /S /Transparency >> /MediaBox [ 0 0 3874 2667 ] /Parent 1 0 R /Resources 1027 0 R /Type /Page >>
<< /Contents [ 1032 0 R ] /Group << /CS /DeviceRGB /S /Transparency >> /MediaBox [ 0 0 842 595 ] /Parent 1 0 R /Resources 1031 0 R /Type /Page >>
<< /Annots [ 46 0 R 48 0 R 50 0 R 52 0 R 54 0 R 56 0 R 58 0 R 60 0 R 62 0 R 64 0 R 66 0 R 68 0 R 70 0 R 72 0 R 74 0 R ] /Contents [ 1043 0 R ] /Group << /CS /DeviceRGB /S /Transparency >> /MediaBox [ 0 0 1130 799 ] /Parent 1 0 R /Resources 1042 0 R /Type /Page >>
<< /Annots [ 82 0 R 84 0 R 86 0 R 88 0 R 90 0 R 92 0 R 94 0 R 96 0 R 98 0 R 100 0 R 102 0 R 104 0 R 106 0 R 108 0 R 110 0 R 112 0 R 114 0 R 116 0 R 118 0 R 120 0 R 122 0 R 124 0 R 126 0 R 128 0 R 130 0 R 132 0 R 134 0 R 136 0 R 138 0 R 140 0 R 142 0 R 144 0 R 146 0 R 148 0 R 150 0 R 152 0 R 154 0 R 156 0 R 158 0 R ] /Contents [ 1048 0 R ] /Group << /CS /DeviceRGB /S /Transparency >> /MediaBox [ 0 0 1130 799 ] /Parent 1 0 R /Resources 1047 0 R /Type /Page >>
<< /Annots [ 166 0 R 168 0 R 170 0 R 172 0 R 174 0 R 176 0 R 178 0 R 180 0 R 182 0 R ] /Contents [ 1053 0 R ] /Group << /CS /DeviceRGB /S /Transparency >> /MediaBox [ 0 0 1130 799 ] /Parent 1 0 R /Resources 1052 0 R /Type /Page >>
<< /Annots [ 190 0 R 192 0 R 194 0 R 196 0 R ] /Contents [ 1058 0 R ] /Group << /CS /DeviceRGB /S /Transparency >> /MediaBox [ 0 0 1130 799 ] /Parent 1 0 R /Resources 1057 0 R /Type /Page >>
Bookmark 1
Bookmark 2
Bookmark 3


From looking at the file manually, I know the three top-level bookmarks are on pages 1 (Bookmark 1), 5 (Bookmark 2) and 6 (Bookmark 3). How can I extract this information using PDFsharp?

Thanks for any pointers.

Author:  Thomas Hoevel [ Thu Sep 14, 2017 1:13 pm ]
Post subject:  Re: Find out page number from bookmark entries and split

Hi!
hamzas wrote:
Thanks for any pointers.
Those outlines (bookmark entries) may have an Action entry "/A" or a Destination entry "/Dest". The latter contains the page reference directly; the former should be a GoTo action with a page reference.

Analyse the outline elements in the debugger and see which properties allow you to get "/Dest" or "/A".
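
A minimal sketch of that lookup, assuming the destination is an explicit array whose first element is the page reference. The class and method names (OutlineDestinations, GetTargetPage) are made up for illustration and are not part of PDFsharp. Named destinations (a name or string in "/Dest") are not resolved here; the method simply returns null for them.

```csharp
using PdfSharp.Pdf;
using PdfSharp.Pdf.Advanced;

static class OutlineDestinations
{
    // Hypothetical helper: returns the reference to the target page of an
    // outline item, or null if no explicit destination array is found.
    public static PdfReference GetTargetPage(PdfDictionary outlineItem)
    {
        // Case 1: the outline item carries the destination directly in "/Dest".
        PdfArray dest = outlineItem.Elements.GetArray("/Dest");

        // Case 2: the outline item has an action "/A"; a GoTo action keeps
        // its destination in "/D". The action type "/S" is not checked here.
        if (dest == null)
        {
            PdfDictionary action = outlineItem.Elements.GetDictionary("/A");
            if (action != null)
                dest = action.Elements.GetArray("/D");
        }

        if (dest == null || dest.Elements.Count == 0)
            return null;

        // Assumption: element 0 of the destination array is the page reference.
        return dest.Elements[0] as PdfReference;
    }
}
```

Inside the loop over the outline items, this would be called as OutlineDestinations.GetTargetPage(child). A null result means the bookmark needs a closer look in the debugger.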

Author:  hamzas [ Thu Sep 14, 2017 2:20 pm ]
Post subject:  Re: Find out page number from bookmark entries and split

Thanks for the help, Thomas.

I am not very familiar with the PDF file format but according to this doc (http://www.pdfsharp.net/wiki/WorkOnPdfO ... ample.ashx) I should be looking for "/S" and "/D", perhaps?

Traversing through one of the outline objects and looking for "/A" as you've suggested, I did get the following:

[Image: screenshot of the outline object's "/A" entry, not preserved]

Still don't seem to get page numbers, per se. Is
Code:
iref(39, 0)
the secret? How can I map this to an actual page number?

Cheers.

Author:  Thomas Hoevel [ Thu Sep 14, 2017 3:33 pm ]
Post subject:  Re: Find out page number from bookmark entries and split

Your page dump does not include the object IDs of the page objects. One of the pages will have the ID "39 0", and its position in the page collection tells you the page number in the PDF.

Bookmarks with "/A" are rare, deprecated by Adobe, less compatible, and require more bytes. With most other PDF files you will find the "/Dest" element at the outline.

With outlines, the destination is "/Dest". With actions, the destination is "/D".
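
One way to do that comparison, as a sketch only: walk document.Pages and compare each page's object ID with the reference taken from the bookmark. The names PageLookup and FindPageNumber are hypothetical. The sketch also assumes that every page read from a file has a non-null Reference.

```csharp
using PdfSharp.Pdf;
using PdfSharp.Pdf.Advanced;

static class PageLookup
{
    // Hypothetical helper: returns the 1-based page number of the page whose
    // object ID matches the given reference, or -1 if no page matches.
    public static int FindPageNumber(PdfDocument document, PdfReference target)
    {
        for (int i = 0; i < document.PageCount; i++)
        {
            PdfReference pageRef = document.Pages[i].Reference;

            // Compare object number and generation number, e.g. "39 0".
            if (pageRef != null
                && pageRef.ObjectNumber == target.ObjectNumber
                && pageRef.GenerationNumber == target.GenerationNumber)
            {
                return i + 1;
            }
        }
        return -1;
    }
}
```

The index in document.Pages is zero-based, so the sketch adds 1 to get the page number a PDF viewer would show.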

Author:  hamzas [ Thu Sep 14, 2017 10:40 pm ]
Post subject:  Re: Find out page number from bookmark entries and split

Thanks for the pointers Thomas, looks like I have some more digging to do!
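
For the splitting step, once the start page of each top-level bookmark is known (1, 5 and 6 in the 9-page example from the first post), the remaining logic is plain arithmetic. The sketch below is not PDFsharp code; SplitPlan and ToRanges are illustrative names, and it assumes the start pages are sorted and 1-based.

```csharp
using System;
using System.Collections.Generic;

static class SplitPlan
{
    // Turns sorted, 1-based start pages into inclusive (First, Last) page ranges.
    public static List<(int First, int Last)> ToRanges(IReadOnlyList<int> startPages, int pageCount)
    {
        var ranges = new List<(int First, int Last)>();
        for (int i = 0; i < startPages.Count; i++)
        {
            int first = startPages[i];
            // A section ends right before the next bookmark, or on the last page.
            int last = i + 1 < startPages.Count ? startPages[i + 1] - 1 : pageCount;
            ranges.Add((first, last));
        }
        return ranges;
    }
}
```

SplitPlan.ToRanges(new[] { 1, 5, 6 }, 9) yields the ranges 1-4, 5-5 and 6-9. Each range can then be copied into a new PdfDocument with AddPage, as in the documented split sample.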

All times are UTC