PDFsharp & MigraDoc Foundation
https://forum.pdfsharp.net/

Failure to retrieve text of PDF documents
https://forum.pdfsharp.net/viewtopic.php?f=2&t=3961

Author:  rdunnill [ Fri May 17, 2019 6:17 am ]
Post subject:  Failure to retrieve text of PDF documents

We use PDFSharp to process PDFs created with a Crystal Reports report. Our processing involves opening a PDF document with PDFSharp and using it to extract the text from each page, then searching that text for specific markers inserted during creation as delineators. This process works fine with PDFs created with Crystal Reports versions prior to SP23; however, with documents created with SP23, the read returns garbled text, so the delineators cannot be found.

Chrome, Firefox and Beyond Compare can read these new documents without issue. What can be done to fix this problem so that we can continue to use PDFSharp for our processing?

Author:  rdunnill [ Fri May 17, 2019 3:58 pm ]
Post subject:  ContentReader.ReadContent() returns garbled text

We use PDFSharp to process PDFs created with a Crystal Reports report. Our processing involves opening a PDF document with PDFSharp and using PDFSharp (ContentReader.ReadContent()) to extract the text from each page, then searching that text for specific markers inserted during creation as delineators. This process works fine with PDFs created with Crystal Reports versions prior to SP23; however, with documents created with SP23, the read returns garbled text, so the delineators cannot be found.

Chrome, Firefox and Beyond Compare can read these new documents without issue. What can be done to fix this problem so that we can continue to use PDFSharp for our processing?

Author:  rjdunnill [ Tue May 21, 2019 11:18 pm ]
Post subject:  Re: Failure to retrieve text of PDF documents

On further analysis, this seems to be happening because the document's Tj operator calls use text consisting of indexes instead of ASCII characters.

Is there a setting or parameter that tells PdfSharp to interpret the document text as such?
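
A short sketch of why such operands look garbled when read as plain characters. It assumes the font uses a two-byte encoding (for example Identity-H), which the thread does not confirm; the sample operand and the function names are made up for illustration and are not part of PDFsharp.

```python
def operand_bytes(hex_operand: str) -> bytes:
    """Convert a PDF hex string such as '<0024 0025>' to raw bytes."""
    digits = "".join(hex_operand.strip().strip("<>").split())
    # The PDF specification treats a missing final hex digit as 0.
    if len(digits) % 2:
        digits += "0"
    return bytes.fromhex(digits)


def as_single_byte_text(hex_operand: str) -> str:
    """Naive reading: one character per byte (what looks 'garbled')."""
    return operand_bytes(hex_operand).decode("latin-1")


def as_two_byte_codes(hex_operand: str) -> list[int]:
    """Assumed reading: each pair of bytes is one big-endian character code."""
    raw = operand_bytes(hex_operand)
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]


if __name__ == "__main__":
    sample = "<002400250026>"  # hypothetical Tj operand
    print(repr(as_single_byte_text(sample)))  # NUL bytes interleaved with symbols
    print(as_two_byte_codes(sample))          # codes that still need a lookup table
```

The codes returned by the second reading are not text yet; they only become characters once they are looked up in the font's mapping.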

Author:  Thomas Hoevel [ Wed May 22, 2019 8:59 am ]
Post subject:  Re: Failure to retrieve text of PDF documents

rjdunnill wrote:
Is there a setting or parameter that tells PdfSharp to interpret the document text as such?
Since PDFsharp does not render PDF it has only limited support for analyzing the instructions that draw the page.
Not my area of expertise, but I'm afraid you'll have to write code to decode the Tj parameters.
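
What "code to decode the Tj parameters" might look like, as a rough, library-neutral sketch. It assumes the content stream is already available as text, that every code is two bytes wide, and that a code-to-character table is already known. The table, the sample stream, and the function name are hypothetical. A real decoder would also have to track the current font (Tf), handle TJ arrays and literal strings, and use a proper tokenizer instead of a regular expression.

```python
import re

# Matches a hex string operand immediately followed by the Tj operator.
TJ_HEX = re.compile(r"<([0-9A-Fa-f\s]*)>\s*Tj")


def decode_tj_operands(content: str, to_unicode: dict[int, str],
                       bytes_per_code: int = 2) -> list[str]:
    """Return the decoded text of every '<...> Tj' found in a content stream."""
    decoded = []
    for hex_digits in TJ_HEX.findall(content):
        digits = "".join(hex_digits.split())
        if len(digits) % 2:          # a missing final digit counts as 0
            digits += "0"
        raw = bytes.fromhex(digits)
        codes = [int.from_bytes(raw[i:i + bytes_per_code], "big")
                 for i in range(0, len(raw), bytes_per_code)]
        # Unknown codes become U+FFFD so gaps stay visible in the output.
        decoded.append("".join(to_unicode.get(c, "\ufffd") for c in codes))
    return decoded


if __name__ == "__main__":
    table = {0x2B: "H", 0x28: "E"}                       # hypothetical mapping
    stream = "BT /F1 12 Tf 72 720 Td <002B0028> Tj ET"   # hypothetical stream
    print(decode_tj_operands(stream, table))
```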

Author:  rjdunnill [ Wed May 22, 2019 5:15 pm ]
Post subject:  Re: Failure to retrieve text of PDF documents

Our algorithm opens the PDF document, and reads the content of each page, searching said content for a particular tag. Our problem is that the new-format documents are Unicode, and hence the read content consists of indexes and not text. Shouldn't PDFSharp be internally converting the indexes to their respective characters?

Author:  Thomas Hoevel [ Thu May 23, 2019 8:13 am ]
Post subject:  Re: Failure to retrieve text of PDF documents

rjdunnill wrote:
Shouldn't PDFsharp be internally converting the indexes to their respective characters?
I fully understand that this would be convenient for you, but since PDFsharp does not do anything with the strings (yet), this functionality is not yet included in PDFsharp.
Feel free to share your code if you implement this conversion.

Author:  rjdunnill [ Fri May 31, 2019 1:48 am ]
Post subject:  Re: Failure to retrieve text of PDF documents

Management approval would be required to share the code; I'll ask. Meanwhile, our addition will be a method on PdfPage that extracts condensed text from a page (without spaces), similar to IronPDF's ExtractTextFromPage() method. (ExtractTextFromPage() extracts the text, sans spaces, but doesn't work properly with Unicode-encoded documents.)
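
A minimal sketch of the "condensed text" idea, assuming the page text has already been extracted and decoded by some other step. The function names and the marker are invented for illustration; this is not the poster's implementation.

```python
def condense(text: str) -> str:
    """Remove all whitespace so matching ignores how glyphs were spaced."""
    return "".join(text.split())


def pages_with_marker(page_texts: list[str], marker: str) -> list[int]:
    """Return 1-based page numbers whose condensed text contains the marker."""
    needle = condense(marker)
    return [number for number, text in enumerate(page_texts, start=1)
            if needle in condense(text)]


if __name__ == "__main__":
    pages = ["Invoice 1", "## SPLIT ## next customer", "Totals"]
    print(pages_with_marker(pages, "##SPLIT##"))
```

Dropping whitespace on both sides makes the search tolerant of text that is drawn in several pieces, at the cost of possible false matches across word boundaries.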

With regard to adding this functionality to PdfSharp, is there currently any support inside PdfSharp for ToUnicode CMaps? Do I have to create my own class, or can I use an existing one within PdfSharp? And is there any internal support for parsing the ToUnicode maps?
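
For readers unfamiliar with ToUnicode CMaps, a hedged sketch of what parsing one involves. It covers only the bfchar and bfrange sections, assumes the CMap stream is already decompressed into text and well formed, and uses a made-up sample. The function names are illustrative and do not correspond to any PDFsharp class.

```python
import re

HEX = r"<([0-9A-Fa-f]+)>"
BFRANGE_ENTRY = re.compile(HEX + r"\s*" + HEX + r"\s*(?:" + HEX + r"|\[(.*?)\])", re.S)


def utf16_hex_to_str(hex_digits: str) -> str:
    """ToUnicode targets are UTF-16BE code units written as hex digits."""
    return bytes.fromhex(hex_digits).decode("utf-16-be")


def parse_tounicode(cmap_text: str) -> dict[int, str]:
    """Build a code -> string table from bfchar and bfrange sections."""
    mapping: dict[int, str] = {}
    # bfchar: individual <code> <unicode> pairs.
    for block in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(HEX + r"\s*" + HEX, block):
            mapping[int(src, 16)] = utf16_hex_to_str(dst)
    # bfrange: <first> <last> followed by a start value or an explicit array.
    for block in re.findall(r"beginbfrange(.*?)endbfrange", cmap_text, re.S):
        for first, last, start, array in BFRANGE_ENTRY.findall(block):
            lo, hi = int(first, 16), int(last, 16)
            if start:
                base = utf16_hex_to_str(start)
                for offset in range(hi - lo + 1):
                    # Simplification: bump the last character for each code.
                    mapping[lo + offset] = base[:-1] + chr(ord(base[-1]) + offset)
            else:
                targets = re.findall(HEX, array)
                for offset, dst in enumerate(targets[:hi - lo + 1]):
                    mapping[lo + offset] = utf16_hex_to_str(dst)
    return mapping


SAMPLE_CMAP = """
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
2 beginbfchar
<0003> <0020>
<0010> <002D>
endbfchar
1 beginbfrange
<0024> <0026> <0041>
endbfrange
"""

if __name__ == "__main__":
    table = parse_tounicode(SAMPLE_CMAP)
    codes = [0x24, 0x25, 0x10, 0x26]
    print("".join(table.get(code, "\ufffd") for code in codes))
```

In a real document each font carries its own ToUnicode stream, so a table like this would have to be built per font and selected according to the font in effect when a Tj operator is executed.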

All times are UTC