PDFsharp & MigraDoc Foundation

PDFsharp - A .NET library for processing PDF & MigraDoc Foundation - Creating documents on the fly
All times are UTC


PostPosted: Fri May 17, 2019 6:17 am 
Joined: Fri May 17, 2019 6:00 am
Posts: 2
We use PDFsharp to process PDFs generated by Crystal Reports. Our processing opens each PDF document with PDFsharp and extracts the text from every page, searching it for specific text inserted during creation as delineation markers. This works fine with PDFs created by Crystal Reports versions prior to SP23; with documents created by SP23, however, the read returns garbled text, so the delineators cannot be found.

Chrome, Firefox and Beyond Compare can read these new documents without issue. What can be done to fix this problem so that we can continue to use PDFSharp for our processing?


PostPosted: Fri May 17, 2019 3:58 pm 
Joined: Fri May 17, 2019 6:00 am
Posts: 2
We use PDFsharp to process PDFs generated by Crystal Reports. Our processing opens each PDF document with PDFsharp and uses ContentReader.ReadContent() to extract the text from every page, searching it for specific text inserted during creation as delineation markers. This works fine with PDFs created by Crystal Reports versions prior to SP23; with documents created by SP23, however, the read returns garbled text, so the delineators cannot be found.

Chrome, Firefox and Beyond Compare can read these new documents without issue. What can be done to fix this problem so that we can continue to use PDFSharp for our processing?


PostPosted: Tue May 21, 2019 11:18 pm 
Joined: Wed May 15, 2019 8:30 pm
Posts: 3
On further analysis, this seems to happen because the strings passed to the document's Tj operators consist of indexes rather than ASCII characters.

Is there a setting or parameter that tells PdfSharp to interpret the document text as such?
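
To illustrate the observation above, here is a minimal sketch (in Python rather than C#) of turning a hex string operand of Tj into text once a code-to-character table is known. The two-byte code width, the function name decode_hex_operand, and the SAMPLE_MAP table are assumptions made for illustration; none of them is PDFsharp API, and the values are not taken from the Crystal Reports files discussed here.

```python
def decode_hex_operand(hex_operand: str, to_unicode: dict[int, str],
                       code_bytes: int = 2) -> str:
    """Decode a hex string operand such as '<002B0048>' via a code-to-text map."""
    # Drop the angle brackets and any whitespace inside the hex string.
    digits = "".join(hex_operand.strip().strip("<>").split())
    if len(digits) % 2:
        digits += "0"  # an odd final digit is treated as if followed by 0
    raw = bytes.fromhex(digits)
    out = []
    for i in range(0, len(raw), code_bytes):
        code = int.from_bytes(raw[i:i + code_bytes], "big")
        # Codes missing from the table become U+FFFD (replacement character).
        out.append(to_unicode.get(code, "\ufffd"))
    return "".join(out)


# Hypothetical table: each key is an index used in the content stream,
# each value is the text it stands for.
SAMPLE_MAP = {0x002B: "H", 0x0048: "e", 0x004F: "l", 0x0052: "o"}

print(decode_hex_operand("<002B0048004F004F0052>", SAMPLE_MAP))  # Hello
```

Read as raw bytes, the same operand would look like garbage, which matches the garbled output described in the first post.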


PostPosted: Wed May 22, 2019 8:59 am 
PDFsharp Guru
Joined: Mon Oct 16, 2006 8:16 am
Posts: 3092
Location: Cologne, Germany
rjdunnill wrote:
Is there a setting or parameter that tells PdfSharp to interpret the document text as such?
Since PDFsharp does not render PDF, it has only limited support for analyzing the instructions that draw the page.
Not my area of expertise, but I'm afraid you'll have to write code to decode the Tj parameters.
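
Before the Tj parameters can be decoded, they have to be located. The following is a rough sketch, in Python, of pulling the string operands of the Tj and TJ operators out of a content stream that has already been decompressed to text. It is regex-based and deliberately incomplete: nested unescaped parentheses in literal strings and the ' and " operators are not handled. The name show_text_operands and the SAMPLE stream are hypothetical.

```python
import re

# One string operand: hex form <...> or literal form (...) with backslash escapes.
_STRING = r"<[0-9A-Fa-f\s]*>|\((?:\\.|[^\\()])*\)"
_STRING_RE = re.compile(_STRING)

# Either "string Tj" or "[ ...strings and kerning numbers... ] TJ".
_SHOW_TEXT = re.compile(
    rf"(?P<single>{_STRING})\s*Tj|\[(?P<array>(?:{_STRING}|[^\]])*?)\]\s*TJ"
)


def show_text_operands(content: str) -> list[str]:
    """Return the raw string operands of Tj/TJ in stream order."""
    operands = []
    for match in _SHOW_TEXT.finditer(content):
        if match.group("single") is not None:
            operands.append(match.group("single"))
        else:
            # A TJ array mixes strings with numeric kerning adjustments;
            # keep only the strings.
            operands.extend(_STRING_RE.findall(match.group("array")))
    return operands


SAMPLE = "BT /F1 12 Tf 72 700 Td <002B0048> Tj [(A) -120 <0052>] TJ ET"
print(show_text_operands(SAMPLE))  # ['<002B0048>', '(A)', '<0052>']
```

A real implementation would more likely walk a parsed operator sequence than run a regex over raw text; the regex only keeps the sketch short.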

_________________
Regards
Thomas Hoevel
PDFsharp Team


PostPosted: Wed May 22, 2019 5:15 pm 
Joined: Wed May 15, 2019 8:30 pm
Posts: 3
Our algorithm opens the PDF document, and reads the content of each page, searching said content for a particular tag. Our problem is that the new-format documents are Unicode, and hence the read content consists of indexes and not text. Shouldn't PDFSharp be internally converting the indexes to their respective characters?


PostPosted: Thu May 23, 2019 8:13 am 
PDFsharp Guru
Joined: Mon Oct 16, 2006 8:16 am
Posts: 3092
Location: Cologne, Germany
rjdunnill wrote:
Shouldn't PDFsharp be internally converting the indexes to their respective characters?
I fully understand that this would be convenient for you, but since PDFsharp does not do anything with the strings (yet), this functionality is not yet included in PDFsharp.
Feel free to share your code if you implement this conversion.

_________________
Regards
Thomas Hoevel
PDFsharp Team


PostPosted: Fri May 31, 2019 1:48 am 
Joined: Wed May 15, 2019 8:30 pm
Posts: 3
Management approval would be required to share the code; I'll ask. Meanwhile, our change will add a method to PdfPage that extracts condensed text (text without spaces) from a page, similar to IronPDF's ExtractTextFromPage() method. (ExtractTextFromPage() extracts the text, sans spaces, but doesn't work properly with Unicode-encoded documents.)
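
As a small illustration of why condensed text helps with marker searches, here is a Python sketch that strips all whitespace from both the page text and the marker before comparing. The function names and the ##SPLIT## marker are hypothetical, and the page texts are assumed to be already decoded.

```python
def condense(text: str) -> str:
    """Remove every whitespace character from the text."""
    return "".join(text.split())


def pages_with_marker(page_texts: list[str], marker: str) -> list[int]:
    """Return the 0-based indexes of the pages whose condensed text contains the marker."""
    needle = condense(marker)
    return [i for i, text in enumerate(page_texts) if needle in condense(text)]


pages = ["Invoice 1  ## SPLIT ## total", "no marker here", "##SP LIT##"]
print(pages_with_marker(pages, "##SPLIT##"))  # [0, 2]
```

Comparing condensed strings means the search does not depend on how the PDF producer split the marker into separate text-showing operations.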

With regard to adding this functionality to PDFsharp: is there currently any support inside PDFsharp for ToUnicode CMaps? Do I have to create my own class, or can I use an existing one within PDFsharp? And is there any internal support for parsing the ToUnicode maps?
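
For readers unfamiliar with ToUnicode CMaps, the following Python sketch shows roughly what parsing one involves. It assumes the CMap stream is already decompressed to text and handles only the bfchar and bfrange sections, with destinations given as UTF-16BE hex strings of even length. The name parse_to_unicode and the SAMPLE_CMAP content are made up for illustration and say nothing about what PDFsharp provides internally.

```python
import re

_HEX = r"<([0-9A-Fa-f]+)>"


def _utf16be(hex_digits: str) -> str:
    """Decode a destination hex string as UTF-16BE text."""
    return bytes.fromhex(hex_digits).decode("utf-16-be")


def parse_to_unicode(cmap_text: str) -> dict[int, str]:
    """Build a code -> text table from bfchar and bfrange sections."""
    mapping: dict[int, str] = {}
    # bfchar: pairs of <source code> <destination text>.
    for block in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(rf"{_HEX}\s*{_HEX}", block):
            mapping[int(src, 16)] = _utf16be(dst)
    # bfrange: <lo> <hi> <dst>  or  <lo> <hi> [<dst1> <dst2> ...].
    entry = re.compile(rf"{_HEX}\s*{_HEX}\s*(?:{_HEX}|\[(.*?)\])", re.S)
    for block in re.findall(r"beginbfrange(.*?)endbfrange", cmap_text, re.S):
        for lo, hi, dst, array in entry.findall(block):
            codes = range(int(lo, 16), int(hi, 16) + 1)
            if dst:
                # Single destination: bump its last character once per code.
                first = _utf16be(dst)
                for offset, code in enumerate(codes):
                    mapping[code] = first[:-1] + chr(ord(first[-1]) + offset)
            else:
                # Array form: one destination per code in the range.
                for code, item in zip(codes, re.findall(_HEX, array)):
                    mapping[code] = _utf16be(item)
    return mapping


SAMPLE_CMAP = """
begincmap
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
2 beginbfchar
<002B> <0048>
<0052> <006F>
endbfchar
2 beginbfrange
<0044> <0046> <0061>
<0050> <0051> [<0078> <0079>]
endbfrange
endcmap
"""

table = parse_to_unicode(SAMPLE_CMAP)
print(table[0x002B], table[0x0045], table[0x0051])  # H b y
```

The resulting table is what a decoder would consult for each code read from a Tj or TJ operand, using the ToUnicode stream of whichever font is active at that point in the content stream.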

