PDFsharp & MigraDoc Foundation

PDFsharp - A .NET library for processing PDF & MigraDoc Foundation - Creating documents on the fly

All times are UTC


 Post subject: Reading a pdf document
Posted: Wed Feb 03, 2016 10:23 am
Hello,

I am looking for a library that can open a PDF file and return a list of all the operators with their parameters.
I understand that PDFsharp is for creating PDF documents, but I was wondering whether there is a way to do that with PDFsharp?

Thank you
Yannis


Posted: Wed Feb 03, 2016 10:41 am
Hi, Yannis,

You can get a string containing all instructions and all parameters, but you have to do the parsing yourself.
There is a third-party library that helps with text extraction; maybe that is a good starting point.
https://www.nuget.org/packages/PdfTextract/
It is GPL and I don't know where you can find the source code.

_________________
Regards
Thomas Hoevel
PDFsharp Team
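
To illustrate what "doing the parsing yourself" can involve, here is a minimal sketch in Python (not PDFsharp code) that splits an already-decoded content stream into operators and their operands. The names `TOKEN` and `parse_content_stream` are hypothetical, the input is assumed to be well formed, and nested unescaped parentheses in strings and inline images (BI/ID/EI) are deliberately not handled. Obtaining the stream bytes from a page is a separate step that is not shown.

```python
import re

# One alternative per token kind that can appear in a PDF content stream.
TOKEN = re.compile(
    rb"""
    (?P<comment>%[^\r\n]*)
  | (?P<string>\((?:\\.|[^\\()])*\))
  | (?P<delim><<|>>|\[|\])
  | (?P<hex><[0-9A-Fa-f\s]*>)
  | (?P<name>/[^\s()<>\[\]{}/%]*)
  | (?P<number>[+-]?(?:\d+\.?\d*|\.\d+))
  | (?P<keyword>[^\s()<>\[\]{}/%]+)
    """,
    re.VERBOSE | re.DOTALL,
)

# Keywords that are operands rather than operators.
LITERALS = ("true", "false", "null")


def parse_content_stream(data):
    """Return a list of (operator, operands) pairs in stream order."""
    instructions, operands = [], []
    stack = []  # operand lists suspended while an array or dictionary is open
    for match in TOKEN.finditer(data):
        kind = match.lastgroup
        text = match.group().decode("latin-1")
        if kind == "comment":
            continue
        if kind == "delim" and text in ("[", "<<"):
            # Start collecting the elements of a nested array or dictionary.
            stack.append(operands)
            operands = []
        elif kind == "delim":
            # Close the nested container and attach it to its parent list.
            closed, operands = operands, stack.pop()
            operands.append(closed)
        elif kind == "keyword" and not stack and text not in LITERALS:
            # An operator consumes every operand collected since the last one.
            instructions.append((text, operands))
            operands = []
        else:
            operands.append(text)
    return instructions


sample = b"BT /F1 12 Tf 72 712 Td (Hello, World!) Tj [(A) -120 (B)] TJ ET"
for op, args in parse_content_stream(sample):
    print(op, args)
```

Operands are kept as raw token text (for example `"12"` or `"(Hello, World!)"`), so converting numbers and decoding strings is left to the caller.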


Posted: Wed Feb 03, 2016 11:25 am
Thank you Thomas!
Do I have to build the string myself, or can I get it from PDFsharp?
Is it a class property?


Posted: Wed Feb 03, 2016 11:51 am
ymak wrote:
Do I have to build the string or can I get it from PDFsharp?
Is it a class property?

I don't know. PdfTextract does it. Look at how they do it.

_________________
Regards
Thomas Hoevel
PDFsharp Team


Posted: Wed Feb 03, 2016 12:54 pm
ok
Thank you


Posted: Thu Sep 07, 2017 9:08 am
I tried to use the code from the PdfTextract package to read text.
It works fine in many cases, but I have some PDFs where the output is garbled.

The output from the attached PDF is the following:

Code:
\0\u0003\0\u0004\0G\0O\0U\0R\0:\0\u0003\0\u000f\0R\0O\0O\0H\0+
\0\u0003\0\u001d\0W\0[\0H\07\0\u0003\0H\0O\0E\0D\0L\0U\0D\09 \0\u0003\0\u0003\0\u0004\0G\0O\0U\0R\0:\0\u0003\0\u000f\0R\0O\0O\0H\0+ \r\n


I assume those are encoding problems.
I am a beginner when it comes to PDFs, so some guidance on how I could extract the text correctly with PDFsharp would be helpful.

Thanks,
Alex


Attachments:
hello-world.zip [28.05 KiB]
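
The output quoted above has a `\0` before every character, which is consistent with two-byte character codes being printed as if they were text. The usual way to turn such codes into text is a lookup table taken from the font's /ToUnicode CMap. The Python sketch below shows only that lookup step; `decode_two_byte_codes` and `STAND_IN_MAP` are hypothetical names, and the table is a stand-in based on an observation about this sample (each code is 29 below the ASCII value it stands for), not something to rely on for other fonts or files.

```python
import struct


def decode_two_byte_codes(raw, to_unicode):
    """Map big-endian 2-byte character codes to text; U+FFFD marks unmapped codes."""
    if len(raw) % 2:
        raise ValueError("expected an even number of bytes")
    codes = struct.unpack(">%dH" % (len(raw) // 2), raw)
    return "".join(to_unicode.get(code, "\ufffd") for code in codes)


# Stand-in mapping (assumption): in the posted output each code is 29 below the
# ASCII value it represents. A real extractor would build this table from the
# font's /ToUnicode CMap rather than from a fixed offset.
STAND_IN_MAP = {code: chr(code + 29) for code in range(3, 98)}

# The first 14 codes of the posted output, written as raw bytes.
posted = (b"\x00\x03\x00\x04\x00G\x00O\x00U\x00R\x00:"
          b"\x00\x03\x00\x0f\x00R\x00O\x00O\x00H\x00+")

decoded = decode_two_byte_codes(posted, STAND_IN_MAP)
print(repr(decoded))        # characters in the order the extractor emitted them
print(repr(decoded[::-1]))  # the same characters in reverse order
```

With this stand-in table the first line of the posted output reads "Hello, World!" once the characters are reversed. Why the extractor emits them back to front is not clear from the thread.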
Posted: Wed Mar 10, 2021 4:14 pm
I had the same problem. https://www.nuget.org/packages/PdfSharpTextExtractor/ saved me time: it is a library to extract text from any PDF, and it solves the \u0003\0\u0004 problem too.

