PDFsharp & MigraDoc Foundation • View topic

All times are UTC

Reading a pdf document

Moderator: Stefan Lange

ymak

Post subject: Reading a pdf document

Posted: Wed Feb 03, 2016 10:23 am

Joined: Wed Feb 03, 2016 10:12 am
Posts: 3

Hello,

I am looking for a library that can open a PDF file and return a list of all the operators with their parameters.
I understand that PDFsharp is mainly for creating PDF documents, but I was wondering whether there is a way to do that with PDFsharp.

Thank you
Yannis

Thomas Hoevel

Post subject: Re: Reading a pdf document

Posted: Wed Feb 03, 2016 10:41 am

PDFsharp Guru

Joined: Mon Oct 16, 2006 8:16 am
Posts: 3096
Location: Cologne, Germany

Hi, Yannis,

You can get a string containing all instructions and all parameters, but you have to do the parsing yourself.
There is a third-party library that helps with text extraction; maybe that is a good starting point.
https://www.nuget.org/packages/PdfTextract/
It is GPL and I don't know where you can find the source code.
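
To illustrate what "doing the parsing yourself" involves: in a PDF content stream, the operands come first and the operator keyword follows them. The sketch below splits such a stream into (operator, operands) pairs. It is written in Python purely for brevity and uses no PDFsharp API; the names `tokenize` and `operators` are made up for this example. It assumes the stream has already been obtained as decompressed bytes, leaves strings undecoded, and does not handle inline images (BI/ID/EI), so treat it as a starting point rather than a complete parser.

```python
import re
from typing import Iterator, List, Tuple

WHITESPACE = b"\x00\t\n\x0c\r "
DELIMS = b"()<>[]{}/%"
NUMBER = re.compile(rb"[+-]?(?:\d+\.?\d*|\.\d+)")


def tokenize(data: bytes) -> Iterator[bytes]:
    """Yield raw tokens from a (decompressed) content stream."""
    i, n = 0, len(data)
    while i < n:
        ch = data[i]
        if ch in WHITESPACE:
            i += 1
        elif ch == ord("%"):                 # comment runs to end of line
            while i < n and data[i] not in b"\r\n":
                i += 1
        elif ch == ord("("):                 # literal string: may nest and escape
            depth, j = 1, i + 1
            while j < n and depth:
                if data[j] == ord("\\"):
                    j += 1                   # skip the escaped byte
                elif data[j] == ord("("):
                    depth += 1
                elif data[j] == ord(")"):
                    depth -= 1
                j += 1
            yield data[i:j]
            i = j
        elif data[i:i + 2] in (b"<<", b">>"):  # dictionary delimiters
            yield data[i:i + 2]
            i += 2
        elif ch == ord("<"):                 # hex string
            j = data.find(b">", i)
            j = n if j < 0 else j + 1
            yield data[i:j]
            i = j
        elif ch in b"[]{}":
            yield data[i:i + 1]
            i += 1
        else:                                # name, number, or operator keyword
            j = i + 1
            while j < n and data[j] not in WHITESPACE and data[j] not in DELIMS:
                j += 1
            yield data[i:j]
            i = j


def operators(data: bytes) -> Iterator[Tuple[str, List[bytes]]]:
    """Group tokens into (operator, operands); operands precede the operator."""
    operands: List[bytes] = []
    for tok in tokenize(data):
        is_operand = (
            tok[0] in DELIMS
            or NUMBER.fullmatch(tok) is not None
            or tok in (b"true", b"false", b"null")
        )
        if is_operand:
            operands.append(tok)
        else:
            yield tok.decode("latin-1"), operands
            operands = []


if __name__ == "__main__":
    # Hypothetical content stream, made up for this example.
    sample = b"BT /F1 12 Tf 72 720 Td (Hello) Tj [(A) -20 <0048>] TJ ET"
    for name, args in operators(sample):
        print(name, args)
```

Array brackets are kept as flat tokens in the operand list instead of being nested, which keeps the sketch short; a fuller parser would build nested arrays and dictionaries.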

_________________
Regards
Thomas Hoevel
PDFsharp Team

ymak

Post subject: Re: Reading a pdf document

Posted: Wed Feb 03, 2016 11:25 am

Joined: Wed Feb 03, 2016 10:12 am
Posts: 3

Thank you Thomas!
Do I have to build the string myself, or can I get it from PDFsharp?
Is it a class property?

Thomas Hoevel

Post subject: Re: Reading a pdf document

Posted: Wed Feb 03, 2016 11:51 am

PDFsharp Guru

Joined: Mon Oct 16, 2006 8:16 am
Posts: 3096
Location: Cologne, Germany

ymak wrote:

Do I have to build the string myself, or can I get it from PDFsharp?
Is it a class property?

I don't know. PdfTextract does it. Look at how they do it.

_________________
Regards
Thomas Hoevel
PDFsharp Team

ymak

Post subject: Re: Reading a pdf document

Posted: Wed Feb 03, 2016 12:54 pm

Joined: Wed Feb 03, 2016 10:12 am
Posts: 3

ok
Thank you

gnauck

Post subject: Re: Reading a pdf document

Posted: Thu Sep 07, 2017 9:08 am

Joined: Tue Aug 08, 2017 2:57 pm
Posts: 2

I tried to use the code from the PdfTextract package to read text.
It works fine in many cases, but I have some PDFs where the output is weird.

The output from the attached PDF is the following:

Code:

\0\u0003\0\u0004\0G\0O\0U\0R\0:\0\u0003\0\u000f\0R\0O\0O\0H\0+ 
\0\u0003\0\u001d\0W\0[\0H\07\0\u0003\0H\0O\0E\0D\0L\0U\0D\09 \0\u0003\0\u0003\0\u0004\0G\0O\0U\0R\0:\0\u0003\0\u000f\0R\0O\0O\0H\0+ \r\n

I assume those are encoding problems.
I am a beginner when it comes to PDFs, so some guidance on how to extract the text correctly with PDFsharp would be helpful.

Thanks,
Alex
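
A note on the output above: it reads as a sequence of two-byte codes, not as a damaged Unicode string. In this particular sample, every code equals the ASCII value of a character minus 29, and the characters appear in reverse order. That would be consistent with the codes being glyph indices rather than character codes, although nothing in this thread confirms it. The Python sketch below undoes both steps for this one sample. The offset of 29 and the reversal are observations about this output only, not a general rule, and `decode_sample` is a made-up helper name; a robust extractor would map the codes through the font's ToUnicode CMap.

```python
def decode_sample(raw: str, offset: int = 29) -> str:
    """Decode one chunk of the sample output (assumes offset 29, reversed order)."""
    data = raw.encode("latin-1")
    # Read the bytes as big-endian two-byte codes.
    codes = [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data) - 1, 2)]
    # Shift each code by the observed offset, then undo the reversal.
    return "".join(chr(code + offset) for code in codes)[::-1]


# First chunk of the output, copied from the post.
chunk1 = "\0\u0003\0\u0004\0G\0O\0U\0R\0:\0\u0003\0\u000f\0R\0O\0O\0H\0+"
# Second chunk. The NUL before '7' is spelled \x00 because Python would
# read \07 as an octal escape.
chunk2 = "\0\u0003\0\u001d\0W\0[\0H\x007\0\u0003\0H\0O\0E\0D\0L\0U\0D\09"

if __name__ == "__main__":
    print(repr(decode_sample(chunk1)))  # 'Hello, World! '
    print(repr(decode_sample(chunk2)))  # 'Variable Text: '
```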

Attachments:

hello-world.zip [28.05 KiB]

atamich

Post subject: Re: Reading a pdf document

Posted: Wed Mar 10, 2021 4:14 pm

Joined: Wed Mar 10, 2021 4:10 pm
Posts: 1

I had the same problem. https://www.nuget.org/packages/PdfSharpTextExtractor/ saved me a lot of time; it is a library for extracting text from any PDF. It solves the \u0003\0\u0004 problem too.
