PDFsharp & MigraDoc Foundation
https://forum.pdfsharp.net/

Reading a pdf document
https://forum.pdfsharp.net/viewtopic.php?f=2&t=3284

Author:  ymak [ Wed Feb 03, 2016 10:23 am ]
Post subject:  Reading a pdf document

Hello,

I am looking for a library that can open a PDF file and return a list of all the operators with their parameters.
I understand that PDFsharp is for creating PDF documents, but I was wondering if there is a way to do that with PDFsharp?

Thank you
Yannis

Author:  Thomas Hoevel [ Wed Feb 03, 2016 10:41 am ]
Post subject:  Re: Reading a pdf document

Hi, Yannis,

You can get a string containing all instructions and all parameters, but you have to do the parsing yourself.
There is a third-party library that helps with text extraction; maybe that is a good starting point.
https://www.nuget.org/packages/PdfTextract/
It is GPL and I don't know where you can find the source code.
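
To illustrate what "doing the parsing yourself" involves: a PDF content stream is written in postfix order, so each operator follows its operands. The following is a minimal Python sketch, not PDFsharp code. It assumes you already have the decoded content stream as a string, and it handles only numbers, names, simple strings, and arrays. The names `TOKEN` and `parse_content_stream` are made up for this example; comments, dictionaries, inline images, and nested parentheses inside strings are not handled.

```python
import re

# Token patterns for a simplified subset of the content stream syntax.
TOKEN = re.compile(r"""
    \((?:\\.|[^\\()])*\)          # literal string without nested parentheses
  | <[0-9A-Fa-f\s]*>              # hex string
  | \[ | \]                       # array delimiters
  | /[^\s/\[\]()<>]*              # name, e.g. /F1
  | [+-]?(?:\d+\.?\d*|\.\d+)      # integer or real number
  | [A-Za-z'"*][A-Za-z0-9'"*]*    # operator keyword, e.g. Tf, TJ, T*
""", re.VERBOSE)


def parse_content_stream(text):
    """Return a list of (operator, operands) tuples in stream order."""
    instructions = []
    operands = []      # operands collected for the next operator
    array_stack = []   # arrays that are currently open, innermost last
    for match in TOKEN.finditer(text):
        tok = match.group(0)
        target = array_stack[-1] if array_stack else operands
        if tok == "[":
            array_stack.append([])
        elif tok == "]":
            finished = array_stack.pop()
            (array_stack[-1] if array_stack else operands).append(finished)
        elif tok[0] in "(</+-.0123456789" or tok in ("true", "false", "null"):
            target.append(tok)            # operand: keep the raw token text
        else:
            instructions.append((tok, operands))  # operator closes the group
            operands = []
    return instructions


sample = "BT /F1 12 Tf 72 720 Td (Hello, World!) Tj [(A) -120 (B)] TJ ET"
for op, args in parse_content_stream(sample):
    print(op, args)
```

Operands are kept as raw token text here, so converting numbers or unescaping strings is left to the caller.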

Author:  ymak [ Wed Feb 03, 2016 11:25 am ]
Post subject:  Re: Reading a pdf document

Thank you Thomas!
Do I have to build the string myself, or can I get it from PDFsharp?
Is it a class property?

Author:  Thomas Hoevel [ Wed Feb 03, 2016 11:51 am ]
Post subject:  Re: Reading a pdf document

ymak wrote:
Do I have to build the string or can I get it from PDFsharp?
Is it a class property?

I don't know. PdfTextract does it. Look at how they do it.

Author:  ymak [ Wed Feb 03, 2016 12:54 pm ]
Post subject:  Re: Reading a pdf document

ok
Thank you

Author:  gnauck [ Thu Sep 07, 2017 9:08 am ]
Post subject:  Re: Reading a pdf document

I tried to use the code from the PdfTextract package to read text.
It works fine in many cases, but I have some PDFs where the output is weird.

The output from the attached PDF is the following:

Code:
\0\u0003\0\u0004\0G\0O\0U\0R\0:\0\u0003\0\u000f\0R\0O\0O\0H\0+
\0\u0003\0\u001d\0W\0[\0H\07\0\u0003\0H\0O\0E\0D\0L\0U\0D\09 \0\u0003\0\u0003\0\u0004\0G\0O\0U\0R\0:\0\u0003\0\u000f\0R\0O\0O\0H\0+ \r\n


I assume those are encoding problems.
I am a dummy when it comes to PDFs, so some guidance on how to extract the text correctly with PDFsharp would be helpful.

Thanks,
Alex

Attachments:
hello-world.zip [28.05 KiB]
Downloaded 415 times
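
A note on the output quoted above: the characters look like 2-byte codes rather than plain text, which is what you would expect from a font embedded with a 2-byte encoding such as Identity-H. In that case the reliable fix is to translate the codes through the font's /ToUnicode map. For the first line of this particular sample, adding 29 to each code and reversing the result happens to give readable text. The offset and the reversal are observations about this one sample only, not a general rule. A minimal Python sketch (not PDFsharp code; `split_codes` and `decode_with_offset` are names made up here):

```python
import struct

# First line of the output quoted above, written as raw bytes.
SAMPLE = (b"\x00\x03\x00\x04\x00G\x00O\x00U\x00R\x00:"
          b"\x00\x03\x00\x0f\x00R\x00O\x00O\x00H\x00+")

# Assumption: this offset fits the sample above only. Other fonts need
# a lookup through their /ToUnicode map instead of a fixed offset.
GLYPH_OFFSET = 29


def split_codes(raw):
    """Split raw bytes into big-endian 16-bit codes."""
    count = len(raw) // 2
    return list(struct.unpack(">%dH" % count, raw[:count * 2]))


def decode_with_offset(raw, offset=GLYPH_OFFSET):
    """Map each 2-byte code to a character by adding a fixed offset."""
    return "".join(chr(code + offset) for code in split_codes(raw))


decoded = decode_with_offset(SAMPLE)
print(repr(decoded))        # text in the order the extractor produced it
print(repr(decoded[::-1]))  # same text reversed
```

If a fixed offset does not give sensible text for another document, that is expected; the mapping depends on how the font was embedded.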

Author:  atamich [ Wed Mar 10, 2021 4:14 pm ]
Post subject:  Re: Reading a pdf document

I had the same problem. https://www.nuget.org/packages/PdfSharpTextExtractor/ saved me time; it is a library for extracting text from any PDF. It solves the \u0003\0\u0004 problem too.

All times are UTC