PDFsharp & MigraDoc Foundation https://forum.pdfsharp.net/

Reading a pdf document https://forum.pdfsharp.net/viewtopic.php?f=2&t=3284

Author:	ymak [ Wed Feb 03, 2016 10:23 am ]
Post subject:	Reading a pdf document
Hello, I am looking for a library to open a PDF file and get a list of all the operators with their parameters. I understand that PDFsharp is for creating PDF documents, but I was wondering if there is a way to do that with PDFsharp? Thank you, Yannis

Author:	Thomas Hoevel [ Wed Feb 03, 2016 10:41 am ]
Post subject:	Re: Reading a pdf document
Hi, Yannis, You can get a string containing all instructions and all parameters, but you have to do the parsing yourself. There is a third-party library that helps with text extraction; maybe that is a good starting point. https://www.nuget.org/packages/PdfTextract/ It is GPL and I don't know where you can find the source code.
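To give an idea of what "do the parsing yourself" involves: in a content stream the operands come first and the operator that consumes them comes last, so a parser collects operands until it reaches an operator. Below is a minimal sketch in Python under that assumption. `TOKEN`, `parse_operations`, and the sample stream are made-up names and data for illustration, not part of PDFsharp or PdfTextract, and the tokenizer deliberately ignores comments, dictionaries, inline images, and nested parentheses inside strings.

```python
import re

# One alternative per token kind found in a simple content stream.
TOKEN = re.compile(r"""
    \((?:\\.|[^\\()])*\)         # literal string, e.g. (Hello)
  | <[0-9A-Fa-f\s]*>             # hex string, e.g. <0048>
  | [\[\]]                       # array delimiters
  | /[^\s/\[\]()<>]*             # name, e.g. /F1
  | [+-]?(?:\d+\.?\d*|\.\d+)     # integer or real number
  | [A-Za-z'"*]+[0-9]*           # operator, e.g. Tf, T*, d0
""", re.VERBOSE)


def parse_operations(content):
    """Return a list of (operator, operands) pairs in stream order."""
    operations = []
    operands = []      # operands collected for the next operator
    open_arrays = []   # outer operand lists saved while inside [ ... ]
    for match in TOKEN.finditer(content):
        token = match.group()
        if token == "[":
            open_arrays.append(operands)
            operands = []
        elif token == "]":
            array, operands = operands, open_arrays.pop()
            operands.append(array)
        elif token[0] in "(</":
            operands.append(token)  # keep strings and names as raw text
        elif token[0] in "+-.0123456789":
            operands.append(float(token) if "." in token else int(token))
        else:
            # Anything else is treated as an operator that ends the operation.
            operations.append((token, operands))
            operands = []
    return operations


# Hypothetical content stream, not taken from a real document.
sample = "BT /F1 12 Tf 72 712 Td [(Hel) -20 (lo)] TJ ET"
for operator, operands in parse_operations(sample):
    print(operator, operands)
```

Each printed line is one operator followed by its operand list, which is the "list of all the operators with their parameters" asked for above. A real parser would also have to decode the string operands using the encoding of the currently selected font.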

Author:	ymak [ Wed Feb 03, 2016 11:25 am ]
Post subject:	Re: Reading a pdf document
Thank you, Thomas! Do I have to build the string myself, or can I get it from PDFsharp? Is it a class property?

Author:	Thomas Hoevel [ Wed Feb 03, 2016 11:51 am ]
Post subject:	Re: Reading a pdf document
ymak wrote:
Do I have to build the string or can I get from PDFsharp? Is it a class property?

I don't know. PdfTextract does it. Look at how they do it.

Author:	ymak [ Wed Feb 03, 2016 12:54 pm ]
Post subject:	Re: Reading a pdf document
OK, thank you.

Author:	gnauck [ Thu Sep 07, 2017 9:08 am ]
Post subject:	Re: Reading a pdf document

I tried to use the code from the PdfTextract package to read text.
It works fine in many cases, but I have some PDFs where the output is garbled.

The output from the attached PDF is the following:

Code:

\0\u0003\0\u0004\0G\0O\0U\0R\0:\0\u0003\0\u000f\0R\0O\0O\0H\0+ 
\0\u0003\0\u001d\0W\0[\0H\07\0\u0003\0H\0O\0E\0D\0L\0U\0D\09 \0\u0003\0\u0003\0\u0004\0G\0O\0U\0R\0:\0\u0003\0\u000f\0R\0O\0O\0H\0+ \r\n

I assume those are encoding problems.
I am a dummy when it comes to PDFs, so some guidance on how I could extract the text correctly with PDFsharp would be helpful.

Thanks,
Alex

Attachments:

hello-world.zip [28.05 KiB]
Downloaded 677 times
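
The output quoted above can be read as a sequence of 2-byte character codes rather than as text: every second character is \0, which is consistent with a font that uses a 2-byte encoding (such as Identity-H) whose codes were not translated through the font's ToUnicode CMap. For this particular sample, adding 29 to each code happens to give readable ASCII in reverse order. The Python sketch below shows only that mapping step. `two_byte_codes`, `decode`, and the `to_unicode` table are illustrative names, and the fixed offset is an observation about this one sample, standing in for a real ToUnicode CMap that an extractor would read from the PDF.

```python
# First line of the quoted output, rewritten with explicit \x escapes
# (the trailing space is left out).
raw = ("\x00\x03\x00\x04\x00G\x00O\x00U\x00R\x00:"
       "\x00\x03\x00\x0f\x00R\x00O\x00O\x00H\x00+")


def two_byte_codes(text):
    """Regroup a string of single-byte characters into big-endian 16-bit codes."""
    data = text.encode("latin-1")
    if len(data) % 2:
        raise ValueError("expected an even number of bytes")
    return [(hi << 8) | lo for hi, lo in zip(data[0::2], data[1::2])]


def decode(codes, to_unicode):
    """Map each code through the table; unknown codes become U+FFFD."""
    return "".join(to_unicode.get(code, "\ufffd") for code in codes)


# Assumption: stand-in for the font's ToUnicode CMap, built from the
# offset of 29 observed in this sample. Not a general rule.
to_unicode = {code: chr(code + 29) for code in range(3, 98)}

codes = two_byte_codes(raw)
decoded = decode(codes, to_unicode)
print(repr(decoded))        # characters in the order they were extracted
print(repr(decoded[::-1]))  # reversed, which is the readable order here
```

With a different font the table would differ, so the general fix is to look up the ToUnicode entry of the font that is active when the string is shown instead of hard-coding an offset.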

Author:	atamich [ Wed Mar 10, 2021 4:14 pm ]
Post subject:	Re: Reading a pdf document
I had the same problem. https://www.nuget.org/packages/PdfSharpTextExtractor/ saved me time; it is a library for extracting text from any PDF. It solves the \u0003\0\u0004 problem too.

All times are UTC