PDFsharp & MigraDoc Foundation https://forum.pdfsharp.net/ |
Reading a pdf document https://forum.pdfsharp.net/viewtopic.php?f=2&t=3284 |
Page 1 of 1 |
Author: | ymak [ Wed Feb 03, 2016 10:23 am ] |
Post subject: | Reading a pdf document |
Hello, I am looking for a library to open a PDF file and get a list of all the operators with their parameters. I understand that PDFsharp is for creating PDF documents, but I was wondering if there is a way to do that with PDFsharp? Thank you, Yannis |
Author: | Thomas Hoevel [ Wed Feb 03, 2016 10:41 am ] |
Post subject: | Re: Reading a pdf document |
Hi, Yannis, You can get a string containing all instructions and all parameters, but you have to do the parsing yourself. There is a third-party library that helps with text extraction; maybe that is a good starting point. https://www.nuget.org/packages/PdfTextract/ It is GPL and I don't know where you can find the source code. |
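To give a rough idea of what "do the parsing yourself" involves, here is a minimal sketch in Python that splits an already decompressed content stream string into operators and their operands. It is not PDFsharp code and not how PdfTextract does it; the names `tokenize` and `operators` are made up for this illustration. It assumes the stream is available as text, keeps array and dictionary brackets as flat tokens, and does not handle inline images (BI ... ID ... EI).

```python
import re

WHITESPACE = "\x00\t\n\x0c\r "
DELIMITERS = "()<>[]{}/%"
NUMBER = re.compile(r"[+-]?(\d+\.?\d*|\.\d+)")
CONSTANTS = {"true", "false", "null"}  # keywords that are operands, not operators


def tokenize(stream):
    """Yield (kind, text) tokens from an uncompressed content stream string."""
    i, n = 0, len(stream)
    while i < n:
        c = stream[i]
        if c in WHITESPACE:
            i += 1
        elif c == "%":                       # comment runs to end of line
            while i < n and stream[i] not in "\r\n":
                i += 1
        elif c == "(":                       # literal string, parentheses may nest
            start, depth = i, 0
            while i < n:
                if stream[i] == "\\":
                    i += 1                   # skip the escaped character
                elif stream[i] == "(":
                    depth += 1
                elif stream[i] == ")":
                    depth -= 1
                    if depth == 0:
                        break
                i += 1
            i += 1                           # step past the closing parenthesis
            yield "string", stream[start:i]
        elif stream.startswith("<<", i) or stream.startswith(">>", i):
            yield "delim", stream[i:i + 2]
            i += 2
        elif c == "<":                       # hex string
            end = stream.find(">", i)
            end = n - 1 if end < 0 else end
            yield "hexstring", stream[i:end + 1]
            i = end + 1
        elif c in "[]":
            yield "delim", c
            i += 1
        else:                                # name, number, constant or operator
            start = i
            i += 1
            while i < n and stream[i] not in WHITESPACE and stream[i] not in DELIMITERS:
                i += 1
            text = stream[start:i]
            if text.startswith("/"):
                kind = "name"
            elif NUMBER.fullmatch(text):
                kind = "number"
            elif text in CONSTANTS:
                kind = "constant"
            else:
                kind = "operator"
            yield kind, text


def operators(stream):
    """Yield (operator, operands) pairs; in PDF the operands come first."""
    operands = []
    for kind, text in tokenize(stream):
        if kind == "operator":
            yield text, operands
            operands = []
        else:
            operands.append(text)


# Hand-written sample stream, not taken from any real document.
sample = r"BT /F1 12 Tf 72 712 Td (Hello \(PDF\)) Tj [(A) -20 (B)] TJ ET"
ops = list(operators(sample))
for name, args in ops:
    print(name, args)
```

The strings that come out of `Tj` and `TJ` are still raw character codes; turning them into readable text depends on the font's encoding, which is a separate step.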
Author: | ymak [ Wed Feb 03, 2016 11:25 am ] |
Post subject: | Re: Reading a pdf document |
Thank you, Thomas! Do I have to build the string, or can I get it from PDFsharp? Is it a class property? |
Author: | Thomas Hoevel [ Wed Feb 03, 2016 11:51 am ] |
Post subject: | Re: Reading a pdf document |
ymak wrote: Do I have to build the string or can I get from PDFsharp? Is it a class property?
I don't know. PdfTextract does it. Look how they do it. |
Author: | ymak [ Wed Feb 03, 2016 12:54 pm ] |
Post subject: | Re: Reading a pdf document |
OK, thank you. |
Author: | gnauck [ Thu Sep 07, 2017 9:08 am ] |
Post subject: | Re: Reading a pdf document |
I tried to use the code from the PdfExtract package to read text. It works fine in many cases, but I have some PDFs where the output is weird. The output from the attached PDF is the following:
Code:
\0\u0003\0\u0004\0G\0O\0U\0R\0:\0\u0003\0\u000f\0R\0O\0O\0H\0+ \0\u0003\0\u001d\0W\0[\0H\07\0\u0003\0H\0O\0E\0D\0L\0U\0D\09 \0\u0003\0\u0003\0\u0004\0G\0O\0U\0R\0:\0\u0003\0\u000f\0R\0O\0O\0H\0+ \r\n
I assume those are encoding problems. I am a dummy when it comes to PDFs, so some guidance on how I could extract the text correctly with PDFsharp would be helpful. Thanks, Alex |
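Output like the one quoted above is what you typically see when a font uses two-byte character codes (for example a composite font with Identity-H encoding) and the extractor prints the raw codes instead of mapping them to Unicode. The proper mapping normally comes from the font's /ToUnicode CMap in the PDF. The Python sketch below only illustrates the idea: it splits the bytes into big-endian two-byte codes and looks each one up in a table. The table here is a hypothetical stand-in (code + 29), chosen purely because it happens to make the first fragment of the quoted output readable; it is an assumption about this one sample, not a general rule and not something confirmed from the attached PDF.

```python
def split_codes(raw, width=2):
    """Split a string operand into fixed-width big-endian character codes."""
    if len(raw) % width:
        raise ValueError("operand length is not a multiple of the code width")
    return [int.from_bytes(raw[i:i + width], "big") for i in range(0, len(raw), width)]


def decode(raw, to_unicode, width=2):
    """Map each code through a ToUnicode-style table; unknown codes become U+FFFD."""
    return "".join(to_unicode.get(code, "\ufffd") for code in split_codes(raw, width))


# Hypothetical stand-in for a real /ToUnicode CMap: code + 29 gives the
# Unicode code point for printable ASCII. A real extractor would read this
# table from the font dictionary instead of guessing an offset.
to_unicode = {code: chr(code + 29) for code in range(3, 98)}

# First fragment of the output quoted in the post, written as raw bytes.
sample = (b"\x00\x03\x00\x04\x00G\x00O\x00U\x00R\x00:"
          b"\x00\x03\x00\x0f\x00R\x00O\x00O\x00H\x00+")

decoded = decode(sample, to_unicode)
print(repr(decoded))        # characters in the order they appear in the paste
print(repr(decoded[::-1]))  # the same characters read right to left
```

With this guessed table the fragment decodes to " !dlroW ,olleH", so the characters in the paste appear in reverse reading order; why that is the case cannot be told from the post alone.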
Author: | atamich [ Wed Mar 10, 2021 4:14 pm ] |
Post subject: | Re: Reading a pdf document |
I had the same problem. https://www.nuget.org/packages/PdfSharpTextExtractor/ saved me time; it is a library for extracting text from any PDF. It solves the \u0003\0\u0004 problem too. |
All times are UTC |