PDFsharp & MigraDoc Foundation :: View topic - Problem for getting text from PDF file

Author:	robatjazi [ Wed Sep 12, 2012 10:26 pm ]
Post subject:	Problem for getting text from PDF file
I want to extract data from a PDF file. The following is my code:

Code:
    PdfDocument document = PdfReader.Open(sTempFile, PdfDocumentOpenMode.ReadOnly);

    int pageNo = -1;
    string strStreamValue;
    byte[] streamValue;
    foreach (PdfPage page in document.Pages)
    {
        pageNo++;
        strStreamValue = "";

        // put the stream value for every element on the page in a string variable.
        for (int i = 0; i < page.Contents.Elements.Count; i++)
        {
            PdfDictionary.PdfStream stream = page.Contents.Elements.GetDictionary(i).Stream;
            streamValue = stream.Value;
            foreach (byte b in streamValue)
            {
                strStreamValue += (char)b;
            }
        }
    }

But what I get is unreadable text like this:

    0.05 0 0 -0.05 0 792 cm
    0 0 0 rg
    0 0 0 RG
    2 J
    30 w
    357 662 m
    . . .
    (!"#$%&'\('\)'%+,\)-.) Tj
    0 -240 Td
    (/012) Tj
    0 -240 Td
    (,\)-%345%$6%748'549') Tj
    3840 480 Td
    (&'\('\)') Tj
    0 -1020 Td
    (:;'5) Tj
    0 -270 Td
    (:5<\)') Tj

Is there an encoding for PDF files? Does anyone know how I can fix this problem?

Thank you in advance.

Author:	Thomas Hoevel [ Thu Sep 13, 2012 8:19 am ]
Post subject:	Re: Problem for getting text from PDF file
I clicked on "Search" and found these topics: viewtopic.php?p=1603#p1603 viewtopic.php?p=5556#p5556

Author:	robatjazi [ Thu Sep 13, 2012 1:56 pm ]
Post subject:	Re: Problem for getting text from PDF file
Hi Thomas,

Thank you for your reply. I looked at those links and used one of them. The result was the same. In this line of my code:

Code:
    PdfDictionary.PdfStream stream = page.Contents.Elements.GetDictionary(i).Stream;

I get unreadable text like:

    (!"#$%&'\('\)'%+,\)-.) Tj

As far as I know, the Tj command carries text data, but that text data is not readable. My questions are: Does PDFsharp handle the encoding of a PDF file? And how can I get the real text from a PDF file using PDFsharp?

Regards,
Majid

Author:	Thomas Hoevel [ Thu Sep 13, 2012 2:12 pm ]
Post subject:	Re: Problem for getting text from PDF file
Hi, Majid, PDFsharp does not decode the text for you. BTW: some PDF files are not properly encoded. Try to select text in your file using Adobe Reader, copy it to the clipboard and paste it to e.g. Notepad. Do you get the correct text here? If not: malformed PDF file.
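To illustrate what "decoding" involves here: the bytes inside the parentheses before Tj are character codes, and their meaning depends on the font selected at that point (its encoding and, if present, its ToUnicode CMap). The following is only a minimal sketch, not PDFsharp API: the class name TjStrings and the method Extract are hypothetical, and the code assumes an uncompressed content stream. It pulls out the literal strings that are directly followed by Tj and resolves the backslash escapes, so the raw codes can be inspected. It ignores hex strings, TJ arrays, the ' and " operators, comments, and inline images, and it does not map codes to Unicode.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration only; not part of PDFsharp.
public static class TjStrings
{
    // Returns the unescaped character codes of every literal string "( ... )"
    // that is directly followed by the Tj operator.
    public static List<byte[]> Extract(byte[] content)
    {
        var result = new List<byte[]>();
        int i = 0;
        while (i < content.Length)
        {
            if (content[i] != '(') { i++; continue; }
            i++; // skip the opening parenthesis
            var codes = new List<byte>();
            int depth = 1; // balanced parentheses may appear unescaped
            while (i < content.Length && depth > 0)
            {
                byte b = content[i++];
                if (b == '\\' && i < content.Length) ReadEscape(content, ref i, codes);
                else if (b == '(') { depth++; codes.Add(b); }
                else if (b == ')') { if (--depth > 0) codes.Add(b); }
                else codes.Add(b);
            }
            // Keep the string only if the next token is exactly "Tj".
            int j = i;
            while (j < content.Length && IsWhite(content[j])) j++;
            bool isTj = j + 1 < content.Length
                        && content[j] == 'T' && content[j + 1] == 'j'
                        && (j + 2 == content.Length || IsWhite(content[j + 2]));
            if (isTj) result.Add(codes.ToArray());
        }
        return result;
    }

    private static bool IsWhite(byte b)
    {
        return b == 0 || b == 9 || b == 10 || b == 12 || b == 13 || b == 32;
    }

    // Handles the escape sequences of PDF literal strings; i points just past the backslash.
    private static void ReadEscape(byte[] s, ref int i, List<byte> codes)
    {
        byte e = s[i++];
        switch ((char)e)
        {
            case 'n': codes.Add(0x0A); break;
            case 'r': codes.Add(0x0D); break;
            case 't': codes.Add(0x09); break;
            case 'b': codes.Add(0x08); break;
            case 'f': codes.Add(0x0C); break;
            case '\r': if (i < s.Length && s[i] == '\n') i++; break; // line continuation
            case '\n': break;                                        // line continuation
            default:
                if (e >= '0' && e <= '7')
                {
                    // Octal escape with one to three digits.
                    int v = e - '0';
                    for (int k = 0; k < 2 && i < s.Length && s[i] >= '0' && s[i] <= '7'; k++)
                        v = v * 8 + (s[i++] - '0');
                    codes.Add((byte)(v & 0xFF));
                }
                else codes.Add(e); // \( \) \\ and unknown escapes: keep the character itself
                break;
        }
    }
}
```

Feeding it the stream bytes from the question and printing each result with BitConverter.ToString shows the codes as hex. For the first string in the question the codes are 21 22 23 24 25 26 27 28 27 29 ..., that is, they climb from 0x21 roughly in order of first appearance. That pattern would be consistent with a font that uses its own custom encoding, in which case the codes only become readable text after being mapped through that font's encoding information.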

Author:	robatjazi [ Thu Sep 13, 2012 2:33 pm ]
Post subject:	Re: Problem for getting text from PDF file
Hi Thomas,

I can select the text in Adobe Reader and copy it into Notepad, and I see the correct text. I also tried iTextSharp, and that tool returns the correct text data from the PDF file. I do not know what the problem is.

Thanks,
Majid

PDFsharp & MigraDoc Foundation https://forum.pdfsharp.net/

Problem for getting text from PDF file https://forum.pdfsharp.net/viewtopic.php?f=2&t=2138

Author:	robatjazi [ Fri Sep 14, 2012 1:53 pm ]
Post subject:	Re: Problem for getting text from PDF file
Hi Thomas,

I found that there are different PDF specification versions, such as PDF-1.2, PDF-1.3, and so on. PDFsharp works fine with PDF-1.3, but when I want to get text data from a PDF file with the PDF-1.2 specification, it returns some bad text data. Do you have any comment or suggestion to handle this situation? It would be great if you could give me a clue about what I should do.

Thanks,
Majid

Author:	Thomas Hoevel [ Mon Sep 17, 2012 8:26 am ]
Post subject:	Re: Problem for getting text from PDF file
robatjazi wrote:
PDFsharp works fine with PDF-1.3, but when I want to get text data from a PDF file with the PDF-1.2 specification, it returns some bad text data. Do you have any comment or suggestion to handle this situation?

I presume the 1.2 files come from a different application than the 1.3 files. The files that work will still work when you change the header from 1.3 to 1.2. The files that do not work will still fail with PDFsharp if you change the header from 1.2 to 1.3. For Adobe Reader it should make no difference either.
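The header mentioned above is simply the first line of the file, for example %PDF-1.2. As a minimal sketch (the class name PdfHeader and the method ReadVersion are hypothetical, not PDFsharp API), the version string can be read from the first bytes of a file like this, which makes it easy to check which header a given file carries:

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration only; not part of PDFsharp.
public static class PdfHeader
{
    // Returns e.g. "1.2" for data that starts with "%PDF-1.2",
    // or an empty string if no header is found in the first 1024 bytes.
    public static string ReadVersion(byte[] fileStart)
    {
        int length = Math.Min(fileStart.Length, 1024);
        string head = Encoding.ASCII.GetString(fileStart, 0, length);

        int marker = head.IndexOf("%PDF-", StringComparison.Ordinal);
        if (marker < 0) return "";

        // Collect the digits and dots that follow the marker.
        int start = marker + 5;
        int end = start;
        while (end < head.Length && (char.IsDigit(head[end]) || head[end] == '.')) end++;
        return head.Substring(start, end - start);
    }
}
```

For a file on disk, the bytes could come from File.ReadAllBytes(path). The header only states the declared version; it does not describe how the text in the file is encoded.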

All times are UTC