PDFsharp & MigraDoc Foundation • View topic - Problem retrieving raw text content

All times are UTC

Problem retrieving raw text content

Moderator: Stefan Lange

antesima

Post subject: Problem retrieving raw text content

Posted: Wed Jul 02, 2008 9:46 am

Joined: Wed Jul 02, 2008 9:33 am
Posts: 5

Hello,

thanks for this excellent library, I use it almost every day.

Now I'm having problems retrieving raw text from an existing PDF file.

I use the following code, and the resulting string is then parsed to find some information.

Code:

PdfDocument pddDoc = PdfReader.Open(strPath_, PdfDocumentOpenMode.ReadOnly);

foreach (PdfPage ppgPage in pddDoc.Pages)
{
    // Append the raw content stream of the current page (operators and operands, not plain text).
    strReturn += ppgPage.Contents.CreateSingleContent().Stream.ToString();
}

Everything works with PDF files generated by MS Reporting Services,
but with some PDF files generated with MigraDoc I retrieve no text,
just codes like these:

Code:

0 Td <005000440076005700550048> Tj
979 0 Td <003C005900480056> Tj
662 0 Td <002700480045004800570048005100460052005800550057> Tj

They look like dictionary keys, but how can I extract the text content from them?

Regards,
Antesima

antesima

Post subject:

Posted: Wed Jul 16, 2008 2:48 pm

Joined: Wed Jul 02, 2008 9:33 am
Posts: 5

Does anybody have a clue?

Do you need sample code that generates the PDF?
(It is generated with PDFsharp using the Times New Roman font.)

Thomas Hoevel

Post subject:

Posted: Wed Jul 16, 2008 3:12 pm

PDFsharp Guru

Joined: Mon Oct 16, 2006 8:16 am
Posts: 3096
Location: Cologne, Germany

Between the angle brackets you see Unicode characters in hex format.
You can convert them to Unicode strings using .NET (take 4 hex digits, convert to int, convert to char, add to string).

Since the high byte is always 00 (in the samples shown), these are ordinary ANSI chars.

I may be wrong: maybe these are not Unicode chars, but indices into the font subset.

It should also be possible to create ANSI PDF files with MigraDoc (it's a parameter of PdfDocumentRenderer).
OTOH for compatibility of your application with unknown PDF files you should support Unicode, too.
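
The "take 4 chars, convert to int" step can be sketched roughly as below. This is only an illustration: the class and method names are hypothetical (nothing here is PDFsharp API), the regular expression handles only the simple `<hex> Tj` form shown in the question (no TJ arrays, no literal strings), and it assumes every code is two bytes wide. It returns the raw 16-bit codes rather than a string, because whether those codes are Unicode values or glyph indices depends on the font.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of PDFsharp.
public static class TjHexExtractor
{
    // Matches a hex string operand followed by the Tj operator, e.g. "<003C0059> Tj".
    private static readonly Regex HexTj = new Regex(@"<([0-9A-Fa-f\s]*)>\s*Tj");

    // Returns one array of 16-bit codes per Tj operator found in the content stream text.
    // Assumption: codes are two bytes (four hex digits) wide.
    public static List<ushort[]> ExtractCodes(string content)
    {
        var result = new List<ushort[]>();
        foreach (Match m in HexTj.Matches(content))
        {
            // Whitespace inside a PDF hex string is insignificant, so drop it.
            string hex = Regex.Replace(m.Groups[1].Value, @"\s", "");
            var codes = new ushort[hex.Length / 4];
            for (int i = 0; i < codes.Length; i++)
            {
                codes[i] = ushort.Parse(hex.Substring(i * 4, 4),
                                        NumberStyles.HexNumber,
                                        CultureInfo.InvariantCulture);
            }
            result.Add(codes);
        }
        return result;
    }
}
```

For the line `979 0 Td <003C005900480056> Tj` this yields a single array with the four codes 0x003C, 0x0059, 0x0048 and 0x0056.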

_________________
Regards
Thomas Hoevel
PDFsharp Team

antesima

Post subject:

Posted: Thu Jul 17, 2008 6:27 am

Joined: Wed Jul 02, 2008 9:33 am
Posts: 5

OK, thank you. I will give it a try and report back.

Regards,
Antesima

antesima

Post subject:

Posted: Thu Jul 24, 2008 8:56 am

Joined: Wed Jul 02, 2008 9:33 am
Posts: 5

It doesn't seem to fit...

Here is the string I am trying to convert:

"00280057005800470048"

and the code I use:

Code:

private static string ConvertNumericUnicode(string strArgument_)
{
    // Decode the hex string four digits (one 16-bit code) at a time.
    // Any trailing group shorter than four digits is ignored.
    var sbResult = new System.Text.StringBuilder();
    for (int i = 0; i + 4 <= strArgument_.Length; i += 4)
    {
        int iChar = Int32.Parse(strArgument_.Substring(i, 4), System.Globalization.NumberStyles.HexNumber);
        sbResult.Append((char)iChar);
    }
    return sbResult.ToString();
}

Am I missing something?

Regards,
Antesima

Thomas Hoevel

Post subject:

Posted: Thu Jul 24, 2008 3:12 pm

PDFsharp Guru

Joined: Mon Oct 16, 2006 8:16 am
Posts: 3096
Location: Cologne, Germany

So it seems these are indices into the font subsets, not Unicode character codes (that would be too simple :cry:); don't blame me, I warned you about it.

So you have to add another level of indirection by looking into the font table. That's not my area of expertise so I can't give you any clue.

The other solution: create ANSI PDF files ...
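
One place where that extra level of indirection may be found is the font's /ToUnicode CMap, if the font dictionary has one: a small stream that maps the codes used in the content stream to Unicode. The following is only a rough sketch under several assumptions: the CMap text has already been extracted and decompressed, codes are two bytes wide, destinations in `bfrange` entries are single 4-digit values, and the array form of `bfrange` does not occur. All names are hypothetical and nothing here is PDFsharp API.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of PDFsharp.
public static class ToUnicodeSketch
{
    private static int Hex(string s)
    {
        return int.Parse(s, NumberStyles.HexNumber, CultureInfo.InvariantCulture);
    }

    // Interprets a hex string as UTF-16BE, four hex digits per char.
    private static string HexToString(string hex)
    {
        var sb = new StringBuilder();
        for (int i = 0; i + 4 <= hex.Length; i += 4)
            sb.Append((char)Hex(hex.Substring(i, 4)));
        return sb.ToString();
    }

    // Builds a code -> text table from the bfchar and bfrange sections of a CMap.
    public static Dictionary<int, string> ParseCMap(string cmap)
    {
        var map = new Dictionary<int, string>();

        // bfchar sections hold "<code> <unicode>" pairs.
        foreach (Match sec in Regex.Matches(cmap, @"beginbfchar(.*?)endbfchar", RegexOptions.Singleline))
            foreach (Match m in Regex.Matches(sec.Groups[1].Value, @"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>"))
                map[Hex(m.Groups[1].Value)] = HexToString(m.Groups[2].Value);

        // bfrange sections hold "<first> <last> <unicode of first>" triples.
        foreach (Match sec in Regex.Matches(cmap, @"beginbfrange(.*?)endbfrange", RegexOptions.Singleline))
            foreach (Match m in Regex.Matches(sec.Groups[1].Value,
                     @"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]{4})>"))
            {
                int lo = Hex(m.Groups[1].Value);
                int hi = Hex(m.Groups[2].Value);
                int dst = Hex(m.Groups[3].Value);
                for (int code = lo; code <= hi; code++)
                    map[code] = ((char)(dst + (code - lo))).ToString();
            }
        return map;
    }

    // Decodes the hex operand of a Tj operator, four digits per code.
    // Codes missing from the table become U+FFFD.
    public static string Decode(string hexOperand, Dictionary<int, string> map)
    {
        var sb = new StringBuilder();
        for (int i = 0; i + 4 <= hexOperand.Length; i += 4)
        {
            string text;
            sb.Append(map.TryGetValue(Hex(hexOperand.Substring(i, 4)), out text) ? text : "\uFFFD");
        }
        return sb.ToString();
    }
}
```

A usage example with an invented CMap fragment (the real mapping has to come from the PDF file itself):

```csharp
string cmap = "1 beginbfrange\n<0044> <0058> <0061>\nendbfrange\n" +
              "1 beginbfchar\n<0028> <0045>\nendbfchar";
var map = ToUnicodeSketch.ParseCMap(cmap);
// With this invented mapping the string from the question decodes to "Etude".
string text = ToUnicodeSketch.Decode("00280057005800470048", map);
```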

_________________
Regards
Thomas Hoevel
PDFsharp Team

antesima

Post subject:

Posted: Tue Jul 29, 2008 12:38 pm

Joined: Wed Jul 02, 2008 9:33 am
Posts: 5

OK, thank you. I will try to get the fonts and extract the text.

If I manage to get some code that works, I will publish it here.
