PDFsharp & MigraDoc Foundation

PDFsharp - A .NET library for processing PDF & MigraDoc Foundation - Creating documents on the fly

All times are UTC


Posted: Wed Jul 02, 2008 9:46 am

Joined: Wed Jul 02, 2008 9:33 am
Posts: 5
Hello,

thanks for this excellent library, I use it almost every day.

Now I'm having problems retrieving raw text from an existing
PDF file.

I use the following code; the resulting strings are then parsed to find some information.

Code:
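// Requires: using PdfSharp.Pdf; using PdfSharp.Pdf.IO;
PdfDocument pddDoc = PdfReader.Open(strPath_, PdfDocumentOpenMode.ReadOnly);
string strReturn = "";

foreach (PdfPage ppgPage in pddDoc.Pages)
{
    // Append the raw content stream of the current page.
    strReturn += ppgPage.Contents.CreateSingleContent().Stream.ToString();
}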


Everything works with PDF files generated by MS Reporting Services,
but with some PDF files generated with MigraDoc I get no readable text,
just codes like these:

Code:
15 0 Td <005000440076005700550048> Tj
36.979 0 Td <003C005900480056> Tj
27.662 0 Td <002700480045004800570048005100460052005800550057> Tj


These look like dictionary keys, but how can I extract the text content from them?

Regards,
Antesima


Posted: Wed Jul 16, 2008 2:48 pm

Joined: Wed Jul 02, 2008 9:33 am
Posts: 5
Does anybody have a clue?

Do you need sample code that generates the PDF?
(It is generated with PDFsharp using the Times New Roman font.)


Posted: Wed Jul 16, 2008 3:12 pm
PDFsharp Guru

Joined: Mon Oct 16, 2006 8:16 am
Posts: 3096
Location: Cologne, Germany
Between the angle brackets you see Unicode characters in hex format.
You can convert them to Unicode strings using .NET (take 4 hex digits, convert to int, convert to char, append to the string).

Since the high byte is always 00 (in the samples shown), these are ordinary ANSI chars.

I may be wrong: maybe these are not Unicode chars but indices into the font subset.

It should also be possible to create ANSI PDF files with MigraDoc (it's a parameter of PdfDocumentRenderer).
On the other hand, for compatibility of your application with unknown PDF files you should support Unicode, too.

_________________
Regards
Thomas Hoevel
PDFsharp Team
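
For readers following along, pulling the hex strings out of the content stream before converting them could be sketched roughly as below. This is only an illustration: the names `HexStringScanner` and `ExtractCodes` are made up for this example, and it assumes every `<...> Tj` operand holds 16-bit codes written as four hex digits, as in the sample above. It does not handle `TJ` arrays or literal `(...)` strings.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

public static class HexStringScanner
{
    // Matches a PDF hex string immediately followed by the Tj (show text) operator.
    private static readonly Regex HexTj = new Regex(@"<([0-9A-Fa-f\s]*)>\s*Tj");

    // Returns, for each Tj operand found, the 16-bit codes it contains.
    // Trailing digits that do not fill a group of four are ignored.
    public static List<int[]> ExtractCodes(string contentStream)
    {
        var result = new List<int[]>();
        foreach (Match m in HexTj.Matches(contentStream))
        {
            // PDF allows whitespace inside hex strings; strip it first.
            string hex = Regex.Replace(m.Groups[1].Value, @"\s+", "");
            var codes = new int[hex.Length / 4];
            for (int i = 0; i < codes.Length; i++)
            {
                codes[i] = int.Parse(hex.Substring(i * 4, 4),
                                     NumberStyles.HexNumber,
                                     CultureInfo.InvariantCulture);
            }
            result.Add(codes);
        }
        return result;
    }
}
```

Each returned array can then be fed to whatever decoding step turns out to be correct for the font in question.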


Posted: Thu Jul 17, 2008 6:27 am

Joined: Wed Jul 02, 2008 9:33 am
Posts: 5
OK, thank you. I will give it a try and report back.

Regards,
Antesima


Posted: Thu Jul 24, 2008 8:56 am

Joined: Wed Jul 02, 2008 9:33 am
Posts: 5
It doesn't seem to work...

Here is the string I am trying to convert:

"00280057005800470048"

and the code I use:

Code:
private static string ConvertNumericUnicode(string strArgument_)
{
    string strResult = null;
    string strCurrent = strArgument_;
    while (strCurrent.Length >= 4)
    {
        // Interpret the next four hex digits as one 16-bit character code.
        int iChar = Int32.Parse(strCurrent.Substring(0, 4), System.Globalization.NumberStyles.AllowHexSpecifier);

        char cTemp = (char)iChar;
        strResult += cTemp;
        strCurrent = strCurrent.Substring(4);
    }
    return strResult;
}


Am I missing something?

Regards,
Antesima


Posted: Thu Jul 24, 2008 3:12 pm
PDFsharp Guru

Joined: Mon Oct 16, 2006 8:16 am
Posts: 3096
Location: Cologne, Germany
So it seems these are indices into the font subsets, not Unicode character codes (that would be too simple :cry: ). Don't blame me, I warned you about it.

So you have to add another level of indirection by looking into the font table. That's not my area of expertise, so I can't give you any pointers.

The other solution: create ANSI PDF files ...

_________________
Regards
Thomas Hoevel
PDFsharp Team
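
The extra level of indirection mentioned above is, in many PDF files, the font's /ToUnicode CMap: a small stream referenced from the font dictionary that maps each code used in the content stream to Unicode. Whether a given file contains one has to be checked. A rough sketch of reading such a map follows. It assumes the CMap text has already been extracted and decoded into a string, handles only `bfchar` entries and `bfrange` entries with a single start value, both with 16-bit codes, and uses hypothetical names (`ToUnicodeMap`, `Parse`, `Decode`).

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;
using System.Text.RegularExpressions;

public static class ToUnicodeMap
{
    private static int Hex(string s)
    {
        return int.Parse(s, NumberStyles.HexNumber, CultureInfo.InvariantCulture);
    }

    // Builds a code -> Unicode table from the text of a ToUnicode CMap.
    public static Dictionary<int, char> Parse(string cmapText)
    {
        var map = new Dictionary<int, char>();

        // bfchar sections list pairs: <code> <unicode>
        foreach (Match section in Regex.Matches(cmapText, @"beginbfchar(.*?)endbfchar", RegexOptions.Singleline))
        {
            foreach (Match e in Regex.Matches(section.Groups[1].Value,
                         @"<([0-9A-Fa-f]{4})>\s*<([0-9A-Fa-f]{4})>"))
            {
                map[Hex(e.Groups[1].Value)] = (char)Hex(e.Groups[2].Value);
            }
        }

        // bfrange sections list triples: <first code> <last code> <unicode of first code>
        foreach (Match section in Regex.Matches(cmapText, @"beginbfrange(.*?)endbfrange", RegexOptions.Singleline))
        {
            foreach (Match e in Regex.Matches(section.Groups[1].Value,
                         @"<([0-9A-Fa-f]{4})>\s*<([0-9A-Fa-f]{4})>\s*<([0-9A-Fa-f]{4})>"))
            {
                int first = Hex(e.Groups[1].Value);
                int last = Hex(e.Groups[2].Value);
                int target = Hex(e.Groups[3].Value);
                for (int code = first; code <= last; code++)
                {
                    map[code] = (char)(target + (code - first));
                }
            }
        }
        return map;
    }

    // Decodes a string of 4-digit hex codes; unmapped codes become U+FFFD.
    public static string Decode(string hex, Dictionary<int, char> map)
    {
        var sb = new StringBuilder();
        for (int i = 0; i + 4 <= hex.Length; i += 4)
        {
            char c;
            sb.Append(map.TryGetValue(Hex(hex.Substring(i, 4)), out c) ? c : '\uFFFD');
        }
        return sb.ToString();
    }
}
```

With a map built this way, `ToUnicodeMap.Decode("00280057005800470048", map)` returns whatever characters the font's own table assigns to those five codes, rather than treating the codes as Unicode directly.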


Posted: Tue Jul 29, 2008 12:38 pm

Joined: Wed Jul 02, 2008 9:33 am
Posts: 5
OK, thank you. I will try to get the fonts and extract the text.

If I manage to get some working code, I will post it here.
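
One observation on the sample strings in this thread, offered as a hedged aside rather than a fix: adding 29 to each code happens to give readable ASCII for most of them. That is consistent with the codes being glyph indices in a TrueType font whose printable ASCII glyphs sit 29 positions below their character codes, but that is an assumption about this particular font, not something PDF guarantees. It does not hold for every code in the samples (0076 in the first line, for instance, does not land on a letter), and it will fail for subsetted or differently ordered fonts. The names `GlyphShift` and `ShiftDecode` are made up for this sketch.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class GlyphShift
{
    // Converts 4-digit hex codes to chars after adding a fixed offset.
    // offset = 0 reproduces the plain "hex to char" conversion.
    public static string ShiftDecode(string hex, int offset)
    {
        var sb = new StringBuilder();
        for (int i = 0; i + 4 <= hex.Length; i += 4)
        {
            int code = int.Parse(hex.Substring(i, 4),
                                 NumberStyles.HexNumber,
                                 CultureInfo.InvariantCulture);
            sb.Append((char)(code + offset));
        }
        return sb.ToString();
    }
}
```

For the string "00280057005800470048", `ShiftDecode(hex, 0)` returns "(WXGH", while `ShiftDecode(hex, 29)` returns "Etude". A lookup through the font's own mapping tables remains the reliable route.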

