PDFsharp & MigraDoc Foundation • View topic - ContentReader.ReadContent not working for special signs

All times are UTC

ContentReader.ReadContent not working for special signs

Moderator: Stefan Lange

peter.pazurik

Post subject: ContentReader.ReadContent not working for special signs

Posted: Tue Dec 15, 2020 10:06 am

Joined: Tue Dec 15, 2020 6:30 am
Posts: 8

I am reading a table in a PDF using "ContentReader.ReadContent". Everything works well except for cells containing special language signs. For example, the word "Prüf" is read as "\03\0U\0\u0081\0I", which doesn't seem to make sense in any encoding. Words without such special signs are read as they are. Is there something extra I need to do in order to read words with special language signs? Or should I use a completely different approach?

I am quite new to reading PDF files, so please excuse me if the solution is simple, but I have not found anything useful so far. Thank you very much for any help!

Peter


TH-Soft

Post subject: Re: ContentReader.ReadContent not working for special signs

Posted: Tue Dec 15, 2020 10:19 am

PDFsharp Expert

Joined: Sat Mar 14, 2015 10:15 am
Posts: 916
Location: CCAA

Maybe you have to use the mapping table to decode the string.

_________________
Best regards
Thomas
(Freelance Software Developer with several years of MigraDoc/PDFsharp experience)


peter.pazurik

Post subject: Re: ContentReader.ReadContent not working for special signs

Posted: Tue Dec 15, 2020 10:58 am

Joined: Tue Dec 15, 2020 6:30 am
Posts: 8

Thanks, but I am not sure what you mean exactly. What type of string encoding do you mean? Could you please send me a link to an existing solution or example?

Just to explain myself: it seems that the encoding doesn't match any known encoding. Another example:
"Änderungsbeschreibung" -> "\0b\0Q\0G\0H\0U\0X\0Q\0J\0V\0E\0H\0V\0F\0K\0U\0H\0L\0E\0X\0Q\0J".

Reading words with standard signs works just fine.


Last edited by peter.pazurik on Tue Dec 15, 2020 12:02 pm, edited 1 time in total.


TH-Soft

Post subject: Re: ContentReader.ReadContent not working for special signs

Posted: Tue Dec 15, 2020 11:53 am

PDFsharp Expert

Joined: Sat Mar 14, 2015 10:15 am
Posts: 916
Location: CCAA

I haven't seen your PDF yet, so I can only speculate.

PDFs can contain a subset of a font, and in that case the PDF may contain a mapping table between index and character. Strings in the PDF then contain the index into the subset, not the character code.
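
To illustrate the point about indices, here is a minimal sketch. It assumes (as the quoted output suggests) that each char of the string returned by ReadContent carries one raw byte and that the font uses 2-byte codes. `GlyphCodes` and its methods are hypothetical names, not part of PDFsharp.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class GlyphCodes
{
    // Pairs up the characters of a raw content string as big-endian
    // 16-bit codes: (high byte, low byte) -> one code.
    public static List<int> Split(string raw)
    {
        if (raw.Length % 2 != 0)
            throw new ArgumentException("Expected an even number of bytes.", nameof(raw));

        var codes = new List<int>();
        for (int i = 0; i < raw.Length; i += 2)
            codes.Add(((raw[i] & 0xFF) << 8) | (raw[i + 1] & 0xFF));
        return codes;
    }

    // Formats codes the way they appear in a mapping table, e.g. "0033 0055".
    public static string Format(IEnumerable<int> codes) =>
        string.Join(" ", codes.Select(c => c.ToString("X4")));
}
```

With the string from the first post, `GlyphCodes.Format(GlyphCodes.Split("\03\0U\0\u0081\0I"))` gives `0033 0055 0081 0049`. Note that `\03` is a NUL byte followed by the character '3', so the first code is 0x0033 rather than 0x0003.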


peter.pazurik

Post subject: Re: ContentReader.ReadContent not working for special signs

Posted: Tue Dec 15, 2020 12:18 pm

Joined: Tue Dec 15, 2020 6:30 am
Posts: 8

Oh, thank you! I will try to study this. Is there an example where I can read how to get and use such a mapping table?


TH-Soft

Post subject: Re: ContentReader.ReadContent not working for special signs

Posted: Tue Dec 15, 2020 12:59 pm

PDFsharp Expert

Joined: Sat Mar 14, 2015 10:15 am
Posts: 916
Location: CCAA

Here someone solved the task without sharing any code:
viewtopic.php?p=10564#p10564

I'm afraid I cannot point you to sample code.


peter.pazurik

Post subject: Re: ContentReader.ReadContent not working for special signs

Posted: Wed Dec 16, 2020 9:37 am

Joined: Tue Dec 15, 2020 6:30 am
Posts: 8

OK, so I found a way to get the mapping table: traverse the document elements "/Resources" -> "/Font" (PdfDictionary objects). This way I should get all the fonts used, and each should have its own "/ToUnicode" element where a mapping table is defined. In my PDF document I found 4 fonts, but only one had a "/ToUnicode" element, so I have only one mapping table.

Back to my example: "\03\0U\0\u0081\0I" -> " Prüf"
In my mapping table I was able to find 2 values:
<0003> <0020> (meaning ' ' in Unicode)
<0081> <00FC> (meaning 'ü' in Unicode)
So what about the remaining values, for example "\0U"? How do I map them when they are not in the mapping table? Is this really the right way to go, or is there a more convenient API to get the correct mapping table? I still feel that this is just a workaround and not the correct way to get the mapping table. I am still missing some important information.
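
The "/Resources" -> "/Font" -> "/ToUnicode" traversal described above could be sketched roughly as follows. This is an untested sketch, assuming PDFsharp 1.5x, where `Elements.GetDictionary` resolves indirect references and `Stream.UnfilteredValue` returns the decoded stream bytes. `ToUnicodeFinder` is a hypothetical helper name, not a PDFsharp class.

```csharp
using System.Collections.Generic;
using System.Text;
using PdfSharp.Pdf;

public static class ToUnicodeFinder
{
    // Returns the decoded /ToUnicode CMap text of every font resource of a
    // page that has one, keyed by the font's resource name.
    public static Dictionary<string, string> Find(PdfPage page)
    {
        var cmaps = new Dictionary<string, string>();

        PdfDictionary resources = page.Elements.GetDictionary("/Resources");
        PdfDictionary fonts = resources?.Elements.GetDictionary("/Font");
        if (fonts == null)
            return cmaps;

        // Copy the keys first so that lookups cannot disturb the enumeration.
        foreach (string fontName in new List<string>(fonts.Elements.Keys))
        {
            PdfDictionary font = fonts.Elements.GetDictionary(fontName);
            PdfDictionary toUnicode = font?.Elements.GetDictionary("/ToUnicode");
            if (toUnicode?.Stream == null)
                continue; // this font carries no mapping table

            // A ToUnicode CMap is plain ASCII text once the stream is unfiltered.
            cmaps[fontName] = Encoding.ASCII.GetString(toUnicode.Stream.UnfilteredValue);
        }
        return cmaps;
    }
}
```

Fonts without a "/ToUnicode" entry possibly use a simple single-byte encoding, which might be why text set in those fonts already reads correctly; that is an assumption about this particular PDF, not something the thread confirms.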


TH-Soft

Post subject: Re: ContentReader.ReadContent not working for special signs

Posted: Wed Dec 16, 2020 10:31 am

PDFsharp Expert

Joined: Sat Mar 14, 2015 10:15 am
Posts: 916
Location: CCAA

Maybe "\0U" translates as "\0055". 0x55 is "U".
Do you have that in the mapping table?


peter.pazurik

Post subject: Re: ContentReader.ReadContent not working for special signs

Posted: Wed Dec 16, 2020 12:10 pm

Joined: Tue Dec 15, 2020 6:30 am
Posts: 8

Unfortunately not.

And what about my question: is my approach OK, or are there any special APIs or classes for working with those mapping tables?

FYI, this is the whole table; it seems a bit small to me:

{/CIDInit /ProcSet findresource begin
20 dict begin
begincmap
/CIDSystemInfo
<< /Registry (Adobe)
/Ordering (UCS)
/Supplement 0
>> def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
1 beginbfchar
<0003> <0020>
endbfchar
1 beginbfrange
<000B> <000C> <0028>
endbfrange
1 beginbfchar
<000F> <002C>
endbfchar
5 beginbfrange
<0011> <001D> <002E>
<0020> <0022> <003D>
<0024> <002C> <0041>
<002F> <0033> <004C>
<0035> <003A> <0052>
endbfrange
2 beginbfchar
<003D> <005A>
<0042> <005F>
endbfchar
2 beginbfrange
<0044> <004C> <0061>
<004E> <005D> <006B>
endbfrange
1 beginbfchar
<0062> <00C4>
endbfchar
1 beginbfrange
<0067> <0068> [<00D6> <00DC>]
endbfrange
8 beginbfchar
<006A> <00E0>
<006C> <00E4>
<0071> <00E8>
<007C> <00F6>
<0081> <00FC>
<0083> <00B0>
<008B> <00A9>
<00AB> <2026>
endbfchar
2 beginbfrange
<00B3> <00B4> <201C>
<00B5> <00B6> <2018>
endbfrange
2 beginbfchar
<00C4> <201E>
<00F0> <00B2>
endbfchar
endcmap
CMapName currentdict /CMap defineresource pop
end
end}
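
A table like this one can be expanded into a plain lookup dictionary. The following is a minimal sketch, assuming only the `bfchar` and `bfrange` sections matter and that every source code is a single hex token; it ignores `codespacerange` and other CMap features. `ToUnicodeCMap` is a hypothetical helper, not part of PDFsharp.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

public static class ToUnicodeCMap
{
    const string HexToken = @"<([0-9A-Fa-f]+)>";

    // Expands all bfchar and bfrange entries into a code -> text dictionary.
    public static Dictionary<int, string> Parse(string cmap)
    {
        var map = new Dictionary<int, string>();

        // bfchar: "<src> <dst>" pairs.
        foreach (Match section in Regex.Matches(cmap, "beginbfchar(.*?)endbfchar", RegexOptions.Singleline))
            foreach (Match m in Regex.Matches(section.Groups[1].Value, HexToken + @"\s*" + HexToken))
                map[Hex(m.Groups[1].Value)] = Utf16(m.Groups[2].Value);

        // bfrange: "<lo> <hi> <dstStart>" or "<lo> <hi> [<dst0> <dst1> ...]".
        string rangePattern = HexToken + @"\s*" + HexToken + @"\s*(?:" + HexToken + @"|\[(.*?)\])";
        foreach (Match section in Regex.Matches(cmap, "beginbfrange(.*?)endbfrange", RegexOptions.Singleline))
            foreach (Match m in Regex.Matches(section.Groups[1].Value, rangePattern, RegexOptions.Singleline))
            {
                int lo = Hex(m.Groups[1].Value), hi = Hex(m.Groups[2].Value);
                if (m.Groups[3].Success)
                {
                    // Single start value: the last UTF-16 unit grows with the code.
                    string start = Utf16(m.Groups[3].Value);
                    string head = start.Substring(0, start.Length - 1);
                    char last = start[start.Length - 1];
                    for (int code = lo; code <= hi; code++)
                        map[code] = head + (char)(last + (code - lo));
                }
                else
                {
                    // Array form: one explicit destination per code in the range.
                    MatchCollection dsts = Regex.Matches(m.Groups[4].Value, HexToken);
                    for (int i = 0; i < dsts.Count && lo + i <= hi; i++)
                        map[lo + i] = Utf16(dsts[i].Groups[1].Value);
                }
            }
        return map;
    }

    // Translates codes to text; codes missing from the map become U+FFFD.
    public static string Decode(IEnumerable<int> codes, IDictionary<int, string> map)
    {
        var sb = new StringBuilder();
        foreach (int code in codes)
            sb.Append(map.TryGetValue(code, out string text) ? text : "\uFFFD");
        return sb.ToString();
    }

    static int Hex(string s) => Convert.ToInt32(s, 16);

    // Interprets a hex token as UTF-16BE text.
    static string Utf16(string hex)
    {
        var bytes = new byte[hex.Length / 2];
        for (int i = 0; i < bytes.Length; i++)
            bytes[i] = Convert.ToByte(hex.Substring(2 * i, 2), 16);
        return Encoding.BigEndianUnicode.GetString(bytes);
    }
}
```

Parsing the table above and decoding the codes 0033 0055 0081 0049 yields "Prüf", so the table is complete even though it looks small: most letters are covered by ranges rather than by individual entries.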


peter.pazurik

Post subject: Re: ContentReader.ReadContent not working for special signs

Posted: Wed Dec 16, 2020 12:27 pm

Joined: Tue Dec 15, 2020 6:30 am
Posts: 8

By the way, I am really baffled by this. I am doing really basic stuff here, and PDFsharp is in my opinion a widely used tool. Has no one encountered such problems before me? It cannot be that difficult...

The code is super simple:

Code:

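using System.Text;
using PdfSharp.Pdf;
using PdfSharp.Pdf.Content;
using PdfSharp.Pdf.IO;

// filePath is the path to the PDF; ExtractText is my own helper (described below).
using (var _document = PdfReader.Open(filePath, PdfDocumentOpenMode.ReadOnly))
{
   var result = new StringBuilder();
   foreach (PdfPage page in _document.Pages)
   {
      // Parse the page's content stream into a CSequence and collect its text.
      ExtractText(ContentReader.ReadContent(page), result);
      result.AppendLine();
   }
}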

Within the ExtractText method I am just walking through the elements of the CSequence returned by the ReadContent method. When an element is of type CString, it's mostly OK, but some of the words are not "translated". An excerpt of a page looks like this: " |Status| |in Arbeit| |1| |von| |326| |Restricted| |\03\0U\0\u0081\0I| |1| |1| |1| |SAP"
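
Since only one of the fonts has a mapping table, a walk over the CSequence probably needs to remember which font is active when a string is shown. The following is an untested sketch of that idea, not the ExtractText method from this post. It assumes PDFsharp 1.5x content objects (`COperator.OpCode.Name`, `COperator.Operands`, `CName.Name`, `CString.Value`), and `ShownStrings` is a hypothetical name.

```csharp
using System.Collections.Generic;
using PdfSharp.Pdf.Content.Objects;

public static class ShownStrings
{
    // Yields (font resource name, raw string) for every string operand of a
    // Tj or TJ operator. The font name is taken from the most recent Tf.
    public static IEnumerable<KeyValuePair<string, string>> Enumerate(CSequence content)
    {
        string currentFont = null;
        foreach (CObject obj in content)
        {
            var op = obj as COperator;
            if (op == null)
                continue;

            if (op.OpCode.Name == "Tf" && op.Operands.Count > 0 && op.Operands[0] is CName fontName)
            {
                currentFont = fontName.Name; // should match a key in the page's /Font dictionary
            }
            else if (op.OpCode.Name == "Tj" || op.OpCode.Name == "TJ")
            {
                foreach (string text in Strings(op.Operands))
                    yield return new KeyValuePair<string, string>(currentFont, text);
            }
        }
    }

    // Collects the strings among the operands, descending into the array
    // operand of TJ (which mixes strings and kerning numbers).
    static IEnumerable<string> Strings(CSequence items)
    {
        foreach (CObject item in items)
        {
            if (item is CString s)
                yield return s.Value;
            else if (item is CSequence nested)
                foreach (string inner in Strings(nested))
                    yield return inner;
        }
    }
}
```

Strings whose font has a "/ToUnicode" table would then be run through that table, and the others left as they are. The sketch ignores the `'` and `"` text operators and font changes restored by `q`/`Q`, so it is a starting point only.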

Last edited by peter.pazurik on Wed Dec 16, 2020 12:35 pm, edited 1 time in total.


TH-Soft

Post subject: Re: ContentReader.ReadContent not working for special signs

Posted: Wed Dec 16, 2020 12:32 pm

PDFsharp Expert

Joined: Sat Mar 14, 2015 10:15 am
Posts: 916
Location: CCAA

PDFsharp was not designed to extract text from PDF.

We have the line "<004E> <005D> <006B>". This line maps "\0U" aka "\0055" to 0x72 or "r".

"\0I" is "\0049". The "<0044> <004C> <0061>" fits. This gives us "\0066" or "f".

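The arithmetic behind both lookups is the same: a line `<lo> <hi> <dstStart>` covers every code from lo to hi, and a code maps to dstStart plus its offset from lo. A minimal sketch, where `BfRange.Map` is a hypothetical helper name:

```csharp
using System;

public static class BfRange
{
    // Maps one code through a single "<lo> <hi> <dstStart>" bfrange entry.
    // Returns null when the code lies outside the range.
    public static char? Map(int code, int lo, int hi, int dstStart)
    {
        if (code < lo || code > hi)
            return null;
        return (char)(dstStart + (code - lo));
    }
}
```

For "\0U" (0x0055), the matching range is 004E..005D: 0x55 - 0x4E = 7 and 0x6B + 7 = 0x72, so `BfRange.Map(0x55, 0x4E, 0x5D, 0x6B)` is 'r'. For "\0I" (0x0049), the range is 0044..004C: 0x49 - 0x44 = 5 and 0x61 + 5 = 0x66, so `BfRange.Map(0x49, 0x44, 0x4C, 0x61)` is 'f'.
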

peter.pazurik

Post subject: Re: ContentReader.ReadContent not working for special signs

Posted: Wed Dec 16, 2020 12:47 pm

Joined: Tue Dec 15, 2020 6:30 am
Posts: 8

PDFsharp was the only tool that gave me a way to distinguish between table cells, at least in my PDF documents. In my ExtractText method I just replaced all Tj operators with '|', so when I get for example " |xxx| " I know that I have the text from one cell. Of course there are more possible cases and scenarios, so I have to build some logic around it, but so far I have not found anything else useful. However, I am completely new to this field, so I really appreciate any advice.


peter.pazurik

Post subject: Re: ContentReader.ReadContent not working for special signs

Posted: Thu Dec 17, 2020 4:45 pm

Joined: Tue Dec 15, 2020 6:30 am
Posts: 8

Hello Stefan,

Could you please elaborate on how you got from "\0I" (or "\0049") to "<0044> <004C> <0061>" and then to "\0066"? Is there any API for that mapping? I don't see the connection there...
