PDFsharp & MigraDoc Foundation
https://forum.pdfsharp.net/

ContentReader.ReadContent not working for special signs
https://forum.pdfsharp.net/viewtopic.php?f=2&t=4209

Author:  peter.pazurik [ Tue Dec 15, 2020 10:06 am ]
Post subject:  ContentReader.ReadContent not working for special signs

I am reading a table in a PDF using "ContentReader.ReadContent". Everything goes well except for cells containing special language characters. For example, the word "Prüf" is read as "\03\0U\0\u0081\0I", which doesn't seem to make sense in any encoding. Words without such characters are read as they are. Is there something extra I need to do in order to read words with special characters? Or should I use a completely different approach?

I am quite new to reading PDF files, so please excuse me if the solution is simple, but I have not found anything useful so far. Thank you very much for any help!

Peter

Author:  TH-Soft [ Tue Dec 15, 2020 10:19 am ]
Post subject:  Re: ContentReader.ReadContent not working for special signs

Maybe you have to use the mapping table to decode the string.

Author:  peter.pazurik [ Tue Dec 15, 2020 10:58 am ]
Post subject:  Re: ContentReader.ReadContent not working for special signs

Thanks, but I am not sure what exactly you mean. What kind of string encoding do you mean? Could you please send me a link to an existing solution or example?

Just to explain myself: it seems that the encoding doesn't match any known encoding. Another example:
"Änderungsbeschreibung" -> "\0b\0Q\0G\0H\0U\0X\0Q\0J\0V\0E\0H\0V\0F\0K\0U\0H\0L\0E\0X\0Q\0J".

Reading words with standard characters works just fine...

Author:  TH-Soft [ Tue Dec 15, 2020 11:53 am ]
Post subject:  Re: ContentReader.ReadContent not working for special signs

I haven't seen your PDF yet, so I can only speculate.

PDFs can contain a subset of a font, and in that case the PDF may contain a mapping table between index and character. Strings in the PDF then contain indexes into the subset, not character codes.

Author:  peter.pazurik [ Tue Dec 15, 2020 12:18 pm ]
Post subject:  Re: ContentReader.ReadContent not working for special signs

Oh, thank you! I will try to study this. Is there an example somewhere that shows how to get and use such a mapping table?

Author:  TH-Soft [ Tue Dec 15, 2020 12:59 pm ]
Post subject:  Re: ContentReader.ReadContent not working for special signs

Here someone solved the task without sharing any code:
viewtopic.php?p=10564#p10564

I'm afraid I cannot point you to sample code.

Author:  peter.pazurik [ Wed Dec 16, 2020 9:37 am ]
Post subject:  Re: ContentReader.ReadContent not working for special signs

OK, so I found a way to get the mapping table: traverse the document elements "/Resources" -> "/Font" (PdfDictionary objects). This way I should get all fonts used, and each of them should have its own "/ToUnicode" element where a mapping table is defined. In my PDF document I found 4 fonts, but only one had a "/ToUnicode" element, so I have only one mapping table.

Back to my example: "\03\0U\0\u0081\0I" -> " Prüf"
In my mapping table I was able to find 2 values:
<0003> <0020> (meaning ' ' in Unicode)
<0081> <00FC> (meaning 'ü' in Unicode)
So what about the remaining values, for example "\0U"? How do I map them when they are not in the mapping table? Is this really the right way to go, or is there a more convenient API to get the correct mapping table? I still feel that this is just a workaround and not the correct way to get to the mapping table. I am still missing some important information.

Author:  TH-Soft [ Wed Dec 16, 2020 10:31 am ]
Post subject:  Re: ContentReader.ReadContent not working for special signs

Maybe "\0U" translates as "\0055". 0x55 is "U".
Do you have that in the mapping table?
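
A minimal C# sketch of that reading, for illustration only. It assumes that each character of the string returned for a CString holds one raw byte and that the font uses two-byte codes with the high byte first. Both are assumptions based on the dump in this thread, not something confirmed by the PDFsharp documentation. "CodeSplitter" and "ToCodes" are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

public static class CodeSplitter
{
    // Treats every char of the raw string as one byte and combines each
    // pair of bytes (high byte first) into one 16-bit code.
    // Assumption: the font uses two-byte codes, as the dump suggests.
    public static List<int> ToCodes(string raw)
    {
        if (raw.Length % 2 != 0)
            throw new ArgumentException("Expected an even number of bytes.", nameof(raw));

        var codes = new List<int>(raw.Length / 2);
        for (int i = 0; i < raw.Length; i += 2)
            codes.Add(((raw[i] & 0xFF) << 8) | (raw[i + 1] & 0xFF));
        return codes;
    }
}
```

Under these assumptions, ToCodes("\03\0U\0\u0081\0I") gives the four codes 0x0033, 0x0055, 0x0081 and 0x0049, and those are the values to look up in the mapping table.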

Author:  peter.pazurik [ Wed Dec 16, 2020 12:10 pm ]
Post subject:  Re: ContentReader.ReadContent not working for special signs

Unfortunately not.

And what about my question: is my approach OK, or are there any special API/classes for working with those mapping tables?

FYI, this is the whole table; it seems a bit small to me :)

{/CIDInit /ProcSet findresource begin
20 dict begin
begincmap
/CIDSystemInfo
<< /Registry (Adobe)
/Ordering (UCS)
/Supplement 0
>> def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
1 beginbfchar
<0003> <0020>
endbfchar
1 beginbfrange
<000B> <000C> <0028>
endbfrange
1 beginbfchar
<000F> <002C>
endbfchar
5 beginbfrange
<0011> <001D> <002E>
<0020> <0022> <003D>
<0024> <002C> <0041>
<002F> <0033> <004C>
<0035> <003A> <0052>
endbfrange
2 beginbfchar
<003D> <005A>
<0042> <005F>
endbfchar
2 beginbfrange
<0044> <004C> <0061>
<004E> <005D> <006B>
endbfrange
1 beginbfchar
<0062> <00C4>
endbfchar
1 beginbfrange
<0067> <0068> [<00D6> <00DC>]
endbfrange
8 beginbfchar
<006A> <00E0>
<006C> <00E4>
<0071> <00E8>
<007C> <00F6>
<0081> <00FC>
<0083> <00B0>
<008B> <00A9>
<00AB> <2026>
endbfchar
2 beginbfrange
<00B3> <00B4> <201C>
<00B5> <00B6> <2018>
endbfrange
2 beginbfchar
<00C4> <201E>
<00F0> <00B2>
endbfchar
endcmap
CMapName currentdict /CMap defineresource pop
end
end}
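
For readers following along, here is a rough C# sketch of how a table like this could be expanded into a lookup from code to Unicode text. It is not part of PDFsharp. "ToUnicodeCMap", "Parse" and the helper methods are hypothetical names, and only the "bfchar" and "bfrange" sections are handled, which is an assumption that covers the table above but not every possible CMap.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;
using System.Text.RegularExpressions;

public static class ToUnicodeCMap
{
    static readonly Regex Section = new Regex(@"begin(bfchar|bfrange)(.*?)end\1", RegexOptions.Singleline);
    static readonly Regex CharEntry = new Regex(@"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>");
    static readonly Regex RangeEntry = new Regex(@"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>\s*(?:<([0-9A-Fa-f]+)>|\[([^\]]*)\])");
    static readonly Regex Hex = new Regex(@"<([0-9A-Fa-f]+)>");

    // Expands all bfchar and bfrange entries into a code -> text dictionary.
    public static Dictionary<int, string> Parse(string cmap)
    {
        var map = new Dictionary<int, string>();
        foreach (Match section in Section.Matches(cmap))
        {
            string body = section.Groups[2].Value;
            if (section.Groups[1].Value == "bfchar")
            {
                // <code> <target>: a single code maps to a single target.
                foreach (Match m in CharEntry.Matches(body))
                    map[ParseCode(m.Groups[1].Value)] = HexToText(m.Groups[2].Value);
                continue;
            }
            foreach (Match m in RangeEntry.Matches(body))
            {
                int lo = ParseCode(m.Groups[1].Value);
                int hi = ParseCode(m.Groups[2].Value);
                if (m.Groups[3].Success)
                {
                    // <lo> <hi> <target>: consecutive codes map to consecutive characters.
                    string first = HexToText(m.Groups[3].Value);
                    string prefix = first.Substring(0, first.Length - 1);
                    char last = first[first.Length - 1];
                    for (int code = lo; code <= hi; code++)
                        map[code] = prefix + (char)(last + (code - lo));
                }
                else
                {
                    // <lo> <hi> [<t0> <t1> ...]: one explicit target per code.
                    MatchCollection targets = Hex.Matches(m.Groups[4].Value);
                    for (int code = lo; code <= hi && code - lo < targets.Count; code++)
                        map[code] = HexToText(targets[code - lo].Groups[1].Value);
                }
            }
        }
        return map;
    }

    static int ParseCode(string hex)
    {
        return int.Parse(hex, NumberStyles.HexNumber, CultureInfo.InvariantCulture);
    }

    // Targets are UTF-16BE code units written as groups of four hex digits.
    static string HexToText(string hex)
    {
        hex = hex.PadLeft((hex.Length + 3) / 4 * 4, '0');
        var sb = new StringBuilder();
        for (int i = 0; i < hex.Length; i += 4)
            sb.Append((char)ParseCode(hex.Substring(i, 4)));
        return sb.ToString();
    }
}
```

Parse is meant to receive the text of the "/ToUnicode" stream. With the table above, the resulting dictionary maps 0x0049 to "f" and 0x0055 to "r".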

Author:  peter.pazurik [ Wed Dec 16, 2020 12:27 pm ]
Post subject:  Re: ContentReader.ReadContent not working for special signs

By the way, I am really baffled by this. I am doing really basic stuff here. PDFsharp is, in my opinion, a widely used tool. Has no one encountered such problems before me? It cannot be that difficult...

The code is super simple:

Code:
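using System.Text;
using PdfSharp.Pdf;
using PdfSharp.Pdf.Content;
using PdfSharp.Pdf.IO;

using (var document = PdfReader.Open(filePath, PdfDocumentOpenMode.ReadOnly))
{
   // Collects the text of all pages; ExtractText is my own helper (described below).
   var result = new StringBuilder();
   foreach (PdfPage page in document.Pages)
   {
      // ReadContent parses the page's content stream into a CSequence.
      ExtractText(ContentReader.ReadContent(page), result);
      result.AppendLine();
   }
}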


Within the ExtractText method I just walk through the elements of the CSequence returned by the ReadContent method. When an element is of type CString, it's mostly OK, but some of the words are not "translated". So an excerpt of a page looks like this: " |Status| |in Arbeit| |1| |von| |326| |Restricted| |\03\0U\0\u0081\0I| |1| |1| |1| |SAP"

Author:  TH-Soft [ Wed Dec 16, 2020 12:32 pm ]
Post subject:  Re: ContentReader.ReadContent not working for special signs

PDFsharp was not designed to extract text from PDF.

We have the line "<004E> <005D> <006B>". This line maps "\0U" aka "\0055" to 0x72 or "r".

"\0I" is "\0049". The "<0044> <004C> <0061>" fits. This gives us "\0066" or "f".

Author:  peter.pazurik [ Wed Dec 16, 2020 12:47 pm ]
Post subject:  Re: ContentReader.ReadContent not working for special signs

PDFsharp was the only tool that gave me a way to distinguish between table cells. At least in my PDF documents... In my ExtractText method I just replaced all Tj operators with '|', so when I get for example " |xxx| " I know that I have the text of one cell. Of course there are more possible cases/scenarios, so I have to build some logic around it, but so far I have not found anything else useful. However, I am completely new to this field, so I really appreciate any advice.

Author:  peter.pazurik [ Thu Dec 17, 2020 4:45 pm ]
Post subject:  Re: ContentReader.ReadContent not working for special signs

Hello Stefan,

Could you please elaborate on how you got from "\0I" (or "\0049") to "<0044> <004C> <0061>" and then to "\0066"? Is there any API for that mapping? I don't see the connection there...
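
For anyone with the same question, the arithmetic can be spelled out. A "<first> <last> <target>" line covers every code from first to last, and a code inside it maps to target + (code - first). For "\0U" the code is 0x0055, which lies in <004E>..<005D>, so the result is 0x6B + (0x55 - 0x4E) = 0x72, which is "r". For "\0I" the code is 0x0049, inside <0044>..<004C>, giving 0x61 + (0x49 - 0x44) = 0x66, which is "f". Below is a small C# sketch of the same lookup, with a few entries copied by hand from the table in this thread. "BfRangeLookup", "Map" and "Decode" are hypothetical names, and nothing in this thread suggests that PDFsharp offers such a helper itself.

```csharp
using System;
using System.Text;

public static class BfRangeLookup
{
    // { first code, last code, Unicode value of the first code },
    // copied by hand from the mapping table posted in this thread.
    static readonly int[][] Ranges =
    {
        new[] { 0x002F, 0x0033, 0x004C },
        new[] { 0x0044, 0x004C, 0x0061 },
        new[] { 0x004E, 0x005D, 0x006B },
        new[] { 0x0081, 0x0081, 0x00FC }, // bfchar entry, treated as a range of length 1
    };

    // Maps one code to its character: target + (code - first).
    public static char Map(int code)
    {
        foreach (int[] r in Ranges)
            if (code >= r[0] && code <= r[1])
                return (char)(r[2] + (code - r[0]));
        return '\uFFFD'; // replacement character for codes without a mapping
    }

    public static string Decode(params int[] codes)
    {
        var sb = new StringBuilder();
        foreach (int code in codes)
            sb.Append(Map(code));
        return sb.ToString();
    }
}
```

Decode(0x0033, 0x0055, 0x0081, 0x0049) returns "Prüf", assuming that "\03" in the dump stands for the two bytes 0x00 and 0x33.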
