PDFsharp & MigraDoc Foundation

PDFsharp - A .NET library for processing PDF & MigraDoc Foundation - Creating documents on the fly
All times are UTC


Posted: Tue Dec 15, 2020 10:06 am

Joined: Tue Dec 15, 2020 6:30 am
Posts: 8
I am reading a table in a PDF using "ContentReader.ReadContent". Everything works well except for cells containing special characters such as umlauts. For example, the word "Prüf" is read as "\03\0U\0\u0081\0I", which doesn't seem to make sense in any encoding. Words without such characters are read as they are... Is there something extra I need to do in order to read words with special characters? Or should I use a completely different approach?

I am quite new to reading PDF files, so please excuse me if the solution is simple, but I have not found anything useful so far. Thank you very much for any help!

Peter


Posted: Tue Dec 15, 2020 10:19 am
PDFsharp Expert

Joined: Sat Mar 14, 2015 10:15 am
Posts: 629
Location: CCAA
Maybe you have to use the mapping table to decode the string.

_________________
Best regards
Thomas
(Freelance Software Developer with several years of MigraDoc/PDFsharp experience)


Posted: Tue Dec 15, 2020 10:58 am

Joined: Tue Dec 15, 2020 6:30 am
Posts: 8
Thanks, but I am not sure what you mean exactly. What type of string encoding do you mean? Could you please send me a link to an existing solution or example?

Just to explain myself: it seems that the encoding doesn't match any known encoding. Another example:
"Änderungsbeschreibung" -> "\0b\0Q\0G\0H\0U\0X\0Q\0J\0V\0E\0H\0V\0F\0K\0U\0H\0L\0E\0X\0Q\0J".

Reading words with standard characters works just fine...

TH-Soft wrote:
Maybe you have to use the mapping table to decode the string.


Last edited by peter.pazurik on Tue Dec 15, 2020 12:02 pm, edited 1 time in total.

Posted: Tue Dec 15, 2020 11:53 am
PDFsharp Expert

Joined: Sat Mar 14, 2015 10:15 am
Posts: 629
Location: CCAA
I haven't seen your PDF yet, so I can speculate only.

PDFs can contain a subset of a font and in that case the PDF may contain a mapping table between index and character. Strings in the PDF contain the index of the subset, not the character code.
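
A minimal sketch of what "index, not character code" can look like in practice. It assumes the font uses two-byte codes (high byte first) and that the string delivered by the content reader carries one byte per char; both are assumptions about the PDF in question, and the helper name `ToCodes` is hypothetical, not part of PDFsharp.

```csharp
using System;
using System.Collections.Generic;

public static class GlyphCodes
{
    // Combines each pair of chars (high byte, low byte) into one 16-bit code.
    // Assumption: every char in 'raw' holds a single byte value (0..255).
    public static List<int> ToCodes(string raw)
    {
        if (raw.Length % 2 != 0)
            throw new ArgumentException("Expected an even number of bytes.", nameof(raw));

        var codes = new List<int>(raw.Length / 2);
        for (int i = 0; i < raw.Length; i += 2)
            codes.Add(((raw[i] & 0xFF) << 8) | (raw[i + 1] & 0xFF));
        return codes;
    }
}
```

For the string from the first post, `GlyphCodes.ToCodes("\03\0U\0\u0081\0I")` yields the codes 0x0033, 0x0055, 0x0081 and 0x0049 (the "3" after the first "\0" is the character '3', i.e. 0x33). These are indexes that still have to be looked up in a mapping table.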



Posted: Tue Dec 15, 2020 12:18 pm

Joined: Tue Dec 15, 2020 6:30 am
Posts: 8
Oh, thank you! I will try to study this. Is there an example somewhere that shows how to get and use such a mapping table?

TH-Soft wrote:
I haven't seen your PDF yet, so I can speculate only.

PDFs can contain a subset of a font and in that case the PDF may contain a mapping table between index and character. Strings in the PDF contain the index of the subset, not the character code.


Posted: Tue Dec 15, 2020 12:59 pm
PDFsharp Expert

Joined: Sat Mar 14, 2015 10:15 am
Posts: 629
Location: CCAA
Here someone solved the task without sharing any code:
viewtopic.php?p=10564#p10564

I'm afraid I cannot point you to sample code.



Posted: Wed Dec 16, 2020 9:37 am

Joined: Tue Dec 15, 2020 6:30 am
Posts: 8
OK, so I found a way to get the mapping table: traverse the document elements "/Resources" -> "/Font" (PdfDictionary objects). This way I should get all fonts used, and each should have its own "/ToUnicode" element where a mapping table is defined. In my PDF document I found 4 fonts, but only one had a "/ToUnicode" element, so I have only one mapping table.

Back to my example: "\03\0U\0\u0081\0I" -> " Prüf"
In my mapping table I was able to find 2 values
<0003> <0020> (meaning ' ' in Unicode)
<0081> <00FC> (meaning 'ü' in Unicode)
So what about the remaining values, for example "\0U"? How do I map them when they are not in the mapping table? Is this really the right way to go, or is there a more convenient API to get the correct mapping table? I still feel that this is just a workaround and not the proper way to get to the mapping table. I am still missing some important information.

TH-Soft wrote:
Here someone solved the task without sharing any code:
viewtopic.php?p=10564#p10564

I'm afraid I cannot point you to sample code.


Posted: Wed Dec 16, 2020 10:31 am
PDFsharp Expert

Joined: Sat Mar 14, 2015 10:15 am
Posts: 629
Location: CCAA
Maybe "\0U" translates as "\0055". 0x55 is "U".
Do you have that in the mapping table?



Posted: Wed Dec 16, 2020 12:10 pm

Joined: Tue Dec 15, 2020 6:30 am
Posts: 8
Unfortunately not.

And what about my question: is my approach OK, or are there any special APIs/classes for working with those mapping tables?

FYI, this is the whole table; it seems a bit small to me :)

{/CIDInit /ProcSet findresource begin
20 dict begin
begincmap
/CIDSystemInfo
<< /Registry (Adobe)
/Ordering (UCS)
/Supplement 0
>> def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
1 beginbfchar
<0003> <0020>
endbfchar
1 beginbfrange
<000B> <000C> <0028>
endbfrange
1 beginbfchar
<000F> <002C>
endbfchar
5 beginbfrange
<0011> <001D> <002E>
<0020> <0022> <003D>
<0024> <002C> <0041>
<002F> <0033> <004C>
<0035> <003A> <0052>
endbfrange
2 beginbfchar
<003D> <005A>
<0042> <005F>
endbfchar
2 beginbfrange
<0044> <004C> <0061>
<004E> <005D> <006B>
endbfrange
1 beginbfchar
<0062> <00C4>
endbfchar
1 beginbfrange
<0067> <0068> [<00D6> <00DC>]
endbfrange
8 beginbfchar
<006A> <00E0>
<006C> <00E4>
<0071> <00E8>
<007C> <00F6>
<0081> <00FC>
<0083> <00B0>
<008B> <00A9>
<00AB> <2026>
endbfchar
2 beginbfrange
<00B3> <00B4> <201C>
<00B5> <00B6> <2018>
endbfrange
2 beginbfchar
<00C4> <201E>
<00F0> <00B2>
endbfchar
endcmap
CMapName currentdict /CMap defineresource pop
end
end}
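
A rough sketch of how a table like the one above could be turned into a lookup and applied to the raw strings. It is not a PDFsharp API; the class and method names (`ToUnicodeMap`, `Parse`, `Decode`) are made up for illustration. It assumes two-byte codes (matching the codespacerange <0000> <FFFF>), one byte per char in the raw string, and range targets that are single BMP code points.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;
using System.Text.RegularExpressions;

public static class ToUnicodeMap
{
    // Parses the bfchar and bfrange sections of a ToUnicode CMap into a code -> text lookup.
    public static Dictionary<int, string> Parse(string cmap)
    {
        var map = new Dictionary<int, string>();

        // bfchar entries: <code> <unicode>
        foreach (Match section in Regex.Matches(cmap, @"beginbfchar(.*?)endbfchar", RegexOptions.Singleline))
            foreach (Match m in Regex.Matches(section.Groups[1].Value, @"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>"))
                map[Hex(m.Groups[1].Value)] = Utf16(m.Groups[2].Value);

        // bfrange entries: <lo> <hi> <start>  or  <lo> <hi> [<a> <b> ...]
        foreach (Match section in Regex.Matches(cmap, @"beginbfrange(.*?)endbfrange", RegexOptions.Singleline))
            foreach (Match m in Regex.Matches(section.Groups[1].Value,
                @"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>\s*(?:<([0-9A-Fa-f]+)>|\[([^\]]*)\])"))
            {
                int lo = Hex(m.Groups[1].Value), hi = Hex(m.Groups[2].Value);
                if (m.Groups[3].Success)
                {
                    // Consecutive codes map to consecutive characters.
                    // Assumption: the start value is a single BMP code point.
                    int start = Hex(m.Groups[3].Value);
                    for (int code = lo; code <= hi; code++)
                        map[code] = char.ConvertFromUtf32(start + (code - lo));
                }
                else
                {
                    // Array form: one explicit target per code in the range.
                    var targets = Regex.Matches(m.Groups[4].Value, @"<([0-9A-Fa-f]+)>");
                    for (int i = 0; i < targets.Count && lo + i <= hi; i++)
                        map[lo + i] = Utf16(targets[i].Groups[1].Value);
                }
            }
        return map;
    }

    // Decodes a raw string of two-byte codes; unmapped codes become U+FFFD.
    public static string Decode(string raw, Dictionary<int, string> map)
    {
        var sb = new StringBuilder();
        for (int i = 0; i + 1 < raw.Length; i += 2)
        {
            int code = ((raw[i] & 0xFF) << 8) | (raw[i + 1] & 0xFF);
            sb.Append(map.TryGetValue(code, out var text) ? text : "\uFFFD");
        }
        return sb.ToString();
    }

    private static int Hex(string s) =>
        int.Parse(s, NumberStyles.HexNumber, CultureInfo.InvariantCulture);

    // Target values are UTF-16BE hex, four digits per code unit.
    private static string Utf16(string hex)
    {
        var sb = new StringBuilder();
        for (int i = 0; i + 4 <= hex.Length; i += 4)
            sb.Append((char)Hex(hex.Substring(i, 4)));
        return sb.ToString();
    }
}
```

With the table above passed to `Parse`, `Decode("\03\0U\0\u0081\0I", map)` gives "Prüf": the codes are 0x0033, 0x0055, 0x0081 and 0x0049, and three of them are covered by bfrange lines rather than bfchar lines, which is why they do not show up as literal entries in the table.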

TH-Soft wrote:
Maybe "\0U" translates as "\0055". 0x55 is "U".
Do you have that in the mapping table?


Posted: Wed Dec 16, 2020 12:27 pm

Joined: Tue Dec 15, 2020 6:30 am
Posts: 8
By the way, I am really baffled by this. I am doing really basic stuff here, and PDFsharp is, in my opinion, a widely used tool. Has no one encountered such problems before me? It cannot be that difficult...

The code is super simple:

Code:
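// Requires: using System.Text; using PdfSharp.Pdf; using PdfSharp.Pdf.IO; using PdfSharp.Pdf.Content;
// Declared outside the using block so the text is still available afterwards.
var result = new StringBuilder();
using (var document = PdfReader.Open(filePath, PdfDocumentOpenMode.ReadOnly))
{
   foreach (PdfPage page in document.Pages)
   {
      // ExtractText is my own method; it walks the CSequence and appends text to result.
      ExtractText(ContentReader.ReadContent(page), result);
      result.AppendLine();
   }
}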


Within the ExtractText method I just walk through the elements in the CSequence returned by the ReadContent method. When an element is of type CString, it is mostly OK, but some of the words are not "translated". So an excerpt of a page looks like this: " |Status| |in Arbeit| |1| |von| |326| |Restricted| |\03\0U\0\u0081\0I| |1| |1| |1| |SAP"


Last edited by peter.pazurik on Wed Dec 16, 2020 12:35 pm, edited 1 time in total.

Posted: Wed Dec 16, 2020 12:32 pm
PDFsharp Expert

Joined: Sat Mar 14, 2015 10:15 am
Posts: 629
Location: CCAA
PDFsharp was not designed to extract text from PDF.

We have the line "<004E> <005D> <006B>". This line maps "\0U" aka "\0055" to 0x72 or "r".

"\0I" is "\0049". The "<0044> <004C> <0061>" fits. This gives us "\0066" or "f".

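The arithmetic behind those two lines, written out as a small sketch. A bfrange entry `<lo> <hi> <start>` maps code `lo` to `start`, `lo + 1` to `start + 1`, and so on up to `hi`. The class and method names here are hypothetical and not part of PDFsharp.

```csharp
using System;

public static class BfRange
{
    // Maps 'code' through one bfrange entry <lo> <hi> <start>.
    // Returns null when the code lies outside the range.
    public static char? Map(int code, int lo, int hi, int start)
    {
        if (code < lo || code > hi)
            return null;
        return (char)(start + (code - lo));
    }
}
```

`BfRange.Map(0x0055, 0x004E, 0x005D, 0x006B)` gives 'r', because 0x55 is 7 above 0x4E and 0x6B + 7 = 0x72. `BfRange.Map(0x0049, 0x0044, 0x004C, 0x0061)` gives 'f', because 0x49 is 5 above 0x44 and 0x61 + 5 = 0x66.
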


Posted: Wed Dec 16, 2020 12:47 pm

Joined: Tue Dec 15, 2020 6:30 am
Posts: 8
PDFsharp was the only tool that gave me a way to distinguish between table cells, at least in my PDF documents. In my ExtractText method I just replaced all Tj operators with '|', and when I get, for example, " |xxx| ", I know that I have the text from one cell. Of course there are more possible cases/scenarios, so I have to build some logic around it, but so far I have not found anything else useful. However, I am completely new to this field, so I really appreciate any advice.

TH-Soft wrote:
PDFsharp was not designed to extract text from PDF.

We have the line "<004E> <005D> <006B>". This line maps "\0U" aka "\0055" to 0x72 or "r".

"\0I" is "\0049". The "<0044> <004C> <0061>" fits. This gives us "\0066" or "f".


Posted: Thu Dec 17, 2020 4:45 pm

Joined: Tue Dec 15, 2020 6:30 am
Posts: 8
Hello Stefan,

could you please elaborate on how you got from "\0I" (or "\0049") to "<0044> <004C> <0061>" and then to "\0066"? Is there any API for that mapping? I don't see the connection there...

TH-Soft wrote:
PDFsharp was not designed to extract text from PDF.

We have the line "<004E> <005D> <006B>". This line maps "\0U" aka "\0055" to 0x72 or "r".

"\0I" is "\0049". The "<0044> <004C> <0061>" fits. This gives us "\0066" or "f".

