PDFsharp 0.9 contains an early implementation of PdfSharp.Pdf.Content.ContentReader, which converts a content stream into a sequence of objects derived from CObject. My current code also has a ContentWriter that converts the objects back to a content stream.
This works fine, but it is just the beginning of the problem. The meaning of
Code:
<0044>Tj
can be determined only with the font that is used. PDF content streams have no native support for Unicode. To use Unicode, a so-called CID (character ID) font must be derived from the underlying TrueType font. 0x0044 is NOT the Unicode character but the glyph ID within the TrueType font the CID font is based on.
To look up which character corresponds to 0x0044 you must use the ToUnicodeMap (the CMap stream referenced by the font's /ToUnicode entry). It maps glyph IDs to Unicode characters. In general, glyph IDs and Unicode code points are not the same values.
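To make the reverse lookup concrete, here is a minimal Python sketch (not PDFsharp code). It assumes you have already extracted the ToUnicode CMap as text; the SAMPLE_CMAP fragment and the function names parse_to_unicode and decode_hex_string are made up for illustration. It only handles the simple bfchar and bfrange forms and assumes 2-byte codes, which is enough to show the idea.

```python
import re

# Made-up ToUnicode CMap fragment for illustration only; a real one comes
# from the stream referenced by the font dictionary's /ToUnicode entry.
SAMPLE_CMAP = """
2 beginbfchar
<0044> <0061>
<0003> <0020>
endbfchar
1 beginbfrange
<0045> <0047> <0062>
endbfrange
"""

HEX = r"<([0-9A-Fa-f]+)>"


def parse_to_unicode(cmap_text):
    """Return a dict mapping character codes (int) to Unicode strings."""
    mapping = {}
    # bfchar: one source code maps to one UTF-16BE encoded destination.
    for section in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(HEX + r"\s*" + HEX, section):
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    # bfrange: codes lo..hi map to consecutive code points starting at dst.
    # Simplification: the array form and surrogate pairs are not handled.
    for section in re.findall(r"beginbfrange(.*?)endbfrange", cmap_text, re.S):
        for lo, hi, dst in re.findall(HEX + r"\s*" + HEX + r"\s*" + HEX, section):
            start = int(dst, 16)
            for offset, code in enumerate(range(int(lo, 16), int(hi, 16) + 1)):
                mapping[code] = chr(start + offset)
    return mapping


def decode_hex_string(hex_operand, mapping):
    """Decode a hex string operand such as '<0044>' using 2-byte codes."""
    digits = hex_operand.strip("<>")
    codes = [int(digits[i:i + 4], 16) for i in range(0, len(digits), 4)]
    # Codes missing from the map become U+FFFD (replacement character).
    return "".join(mapping.get(code, "\ufffd") for code in codes)


to_unicode = parse_to_unicode(SAMPLE_CMAP)
print(decode_hex_string("<0044>", to_unicode))
```

With this invented map, <0044> decodes to "a" although 0x0044 is the Unicode value of "D", which is exactly why the code alone tells you nothing without the font.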
To save space, tools like InDesign typically embed only a subset of a Unicode font in the PDF file. The subset contains only the glyphs used in your document. To simplify the internal tables of the subset font, the glyphs are renumbered when the subset font is created. Without the corresponding ToUnicodeMap even Acrobat cannot 'read' your Unicode text anymore (i.e. it cannot copy selected text to the clipboard), even though you can read the text very well because you interpret the drawn glyphs…
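The effect of renumbering can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the glyph IDs in FULL_FONT_GIDS are invented, the function build_subset is hypothetical, and real tools may number the subset glyphs in a different order.

```python
# Invented glyph IDs of a full font, for illustration only.
FULL_FONT_GIDS = {"H": 43, "e": 72, "l": 79, "o": 82}


def build_subset(text, full_font_gids):
    """Renumber the glyphs used by `text` and build the matching reverse map."""
    old_to_new = {}
    to_unicode = {}
    for ch in text:
        old_gid = full_font_gids[ch]
        if old_gid not in old_to_new:
            # Glyph 0 is .notdef in a TrueType font, so start counting at 1.
            new_gid = len(old_to_new) + 1
            old_to_new[old_gid] = new_gid
            to_unicode[new_gid] = ch
    codes = [old_to_new[full_font_gids[ch]] for ch in text]
    return codes, to_unicode


codes, to_unicode = build_subset("Hello", FULL_FONT_GIDS)

# What would end up in the content stream as the Tj operand.
hex_operand = "<" + "".join(f"{code:04X}" for code in codes) + ">"
print(hex_operand)

# Only the reverse map turns the renumbered codes back into text.
print("".join(to_unicode[code] for code in codes))
```

The content stream then only contains the small renumbered codes. If the reverse map is thrown away, nothing in those numbers leads back to "Hello".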
The best you can do is to embed the whole font. Then the glyphs are not renumbered, and the glyph IDs and Unicode values partially match, shifted by an offset. With the help of the ToUnicodeMap you can encode or decode the text.
Regards
Stefan Lange