PDFsharp & MigraDoc Foundation
https://forum.pdfsharp.net/

Unicode Support in PDF Properties
https://forum.pdfsharp.net/viewtopic.php?f=4&t=860

Author:  vbAddict [ Thu Sep 03, 2009 5:57 pm ]
Post subject:  Unicode Support in PDF Properties

It would be nice to support Unicode characters in PDF Document Properties. Specifically, I am trying to write Greek text to the document title and author, but all I get is some unreadable characters.

Author:  Thomas Hoevel [ Mon Nov 23, 2015 12:25 pm ]
Post subject:  Re: Unicode Support in PDF Properties

As of now you can use Unicode strings in properties with a little trick:
Code:
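// Requires: using System.Text;
private static string EncodingHack(string str)
{
    // Encode the text as UTF-16 with big-endian byte order.
    var encoding = Encoding.BigEndianUnicode;
    var bytes = encoding.GetBytes(str);
    var sb = new StringBuilder();
    // Prepend the byte order mark 0xFE 0xFF.
    sb.Append((char)254);
    sb.Append((char)255);
    // Append each encoded byte as a separate char (code points 0 to 255).
    for (int i = 0; i < bytes.Length; ++i)
    {
        sb.Append((char)bytes[i]);
    }
    return sb.ToString();
}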



And this is how to use the trick (text includes Greek and Cyrillic characters):
Code:
var document = new PdfDocument();
document.Info.Author = EncodingHack("ABCabcΑΒΓαβγАБВабв");
document.Info.Title = EncodingHack("ABCabcΑΒΓαβγАБВабв");


The trick was developed for and tested with PDFsharp 1.50.
Future versions of PDFsharp will most likely implement support for Unicode strings for properties. With those future versions, the trick will no longer be necessary - but most likely the trick will not do any harm.
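
To make the byte layout of the trick easier to inspect, here is a minimal sketch that uses only the .NET base class library. The class name EncodingHackDemo and the Decode helper are hypothetical and not part of PDFsharp; Decode simply reverses the trick so the result can be checked in a round trip.

```csharp
using System;
using System.Linq;
using System.Text;

// Hypothetical demo class, not part of PDFsharp.
public static class EncodingHackDemo
{
    // Same idea as the trick: 0xFE 0xFF followed by the UTF-16BE bytes,
    // with every byte stored as one char.
    public static string Encode(string str)
    {
        byte[] bytes = Encoding.BigEndianUnicode.GetBytes(str);
        var sb = new StringBuilder(bytes.Length + 2);
        sb.Append((char)0xFE).Append((char)0xFF);
        foreach (byte b in bytes)
        {
            sb.Append((char)b);
        }
        return sb.ToString();
    }

    // Hypothetical inverse: strips the 0xFE 0xFF prefix and decodes the rest.
    // Strings without the prefix are returned unchanged.
    public static string Decode(string raw)
    {
        if (raw.Length < 2 || raw[0] != (char)0xFE || raw[1] != (char)0xFF)
        {
            return raw;
        }
        byte[] bytes = raw.Skip(2).Select(c => (byte)(c & 0xFF)).ToArray();
        return Encoding.BigEndianUnicode.GetString(bytes);
    }
}
```

For the single Greek letter "Α" (U+0391), Encode returns four chars with the values 0xFE, 0xFF, 0x03, 0x91, and Decode turns that string back into "Α".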

Author:  ta221 [ Sat Jul 29, 2017 12:16 pm ]
Post subject:  Re: Unicode Support in PDF Properties

I implemented my own code using a new class: "UTF16BEEncoding"

This can be also called: "BigEndianUnicodeEncoding".

The class is very similar to the "RawUnicodeEncoding".
The only change in the new class is to add (+ 8) to the byte size (instead of (2 * count) then write (2 * count + 2)).
It's important to add (+2) and (-2) in the correct places.

Additionally, in the class method "GetBytes" these lines should be included before the loop:
bytes[byteIndex++] = 0xFE;
bytes[byteIndex++] = 0xFF;

The class and its methods should be used in ToStringLiteral and ToHexStringLiteral, with the necessary enum values declared for the new encoding type.

In the constructor of the "PdfString" class I set the new flag so that it uses the new encoding "BigEndianUnicodeEncoding".
Then, when you change the document's info (part of the Trailer section) using new strings, the text will be shown properly after the PDF is saved.

Important!
I implemented the code on the project v1.32 since I have found encoding issues after saving pdf files using v1.50 beta 3b.

Update
As for a final solution which fully fixes the issue:
in the ScanHexadecimalString method of class Lexer, under the condition:
if(count > 2 && chars[0] == (char)0xFE && chars[1] == (char)0xFF)
the following line should be added at the end of the if block:
return this.symbol = Symbol.BigEndianUnicodeHexString;

Note: The symbol called "BigEndianUnicodeHexString" should be declared.
Another thing is to update the ParseObject method in class Parser:
add the following inside the loop:
case Symbol.BigEndianUnicodeHexString:
this.stack.Shift(new PdfString(this.lexer.Token, PdfStringFlags.BigEndianUnicodeEncoding | PdfStringFlags.HexLiteral));
break;

Note: "PdfStringFlags.BigEndianUnicodeEncoding" should be declared, too.

Update 2:
I applied the fix to the latest version v1.50 beta 4!
"BigEndianUnicode" should be used instead of "Unicode" !

Author:  TH-Soft [ Sat Jul 29, 2017 2:46 pm ]
Post subject:  Re: Unicode Support in PDF Properties

Thomas Hoevel wrote:
The trick was developed for and tested with PDFsharp 1.50.
I made a test today with PDFsharp 1.50 beta 4: Unicode in properties works "out of the box" without requiring the "EncodingHack" or any change to the source code.

ta221 wrote:
I implemented the code on the project v1.32 since I have found encoding issues after saving pdf files using v1.50 beta 3.
PDFsharp 1.50 beta 4 is the supported version.
Dozens of bugs of version 1.32 have been fixed with version 1.50 beta 4, so it is the recommended version.
If you find issues with version 1.50 beta 4 (unmodified), please report them using the Issue Submission Template.

Author:  ta221 [ Sat Jul 29, 2017 3:31 pm ]
Post subject:  Re: Unicode Support in PDF Properties

I applied the fix to the latest version v1.50 beta 4!
"BigEndianUnicode" should be used instead of "Unicode" everywhere BigEndianUnicode was found in the PDF document as I described above.

It's working properly now after the fix is applied.
I have tested it on my end carefully.

Author:  TH-Soft [ Sat Jul 29, 2017 4:39 pm ]
Post subject:  Re: Unicode Support in PDF Properties

ta221 wrote:
I applied the fix to the latest version v1.50 beta 4!
Why apply the fix if it works without it?

ta221 wrote:
"BigEndianUnicode" should be used instead of "Unicode" everywhere it was found in the document as I described above.
Why? It works without those changes.

Author:  ta221 [ Sat Jul 29, 2017 5:01 pm ]
Post subject:  Re: Unicode Support in PDF Properties

TH-Soft wrote:
Why apply the fix if it works without it?

Why? It works without those changes.


I found encoding errors (gibberish text) in the document properties without applying that fix.

Author:  TH-Soft [ Sat Jul 29, 2017 5:06 pm ]
Post subject:  Re: Unicode Support in PDF Properties

ta221 wrote:
I found encoding errors (gibberish text) in the document properties without applying that fix.
Give me examples that allow me to replicate this.

I found no problems with "®" (which is not even Unicode), nor with other test strings I tried.

Author:  ta221 [ Sat Jul 29, 2017 5:29 pm ]
Post subject:  Re: Unicode Support in PDF Properties

I sent you an example by PM.
I have tested it on my end.

Author:  TH-Soft [ Sat Jul 29, 2017 5:37 pm ]
Post subject:  Re: Unicode Support in PDF Properties

ta221 wrote:
I sent you an example by PM.
You sent me a link to a PDF file. I do not get any problems with my test code.

Author:  TH-Soft [ Sat Jul 29, 2017 5:53 pm ]
Post subject:  Re: Unicode Support in PDF Properties

vbAddict wrote:
Specifically, I am trying to write Greek text to the document title and author, but all I get is some unreadable characters.
This works now with PDFsharp 1.50 beta 4 "out of the box".
So far I was not able to replicate any problems with PDFsharp 1.50 beta 4 with respect to Unicode in properties, therefore I think no patches or fixes are required.

The Issue Submission Template can be used to report problems with the unmodified PDFsharp 1.50 beta 4.

Author:  ta221 [ Sat Jul 29, 2017 6:08 pm ]
Post subject:  Re: Unicode Support in PDF Properties

After applying my change/fix, I tested on my end and verified that there is no need to apply changes to
the document properties set via document.Info.
I have tested it using security encryption, too.

Furthermore, no special manipulation of the properties is required when using document.Info.
I have tested the setString method, too.

I found this solution to resolve this case for me permanently.

Author:  TH-Soft [ Sat Jul 29, 2017 6:34 pm ]
Post subject:  Re: Unicode Support in PDF Properties

The Issue Submission Template can be used to report problems with the unmodified PDFsharp 1.50 beta 4.

Author:  ta221 [ Sun Jul 30, 2017 6:45 pm ]
Post subject:  Re: Unicode Support in PDF Properties

TH-Soft wrote:
The Issue Submission Template can be used to report problems with the unmodified PDFsharp 1.50 beta 4.


I reported the problem by PM, with complete runnable code that reproduces the encoding error. The example I sent imports a file, encrypts it with a user password and finally saves the file. Upon opening the encrypted file, the result will be shown.

I found that this issue occurs only when the PDF file is encrypted, with both 40-bit and 128-bit encryption.
The garbled text differs slightly between the two encryption levels, but gibberish is shown in both cases.

Applying the fix on my end as I described helped to resolve the encoding error issue when the PDF file is encrypted with a password.

Thank you for listening and I hope my information will be useful.

Author:  TH-Soft [ Sun Jul 30, 2017 9:02 pm ]
Post subject:  Re: Unicode Support in PDF Properties

ta221 wrote:
As for a final solution which fully fixes the issue:
in the ScanHexadecimalString method of class Lexer, under the condition:
if(count > 2 && chars[0] == (char)0xFE && chars[1] == (char)0xFF)
the following line should be added at the end of the if block:
return this.symbol = Symbol.BigEndianUnicodeHexString;
The string is a "UnicodeHexString" and I'm afraid it might have unwanted side effects to introduce a new, secondary name "BigEndianUnicodeHexString" for "UnicodeHexString".
I can confirm there is a problem with Unicode properties and password protection. It's not the only problem with password protection.
Having two symbol types for one type of string is potentially harmful.

Bedtime for today. I'll have a closer look one of these days.

ta221 wrote:
The example I sent imports a file, encrypts it with a user password and finally saves the file.
The idea of the Issue Submission Template: users who encounter a problem send us a ZIP file with a complete Visual Studio solution that allows us to replicate the issue, including all PDF files etc. that are needed.
Thanks for the code fragment. I was able to replicate the issue.

Author:  ta221 [ Mon Jul 31, 2017 5:15 am ]
Post subject:  Re: Unicode Support in PDF Properties

TH-Soft wrote:
ta221 wrote:
As for a final solution which fully fixes the issue:
in the ScanHexadecimalString method of class Lexer, under the condition:
if(count > 2 && chars[0] == (char)0xFE && chars[1] == (char)0xFF)
the following line should be added at the end of the if block:
return this.symbol = Symbol.BigEndianUnicodeHexString;
The string is a "UnicodeHexString" and I'm afraid it might have unwanted side effects to introduce a new, secondary name "BigEndianUnicodeHexString" for "UnicodeHexString".
I can confirm there is a problem with Unicode properties and password protection. It's not the only problem with password protection.
Having two symbol types for one type of string is potentially harmful.


Note that I named the symbol Symbol.BigEndianUnicodeHexString.
However, I think it would be more correct to call it Symbol.BigEndianUnicodeString, matching the way Symbol.UnicodeString is named.

I did some tests yesterday including importing a PDF that was saved by PDFsharp previously.
After the various tests (including opening an encrypted PDF which was encrypted by PDFsharp) - I found no special issues, at least not to the document's properties.
I found that adding 0xFE and 0xFF before each UnicodeString / UnicodeHexString (each string which meets the condition chars[0] == (char)0xFE && chars[1] == (char)0xFF) is important for Unicode characters to be displayed correctly when reading an encrypted PDF file.
Also, I found this didn't harm non-encrypted files.

However, unforeseen issues may still occur (I really hope that there are none).

Update as of 01/08:
After further investigation, I decided the correct thing is to split the cases into Symbol.BigEndianUnicodeHexString and Symbol.BigEndianUnicodeString to ensure a correct result.
Furthermore, the new BigEndianUnicodeEncoding class, which is based on the RawUnicodeEncoding class, should use (2 * count + 2) and (+2) and (-2) in the correct places instead of (+ 8 / - 8).
This is because only the prefix "0xFE 0xFF" is added and there are no additional bytes beyond it.
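
The byte arithmetic described here (two bytes per char plus the two prefix bytes) can be sketched as a standalone System.Text.Encoding subclass. This is only an illustration under assumptions: the class name BomPrefixedUtf16BeEncoding is hypothetical, the code is not taken from PDFsharp's RawUnicodeEncoding, and it does not show the Lexer, Parser or PdfString changes.

```csharp
using System;
using System.Text;

// Hypothetical sketch, not PDFsharp source code.
// Writes 0xFE 0xFF first, then every char as two bytes, high byte first.
public sealed class BomPrefixedUtf16BeEncoding : Encoding
{
    private const int PrefixLength = 2; // only 0xFE 0xFF, nothing else

    public override int GetByteCount(char[] chars, int index, int count)
    {
        return 2 * count + PrefixLength;
    }

    public override int GetBytes(char[] chars, int charIndex, int charCount,
                                 byte[] bytes, int byteIndex)
    {
        int start = byteIndex;
        // The two prefix bytes are written before the loop.
        bytes[byteIndex++] = 0xFE;
        bytes[byteIndex++] = 0xFF;
        for (int i = 0; i < charCount; i++)
        {
            char ch = chars[charIndex + i];
            bytes[byteIndex++] = (byte)((ch >> 8) & 0xFF); // high byte
            bytes[byteIndex++] = (byte)(ch & 0xFF);        // low byte
        }
        return byteIndex - start;
    }

    public override int GetCharCount(byte[] bytes, int index, int count)
    {
        return Math.Max(0, (count - PrefixLength) / 2);
    }

    // Assumes the input starts with the two prefix bytes and skips them.
    public override int GetChars(byte[] bytes, int byteIndex, int byteCount,
                                 char[] chars, int charIndex)
    {
        int start = charIndex;
        int end = byteIndex + byteCount;
        for (int i = byteIndex + PrefixLength; i + 1 < end; i += 2)
        {
            chars[charIndex++] = (char)((bytes[i] << 8) | bytes[i + 1]);
        }
        return charIndex - start;
    }

    public override int GetMaxByteCount(int charCount)
    {
        return 2 * charCount + PrefixLength;
    }

    public override int GetMaxCharCount(int byteCount)
    {
        return Math.Max(0, (byteCount - PrefixLength) / 2);
    }
}
```

With this sketch, new BomPrefixedUtf16BeEncoding().GetBytes("Α") yields the four bytes 0xFE, 0xFF, 0x03, 0x91, which matches the (2 * count + 2) size for a single char.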

Thank you for listening.

Author:  TH-Soft [ Thu Aug 03, 2017 8:56 am ]
Post subject:  Re: Unicode Support in PDF Properties

There was a TODO in "FormatStringLiteral" and the combination of Unicode and encryption was not handled properly.
I think I got it working now.
I will do some more testing, but I hope my changes will work and can be published soon.

All times are UTC