PDFsharp & MigraDoc Foundation

PDFsharp - A .NET library for processing PDF & MigraDoc Foundation - Creating documents on the fly
All times are UTC




Posted: Thu Sep 03, 2009 5:57 pm

Joined: Thu Sep 03, 2009 5:45 pm
Posts: 1
Location: Greece
It would be nice to support Unicode characters in PDF Document Properties. Specifically, I am trying to write Greek text to the document title and author, but all I get is some unreadable characters.


Posted: Mon Nov 23, 2015 12:25 pm
empira Employee

Joined: Mon Oct 16, 2006 8:16 am
Posts: 2720
Location: Cologne, Germany
As of now you can use Unicode strings in properties with a little trick:
Code:
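// Requires: using System.Text;
private static string EncodingHack(string str)
{
    // Encode the text as UTF-16BE (no byte order mark is emitted by GetBytes).
    var encoding = Encoding.BigEndianUnicode;
    var bytes = encoding.GetBytes(str);
    var sb = new StringBuilder();
    // Prefix with the byte order mark FE FF, one char per byte.
    sb.Append((char)254);
    sb.Append((char)255);
    // Store every encoded byte as a single char.
    for (int i = 0; i < bytes.Length; ++i)
    {
        sb.Append((char)bytes[i]);
    }
    return sb.ToString();
}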



And this is how to use the trick (text includes Greek and Cyrillic characters):
Code:
var document = new PdfDocument();
document.Info.Author = EncodingHack("ABCabcΑΒΓαβγАБВабв");
document.Info.Title = EncodingHack("ABCabcΑΒΓαβγАБВабв");


The trick was developed for and tested with PDFsharp 1.50.
Future versions of PDFsharp will most likely implement support for Unicode strings for properties. With those future versions, the trick will no longer be necessary - but most likely the trick will not do any harm.
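
To make the effect of the trick visible without PDFsharp, here is a minimal, self-contained sketch. The class name EncodingHackDemo and the ToHex helper are made up for this illustration; only EncodingHack mirrors the code above. It shows that the resulting string holds one char per byte: FE FF first, then the UTF-16BE bytes of the text.

```csharp
using System;
using System.Linq;
using System.Text;

// Illustration only: EncodingHackDemo and ToHex are hypothetical names, not part of PDFsharp.
public static class EncodingHackDemo
{
    // Same idea as the EncodingHack method above: UTF-16BE bytes,
    // one char per byte, prefixed with the byte order mark FE FF.
    public static string EncodingHack(string str)
    {
        byte[] bytes = Encoding.BigEndianUnicode.GetBytes(str);
        var sb = new StringBuilder(bytes.Length + 2);
        sb.Append((char)0xFE);
        sb.Append((char)0xFF);
        foreach (byte b in bytes)
        {
            sb.Append((char)b);
        }
        return sb.ToString();
    }

    // Renders the char codes of a string as two-digit hex values separated by spaces.
    public static string ToHex(string s)
    {
        return string.Join(" ", s.Select(c => ((int)c).ToString("X2")));
    }
}
```

For example, EncodingHackDemo.ToHex(EncodingHackDemo.EncodingHack("Aα")) yields "FE FF 00 41 03 B1": the prefix, then U+0041 and U+03B1 in big-endian order.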

_________________
Regards
Thomas Hoevel
PDFsharp Team


Posted: Sat Jul 29, 2017 12:16 pm

Joined: Sat Jul 29, 2017 11:53 am
Posts: 7
I implemented my own code using a new class: "UTF16BEEncoding"

This could also be called: "BigEndianUnicodeEncoding".

The class is very similar to the "RawUnicodeEncoding".
The only change in the new class is to add (+ 8) to the byte size (instead of (2 * count), write (2 * count + 2)).
It's important to add (+2) and (-2) in the correct places.

Additionally, in the class method "GetBytes" these lines should be included before the loop:
bytes[byteIndex++] = 0xFE;
bytes[byteIndex++] = 0xFF;

The class and its methods should be used in ToStringLiteral and ToHexStringLiteral, with the necessary enum values declared for the new encoding type.

In the constructor of the "PdfString" class, I set the new flag so that it uses the new encoding "BigEndianUnicodeEncoding".
Then, when you change the document's info (part of the Trailer section) using new strings, the text will be shown properly after the PDF is saved.

Important!
I implemented the code on the project v1.32 since I have found encoding issues after saving pdf files using v1.50 beta 3b.

Update
For a final solution that fully fixes the issue:
in the ScanHexadecimalString method of class Lexer, under the condition:
if(count > 2 && chars[0] == (char)0xFE && chars[1] == (char)0xFF)
the following line should be added at the end of the if block:
return this.symbol = Symbol.BigEndianUnicodeHexString;

Note: The symbol called "BigEndianUnicodeHexString" should be declared.
Another thing is to update the ParseObject method in class Parser:
add the following inside the loop:
case Symbol.BigEndianUnicodeHexString:
this.stack.Shift(new PdfString(this.lexer.Token, PdfStringFlags.BigEndianUnicodeEncoding | PdfStringFlags.HexLiteral));
break;

Note: "PdfStringFlags.BigEndianUnicodeEncoding" should be declared, too.
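
The check used in the condition above can be tried outside PDFsharp. The following is only a sketch under assumptions: the class PdfTextStringSketch and both method names are invented for illustration and are not PDFsharp's Lexer or Parser code. It assumes the raw string holds one char per byte, tests for the FE FF prefix, and decodes the rest as UTF-16BE.

```csharp
using System;
using System.Text;

// Illustration only: hypothetical helper, not PDFsharp's Lexer or Parser.
public static class PdfTextStringSketch
{
    // True if the raw string (one char per byte) starts with the byte order mark FE FF.
    public static bool HasBigEndianBom(string chars)
    {
        return chars.Length >= 2 && chars[0] == (char)0xFE && chars[1] == (char)0xFF;
    }

    // Decodes a raw string: with the FE FF prefix the remaining bytes are read
    // as UTF-16BE, otherwise the string is returned unchanged.
    public static string Decode(string chars)
    {
        if (!HasBigEndianBom(chars))
        {
            return chars;
        }
        var bytes = new byte[chars.Length - 2];
        for (int i = 0; i < bytes.Length; i++)
        {
            bytes[i] = (byte)(chars[i + 2] & 0xFF);
        }
        return Encoding.BigEndianUnicode.GetString(bytes);
    }
}
```

Strings without the prefix pass through untouched, so plain ASCII properties are not affected by the check.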

Update 2:
I applied the fix to the latest version v1.50 beta 4!
"BigEndianUnicode" should be used instead of "Unicode" !


Last edited by ta221 on Tue Aug 01, 2017 2:50 pm, edited 3 times in total.

Posted: Sat Jul 29, 2017 2:46 pm

Joined: Sat Mar 14, 2015 10:15 am
Posts: 287
Location: CCAA
Thomas Hoevel wrote:
The trick was developed for and tested with PDFsharp 1.50.
I made a test today with PDFsharp 1.50 beta 4: Unicode in properties works "out of the box" without requiring the "EncodingHack" or any change to the source code.

ta221 wrote:
I implemented the code on the project v1.32 since I have found encoding issues after saving pdf files using v1.50 beta 3.
PDFsharp 1.50 beta 4 is the supported version.
Dozens of bugs of version 1.32 have been fixed with version 1.50 beta 4, so it is the recommended version.
If you find issues with version 1.50 beta 4 (unmodified), please report them using the Issue Submission Template.

_________________
Best regards
Thomas
(Freelance Software Developer with several years of MigraDoc/PDFsharp experience)


Posted: Sat Jul 29, 2017 3:31 pm

Joined: Sat Jul 29, 2017 11:53 am
Posts: 7
I applied the fix to the latest version v1.50 beta 4!
"BigEndianUnicode" should be used instead of "Unicode" everywhere BigEndianUnicode was found in the PDF document as I described above.

It's working properly now after the fix is applied.
I have tested it on my end carefully.


Last edited by ta221 on Sat Jul 29, 2017 4:39 pm, edited 1 time in total.

Posted: Sat Jul 29, 2017 4:39 pm

Joined: Sat Mar 14, 2015 10:15 am
Posts: 287
Location: CCAA
ta221 wrote:
I applied the fix to the latest version v1.50 beta 4!
Why apply the fix if it works without it?

ta221 wrote:
"BigEndianUnicode" should be used instead of "Unicode" everywhere it was found in the document as I described above.
Why? It works without those changes.

_________________
Best regards
Thomas
(Freelance Software Developer with several years of MigraDoc/PDFsharp experience)


Posted: Sat Jul 29, 2017 5:01 pm

Joined: Sat Jul 29, 2017 11:53 am
Posts: 7
TH-Soft wrote:
Why apply the fix if it works without it?

Why? It works without those changes.


I found encoding errors (gibberish text) in the document properties without applying that fix.


Posted: Sat Jul 29, 2017 5:06 pm

Joined: Sat Mar 14, 2015 10:15 am
Posts: 287
Location: CCAA
ta221 wrote:
I found encoding errors (gibberish text) in the document properties without applying that fix.
Give me examples that allow me to replicate this.

I found no problems with "®" (which is not even Unicode), nor with other test strings I tried.

_________________
Best regards
Thomas
(Freelance Software Developer with several years of MigraDoc/PDFsharp experience)


Posted: Sat Jul 29, 2017 5:29 pm

Joined: Sat Jul 29, 2017 11:53 am
Posts: 7
I sent you an example by PM.
I have tested it on my end.


Posted: Sat Jul 29, 2017 5:37 pm

Joined: Sat Mar 14, 2015 10:15 am
Posts: 287
Location: CCAA
ta221 wrote:
I sent an example on a pm to you.
You sent me a link to a PDF file. I do not get any problems with my test code.

_________________
Best regards
Thomas
(Freelance Software Developer with several years of MigraDoc/PDFsharp experience)


Posted: Sat Jul 29, 2017 5:53 pm

Joined: Sat Mar 14, 2015 10:15 am
Posts: 287
Location: CCAA
vbAddict wrote:
Specifically, I am trying to write Greek text to the document title and author, but all I get is some unreadable characters.
This works now with PDFsharp 1.50 beta 4 "out of the box".
So far I was not able to replicate any problems with PDFsharp 1.50 beta 4 with respect to Unicode in properties, therefore I think no patches or fixes are required.

The Issue Submission Template can be used to report problems with the unmodified PDFsharp 1.50 beta 4.

_________________
Best regards
Thomas
(Freelance Software Developer with several years of MigraDoc/PDFsharp experience)


Posted: Sat Jul 29, 2017 6:08 pm

Joined: Sat Jul 29, 2017 11:53 am
Posts: 7
After applying my change/fix, I tested on my end and verified that there is no need to change how
the document properties are set using document.Info.
I have tested it using security encryption, too.

Furthermore, no special manipulation of the properties is required when using document.Info.
I have tested the setString method, too.

I found this solution to resolve this case for me permanently.


Posted: Sat Jul 29, 2017 6:34 pm

Joined: Sat Mar 14, 2015 10:15 am
Posts: 287
Location: CCAA
The Issue Submission Template can be used to report problems with the unmodified PDFsharp 1.50 beta 4.

_________________
Best regards
Thomas
(Freelance Software Developer with several years of MigraDoc/PDFsharp experience)


Posted: Sun Jul 30, 2017 6:45 pm

Joined: Sat Jul 29, 2017 11:53 am
Posts: 7
TH-Soft wrote:
The Issue Submission Template can be used to report problems with the unmodified PDFsharp 1.50 beta 4.


I reported the problem by PM, with complete runnable code that reproduces the encoding error. The example I sent imports a file, encrypts it with a user password and finally saves the file. Opening the encrypted file shows the result.

I found that this issue occurs only when the PDF file is encrypted, with both 40-bit and 128-bit encryption.
The gibberish text differs slightly between the two encryption levels, but gibberish is shown in both cases.

Applying the fix on my end as I described helped to resolve the encoding error issue when the PDF file is encrypted with a password.

Thank you for listening and I hope my information will be useful.


Posted: Sun Jul 30, 2017 9:02 pm

Joined: Sat Mar 14, 2015 10:15 am
Posts: 287
Location: CCAA
ta221 wrote:
As for finding a final solution which fully fixes the issue -
in the code where ScanHexadecimalString method in class Lexer: under the condition:
if(count > 2 && chars[0] == (char)0xFE && chars[1] == (char)0xFF)
the line should be added at the end of the if block:
return this.symbol = Symbol.BigEndianUnicodeHexString;
The string is a "UnicodeHexString" and I'm afraid it might have unwanted side effects to introduce a new, secondary name "BigEndianUnicodeHexString" for "UnicodeHexString".
I can confirm there is a problem with Unicode properties and password protection. It's not the only problem with password protection.
Having two symbol types for one type of string is potentially harmful.

Bedtime for today. I'll have a closer look one of these days.

ta221 wrote:
The example I sent imports a file, encrypts it with a user password and finally saves the file.
The idea of the Issue Submission Template: users who encounter a problem send us a ZIP file with a complete Visual Studio solution that allows us to replicate the issue, including all PDF files etc. that are needed.
Thanks for the code fragment. I was able to replicate the issue.

_________________
Best regards
Thomas
(Freelance Software Developer with several years of MigraDoc/PDFsharp experience)


Posted: Mon Jul 31, 2017 5:15 am

Joined: Sat Jul 29, 2017 11:53 am
Posts: 7
TH-Soft wrote:
ta221 wrote:
As for finding a final solution which fully fixes the issue -
in the code where ScanHexadecimalString method in class Lexer: under the condition:
if(count > 2 && chars[0] == (char)0xFE && chars[1] == (char)0xFF)
the line should be added at the end of the if block:
return this.symbol = Symbol.BigEndianUnicodeHexString;
The string is a "UnicodeHexString" and I'm afraid it might have unwanted side effects to introduce a new, secondary name "BigEndianUnicodeHexString" for "UnicodeHexString".
I can confirm there is a problem with Unicode properties and password protection. It's not the only problem with password protection.
Having two symbol types for one type of string is potentially harmful.


It's important to say that I named the symbol Symbol.BigEndianUnicodeHexString.
However, I think it would be more correct to name it Symbol.BigEndianUnicodeString, in the same way that Symbol.UnicodeString is named.

I did some tests yesterday including importing a PDF that was saved by PDFsharp previously.
After the various tests (including opening an encrypted PDF which was encrypted by PDFsharp), I found no special issues, at least not with the document's properties.
I found that adding 0xFE and 0xFF before each UnicodeString / UnicodeHexString (each string that meets the condition chars[0] == (char)0xFE && chars[1] == (char)0xFF) is important for Unicode characters to be displayed correctly when reading an encrypted PDF file.
Also, I found this didn't harm non-encrypted files.

However, unforeseen issues may still occur (I really hope that there are no issues).

Update as of 01/08:
After further investigation, I decided the correct approach is to split the cases into Symbol.BigEndianUnicodeHexString and Symbol.BigEndianUnicodeString to ensure correct output.
Furthermore, the new BigEndianUnicodeEncoding class, which is based on the RawUnicodeEncoding class, should use (2 * count + 2), with (+2) and (-2) in the correct places, instead of (+ 8 / - 8).
This is due to the fact that the only prefix "0xFE+0xFF" should be included and there are no more additional bytes beyond these.
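
The sizing rule described above (two bytes per char plus the two prefix bytes) can be sketched as a stand-alone encoding class. This is not the class from the patch and not PDFsharp's RawUnicodeEncoding: the name BigEndianUnicodeEncodingSketch and all implementation details are assumptions made for illustration, built only on System.Text.Encoding.

```csharp
using System;
using System.Text;

// Illustration only: hypothetical class, not PDFsharp code and not the actual patch.
public sealed class BigEndianUnicodeEncodingSketch : Encoding
{
    // Two bytes per char plus the two prefix bytes FE FF.
    public override int GetByteCount(char[] chars, int index, int count)
    {
        return 2 * count + 2;
    }

    public override int GetBytes(char[] chars, int charIndex, int charCount, byte[] bytes, int byteIndex)
    {
        int start = byteIndex;
        // Write the prefix before the loop.
        bytes[byteIndex++] = 0xFE;
        bytes[byteIndex++] = 0xFF;
        for (int i = 0; i < charCount; i++)
        {
            char ch = chars[charIndex + i];
            bytes[byteIndex++] = (byte)(ch >> 8);   // high byte first (big-endian)
            bytes[byteIndex++] = (byte)(ch & 0xFF); // then low byte
        }
        return byteIndex - start;
    }

    // Assumes the input starts with the two prefix bytes, which are skipped.
    public override int GetCharCount(byte[] bytes, int index, int count)
    {
        return Math.Max(0, (count - 2) / 2);
    }

    public override int GetChars(byte[] bytes, int byteIndex, int byteCount, char[] chars, int charIndex)
    {
        int n = Math.Max(0, (byteCount - 2) / 2);
        int src = byteIndex + 2; // skip FE FF
        for (int i = 0; i < n; i++)
        {
            chars[charIndex + i] = (char)((bytes[src] << 8) | bytes[src + 1]);
            src += 2;
        }
        return n;
    }

    public override int GetMaxByteCount(int charCount)
    {
        return 2 * charCount + 2;
    }

    public override int GetMaxCharCount(int byteCount)
    {
        return Math.Max(0, (byteCount - 2) / 2);
    }
}
```

With this sketch, new BigEndianUnicodeEncodingSketch().GetBytes("Aα") returns the six bytes FE FF 00 41 03 B1, and GetString on those bytes returns the original text. The +2 appears when counting bytes and the -2 when counting chars, which matches the (+2) and (-2) mentioned above.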

Thank you for listening.


Posted: Thu Aug 03, 2017 8:56 am

Joined: Sat Mar 14, 2015 10:15 am
Posts: 287
Location: CCAA
There was a TODO in "FormatStringLiteral" and the combination of Unicode and encryption was not handled properly.
I think I got it working now.
I will do some more testing, but I hope my changes will work and can be published soon.

_________________
Best regards
Thomas
(Freelance Software Developer with several years of MigraDoc/PDFsharp experience)

