Hello,
The content of a PDF page is a sequence of bytes that represents graphical commands. These bytes are called the "content stream" of the page. You can get the uncompressed bytes with this code:
Code:
byte[] content = page.Contents.CreateSingleContent().Stream.UnfilteredValue;
A "Hello, World" page may look like this:
Code:
1 0 0 1 0 841.8898 cm
1 0 0 -1 0 0 cm
BT
-100 Tz
/F0 -10 Tf
1 0 0 1 70.8661 80.9199 Tm
-10 TL
(Hello)Tj
/F0 -10 Tf
1 0 0 1 99.4111 80.9199 Tm
(World!)Tj
ET
You can find
Code:
(Hello)
and
Code:
(World!)
as strings.
You should find
Code:
([Customer:xxxxxxxx])
in your PDF file. This is easy to parse. Try PdfSharp Explorer to analyse your PDF.
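To illustrate how simple the search is when the text is written with plain Tj operators, here is a minimal sketch in Python (not PDFsharp code). It assumes the uncompressed content stream is already available as bytes, that strings are simple literals without nested parentheses, and that the font encoding maps bytes directly to Latin-1 characters. The function name and the customer number 12345678 are made up for the example.

```python
import re

# A literal string "( ... )" directly followed by the Tj operator.
# Assumption: escaped characters (backslash + char) are skipped over but not
# decoded; nested parentheses and hex strings <...> are not handled.
TJ_SIMPLE = re.compile(rb"\(((?:\\.|[^\\()])*)\)\s*Tj")

def find_shown_strings(content: bytes) -> list[str]:
    """Return the raw text of every '(...)Tj' operation in a content stream."""
    return [m.group(1).decode("latin-1") for m in TJ_SIMPLE.finditer(content)]

# Hypothetical sample stream, modelled on the "Hello, World" page above.
sample = b"""BT
/F0 -10 Tf
1 0 0 1 70.8661 80.9199 Tm
(Hello)Tj
1 0 0 1 99.4111 80.9199 Tm
(World!)Tj
([Customer:12345678])Tj
ET"""

strings = find_shown_strings(sample)
customers = [s for s in strings if s.startswith("[Customer:")]
```

This only works when each marker is written as one complete string, and that depends on the PDF producer.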
But depending on the PDF producer application, you may find this instead:
Code:
[(H)42(e)32(l)37(l)33(o)]TJ
Here there is kerning information (distance adjustment) between the characters. Adobe Acrobat never creates this if you use a fixed-width font like Courier. Unfortunately, tools like FreePDF always create distance information, even if it is superfluous.
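One way to cope with kerned text is to split the content stream into tokens and join the string pieces of each TJ array while ignoring the numbers between them. The following Python sketch shows the idea under the same assumptions as before: simple literal strings without nested parentheses, and bytes that map directly to Latin-1 characters. The function name and the sample data are invented for illustration. This is not the PDFsharp ContentReader.

```python
import re

# One token: a literal string, an array bracket, or any other run of
# non-space characters (numbers, names, operators).
TOKEN = re.compile(rb"\((?:\\.|[^\\()])*\)|[\[\]]|[^\s\[\]()]+")

def shown_text(content: bytes) -> list[str]:
    """Collect the text of Tj and TJ operations, dropping kerning numbers."""
    shown, pending, in_array = [], [], False
    for match in TOKEN.finditer(content):
        tok = match.group()
        if tok == b"[":
            pending, in_array = [], True      # start of a TJ-style array
        elif tok == b"]":
            in_array = False
        elif tok.startswith(b"("):
            if not in_array:
                pending = []                  # a lone string operand
            pending.append(tok[1:-1])         # strip the parentheses
        elif tok in (b"Tj", b"TJ"):
            shown.append(b"".join(pending).decode("latin-1"))
            pending = []
        elif not in_array:
            pending = []                      # some other operand or operator
        # numbers inside an array are kerning values and are skipped
    return shown

# Hypothetical kerned output, in the style of the example above.
sample = b"[(H)42(e)32(l)37(l)33(o)]TJ\n[([Cus)20(tomer:)-15(12345678])]TJ"
text = shown_text(sample)
```

A large kerning value can stand in for a space between words, so a real solution may have to turn such values into blanks instead of dropping them.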
We at empira currently have the same problem: identifying address information in PDF files and splitting them into single files. We recommend using Adobe Acrobat as the producer and Courier New as the font for the information text.
Furthermore, I wrote the class PdfSharp.Pdf.Content.ContentReader to convert a content stream into a sequence of operations (it is in the current source code). Maybe this reader helps you to find your text.
I will publish our solution once we have one (we are currently working on other things).
Regards
Stefan Lange