PDFsharp & MigraDoc Foundation
https://forum.pdfsharp.net/

Reading PDF contents?
https://forum.pdfsharp.net/viewtopic.php?f=2&t=452

Author:  Megidolaon [ Tue Aug 19, 2008 10:54 am ]
Post subject:  Reading PDF contents?

Hello, I've just started using PDFsharp and I was wondering how to read the contents of a PDF.

I tried looping through the Pages.Elements property of the PdfDocument class, but I get an error saying it cannot convert from DictionaryEntry to type DictionaryElements.

Alternatively, I tried using the PdfContent class returned by the CreateSingleContent method of a PdfPage, but all I get is a handful of cryptic values (something like "7 0 R", "120 B" or similar) as the entire content of a PDF that contains text and a table with at least 50 values.

Also, is there a difference between reading normal text and the contents of a table?

Thanks in advance.

Author:  gkataria [ Wed Aug 20, 2008 11:26 am ]
Post subject: 

I was able to get the images of a page with the code below, but I still cannot find the text.

Put the code below in any click event handler:

// Requires: using PdfSharp.Pdf; using PdfSharp.Pdf.IO; using PdfSharp.Pdf.Advanced;
PdfDocument document = PdfReader.Open("C:\\HelloWorld.pdf", PdfDocumentOpenMode.ReadOnly);

int imageCount = 0;
// Iterate pages
foreach (PdfPage page in document.Pages)
{
    // Get resources dictionary
    PdfDictionary resources = page.Elements.GetDictionary("/Resources");
    if (resources != null)
    {
        // Get external objects dictionary
        PdfDictionary xObjects = resources.Elements.GetDictionary("/XObject");
        if (xObjects != null)
        {
            PdfItem[] items = xObjects.Elements.Values;
            // Iterate references to external objects
            foreach (PdfItem item in items)
            {
                PdfReference reference = item as PdfReference;
                if (reference != null)
                {
                    PdfDictionary xObject = reference.Value as PdfDictionary;
                    // Is external object an image?
                    if (xObject != null && xObject.Elements.GetString("/Subtype") == "/Image")
                    {
                        imageCount++;
                        ExportImage(xObject, imageCount);
                    }
                }
            }
        }
    }
}


The following functions are used:

/// <summary>
/// Currently extracts only JPEG images.
/// </summary>
static void ExportImage(PdfDictionary image, int count)
{
    // The /Filter entry names the compression used for the image stream.
    string filter = image.Elements.GetName("/Filter");
    switch (filter)
    {
        case "/DCTDecode":
            ExportJpegImage(image, count);
            break;

        case "/FlateDecode":
            // ExportAsPngImage is not included in this post.
            ExportAsPngImage(image, count);
            break;
    }
}

/// <summary>
/// Exports a JPEG image.
/// </summary>
static void ExportJpegImage(PdfDictionary image, int count)
{
    // Fortunately JPEG has native support in PDF and exporting an image is just writing the stream to a file.
    byte[] stream = image.Stream.Value;

    // Requires: using System.IO;
    File.WriteAllBytes("C:\\poc_image_" + count.ToString() + ".jpeg", stream);
}

Author:  blackjack2150 [ Thu Aug 21, 2008 7:41 am ]
Post subject: 

Hi. For text extraction you can use the PDFBox library. For .NET you also have to add a reference to IKVM in your project. An easy solution is to use Text Mining Tool (which uses PDFBox). Just google it.

Author:  gkataria [ Tue Aug 26, 2008 12:56 pm ]
Post subject: 

But I actually need to find the position of each text and image object as well.

All times are UTC