PDFsharp & MigraDoc Foundation

PDFsharp - A .NET library for processing PDF & MigraDoc Foundation - Creating documents on the fly

All times are UTC




 Post subject: Search text in a PDF
Posted: Wed Oct 20, 2010 3:43 pm

Joined: Wed Oct 20, 2010 3:36 pm
Posts: 7
Hi,
Is it possible to search some text in a PDF?
Thanks


 Post subject: Re: Search text in a PDF
Posted: Thu Oct 28, 2010 8:01 am

Joined: Wed Oct 20, 2010 3:36 pm
Posts: 7
There are no answers so I assume that it is not possible.


 Post subject: Re: Search text in a PDF
Posted: Thu Oct 28, 2010 9:01 am
empira Employee

Joined: Mon Oct 16, 2006 8:16 am
Posts: 2196
Location: Cologne, Germany
ldd wrote:
Is it possible to search some text in a PDF?

Yes, it's possible.

But you must extract the text yourself; there is no Search function in PDFsharp.

Improving the functions for text extraction from PDF files is on our wish list.

_________________
Regards
Thomas Hoevel
PDFsharp Team


 Post subject: Re: Search text in a PDF
Posted: Thu Oct 28, 2010 1:12 pm

Joined: Wed Oct 20, 2010 3:36 pm
Posts: 7
Thank you


 Post subject: Re: Search text in a PDF
Posted: Tue Nov 02, 2010 2:51 pm

Joined: Mon Nov 01, 2010 5:38 pm
Posts: 2
Right now we use PDFBox to extract text. It has a class called PDFTextStripperByArea: you give it a search area and it returns the text from that area. Does PDFsharp have something similar that I could use to get the text myself?
Here is part of our PDFBox code:
Code:
public static String extractSSN(final PDPage page) throws IOException
   {
      // Stripper object.
      final PDFTextStripperByArea stripper = new PDFTextStripperByArea();
      stripper.setSortByPosition(true);

      // Set the area to search on the PDF.
      final Rectangle searchArea = new Rectangle(0, 130, 100, 10);
      stripper.addRegion("ssn", searchArea);

      // Extract the text from the area, then pluck the ssn from it.
      stripper.extractRegions(page);
      String text = stripper.getTextForRegion("ssn").replaceAll("c", "%");
      text = URLDecoder.decode(text, "UTF-8");

      // Return the portion of the string we need.
      String output = "";

      try
      {
         output = text.substring(18, 29);
      }
      catch (final Exception ex)
      {
         output = "EMPTY";
      }

      return output;
   }


 Post subject: Re: Search text in a PDF
Posted: Fri Nov 26, 2010 5:52 am

Joined: Fri Nov 26, 2010 5:39 am
Posts: 1
Location: South Africa
Thomas Hoevel wrote:
Yes, it's possible.

But you must extract the text yourself; there is no Search function in PDFsharp.



Hi Thomas.

Could you please give us a few pointers on how to search for text in PDF documents? I would be most grateful for some direction on how to read the text in PDF files using the PDFsharp library.

How would I extract the text myself?

Many thanks in anticipation.

Regards
Craig Gers
Code:
using System;
using System.Diagnostics;
using System.IO;
using PdfSharp.Drawing;
using PdfSharp.Pdf;
using PdfSharp.Pdf.IO;


namespace SearchAndReplace
{
    class Program
    {
        static void Main(string[] args)
        {

            // Open an existing document so its pages can be read
            PdfDocument document = PdfReader.Open("HelloWorld.pdf");

            int PageCount = document.Pages.Count;
            PdfPage page = null;

            for (int j = 0; j < PageCount; j++)
            {
                page = document.Pages[j];
                // What action do I need to perform on a page to get the text?
            }

        }
    }
}

_________________
Craig Gers
Sandbox Projects
South Africa


 Post subject: Re: Search text in a PDF
Posted: Mon Nov 29, 2010 6:25 am

Joined: Mon Nov 29, 2010 6:09 am
Posts: 2
Hi Craig,

I have just started using PDFsharp, so I'm learning as well.

I wrote this method to extract text from a PDF; maybe it will help you. It uses a class called PDFTextExtractor, which I found online. I have attached the class to this post.


Code:
public PDFParser()
{
    // Dispose the writer even if extraction fails part-way through.
    using (var streamWriter = new StreamWriter("output.txt", false))
    {
        try
        {
            PdfDocument inputDocument = PdfReader.Open("input.pdf", PdfDocumentOpenMode.ReadOnly);

            foreach (PdfPage page in inputDocument.Pages)
            {
                // A page can have several content streams; extract each in turn.
                for (int index = 0; index < page.Contents.Elements.Count; index++)
                {
                    PdfDictionary.PdfStream stream = page.Contents.Elements.GetDictionary(index).Stream;
                    string outputText = new PDFTextExtractor().ExtractTextFromPDFBytes(stream.Value);

                    streamWriter.WriteLine(outputText);
                }
            }
        }
        catch (Exception e)
        {
            // Report the failure instead of silently swallowing it.
            Console.Error.WriteLine(e);
        }
    }
}



This worked on the PDFs I had to test with. If you need to search the text, you can look into regular expressions or use the methods of the string class in .NET.
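
For the searching part, a minimal sketch of both options might look like the following. It assumes the text has already been extracted into a plain string; the names TextSearchSketch, FindAll and FindPattern are made up for illustration and are not part of PDFsharp.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of PDFsharp: searches text that was already extracted.
public static class TextSearchSketch
{
    // Returns the zero-based offsets of every case-insensitive occurrence of term.
    public static List<int> FindAll(string text, string term)
    {
        var offsets = new List<int>();
        if (string.IsNullOrEmpty(text) || string.IsNullOrEmpty(term)) return offsets;

        int index = text.IndexOf(term, StringComparison.OrdinalIgnoreCase);
        while (index >= 0)
        {
            offsets.Add(index);
            // Continue searching after the end of the current hit.
            index = text.IndexOf(term, index + term.Length, StringComparison.OrdinalIgnoreCase);
        }
        return offsets;
    }

    // Returns every substring that matches the regular expression pattern.
    public static List<string> FindPattern(string text, string pattern)
    {
        var matches = new List<string>();
        foreach (Match match in Regex.Matches(text ?? "", pattern))
        {
            matches.Add(match.Value);
        }
        return matches;
    }
}
```

For example, FindAll("Hello world, hello PDF", "hello") yields the offsets 0 and 13, and FindPattern(text, @"\d{3}-\d{2}-\d{4}") picks out anything shaped like 123-45-6789. Keep in mind that the extracted text may contain extra spaces or line breaks between words, so an exact phrase search can miss hits that a more tolerant pattern would find.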


Good luck,
Filip


 Post subject: Re: Search text in a PDF
Posted: Mon Nov 29, 2010 6:27 am

Joined: Mon Nov 29, 2010 6:09 am
Posts: 2
OK, I don't think my attachment worked. Here is the class PDFTextExtractor:
Code:
public class PDFTextExtractor
{
    /// BT = Beginning of a text object operator
    /// ET = End of a text object operator
    /// Td move to the start of next line
    ///  5 Ts = superscript
    /// -5 Ts = subscript

    #region Fields

    #region _numberOfCharsToKeep
    /// <summary>
    /// The number of characters to keep, when extracting text.
    /// </summary>
    private static int _numberOfCharsToKeep = 15;
    #endregion

    #endregion



    #region ExtractTextFromPDFBytes
    /// <summary>
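    /// This method processes an uncompressed PDF content stream
    /// and extracts the text it shows.
    /// </summary>
    /// <param name="input">the uncompressed content stream bytes</param>
    /// <returns>the extracted text, or an empty string if nothing could be extracted</returns>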
    public string ExtractTextFromPDFBytes(byte[] input)
    {
        if (input == null || input.Length == 0) return "";

        try
        {
            string resultString = "";

            // Flag showing whether we are currently inside a text object
            bool inTextObject = false;

            // Flag showing if the next character is literal
            // e.g. '\\' to get a '\' character or '\(' to get '('
            bool nextLiteral = false;

            // () Bracket nesting level. Text appears inside ()
            int bracketDepth = 0;

            // Keep the most recent characters so that operator tokens can be recognized:
            char[] previousCharacters = new char[_numberOfCharsToKeep];
            for (int j = 0; j < _numberOfCharsToKeep; j++) previousCharacters[j] = ' ';


            for (int i = 0; i < input.Length; i++)
            {
                char c = (char)input[i];

                if (inTextObject)
                {
                    // Position the text
                    if (bracketDepth == 0)
                    {
                        if (CheckToken(new string[] { "TD", "Td" }, previousCharacters))
                        {
                            resultString += "\r\n";
                        }
                        else
                        {
                            if (CheckToken(new string[] { "'", "T*", "\"" }, previousCharacters))
                            {
                                resultString += "\n";
                            }
                            else
                            {
                                if (CheckToken(new string[] { "Tj" }, previousCharacters))
                                {
                                    resultString += " ";
                                }
                            }
                        }
                    }

                    // End of a text object: separate it from the next one with a space.
                    if (bracketDepth == 0 &&
                        CheckToken(new string[] { "ET" }, previousCharacters))
                    {

                        inTextObject = false;
                        resultString += " ";
                    }
                    else
                    {
                        // Start outputting text
                        if ((c == '(') && (bracketDepth == 0) && (!nextLiteral))
                        {
                            bracketDepth = 1;
                        }
                        else
                        {
                            // Stop outputting text
                            if ((c == ')') && (bracketDepth == 1) && (!nextLiteral))
                            {
                                bracketDepth = 0;
                            }
                            else
                            {
                                // Just a normal text character:
                                if (bracketDepth == 1)
                                {
                                    // Only print out next character no matter what.
                                    // Do not interpret.
                                    if (c == '\\' && !nextLiteral)
                                    {
                                        nextLiteral = true;
                                    }
                                    else
                                    {
                                        if (((c >= ' ') && (c <= '~')) ||
                                            ((c >= 128) && (c < 255)))
                                        {
                                            resultString += c.ToString();
                                        }

                                        nextLiteral = false;
                                    }
                                }
                            }
                        }
                    }
                }

                // Store the recent characters so that
                // CheckToken can look back at them
                for (int j = 0; j < _numberOfCharsToKeep - 1; j++)
                {
                    previousCharacters[j] = previousCharacters[j + 1];
                }
                previousCharacters[_numberOfCharsToKeep - 1] = c;

                // Start of a text object
                if (!inTextObject && CheckToken(new string[] { "BT" }, previousCharacters))
                {
                    inTextObject = true;
                }
            }
            return resultString;
        }
        catch
        {
            return "";
        }
    }
    #endregion

    #region CheckToken
    /// <summary>
    /// Check whether one of the given 2-character tokens (e.g. BT),
    /// delimited by whitespace on both sides, has just been read
    /// </summary>
    /// <param name="tokens">the tokens to look for</param>
    /// <param name="recent">the recent character array</param>
    /// <returns>true if one of the tokens was found</returns>
    private bool CheckToken(string[] tokens, char[] recent)
    {
        foreach (string token in tokens)
        {
            if (token.Length > 1)
            {
                if ((recent[_numberOfCharsToKeep - 3] == token[0]) &&
                    (recent[_numberOfCharsToKeep - 2] == token[1]) &&
                    ((recent[_numberOfCharsToKeep - 1] == ' ') ||
                    (recent[_numberOfCharsToKeep - 1] == 0x0d) ||
                    (recent[_numberOfCharsToKeep - 1] == 0x0a)) &&
                    ((recent[_numberOfCharsToKeep - 4] == ' ') ||
                    (recent[_numberOfCharsToKeep - 4] == 0x0d) ||
                    (recent[_numberOfCharsToKeep - 4] == 0x0a))
                    )
                {
                    return true;
                }
            }
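            // Single-character tokens (' and ") are not matched here. Skip them
            // instead of returning early, so the remaining tokens are still checked.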

        }
        return false;
    }
    #endregion
}
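
To make the core idea of the class easier to follow, here is a much smaller sketch of the same technique: walk the content stream character by character and keep only what sits inside parentheses. It is a simplification, not a replacement. It assumes the stream is already uncompressed and decoded to one char per byte, it ignores the BT/ET and positioning operators, it copies the character after a backslash verbatim (so escapes such as \n or octal codes are not translated), and it does not handle hex strings written as <...>. The names ContentStreamSketch and ExtractLiteralStrings are made up for illustration.

```csharp
using System.Text;

// Hypothetical helper, not part of PDFsharp or of the class above.
public static class ContentStreamSketch
{
    // Collects the text of every literal string "( ... )" in an uncompressed content stream.
    public static string ExtractLiteralStrings(string content)
    {
        var text = new StringBuilder();
        int depth = 0;        // current nesting level of parentheses
        bool escaped = false; // true when the previous character was a backslash

        foreach (char c in content ?? "")
        {
            if (depth == 0)
            {
                // Outside a string: ignore operators and operands until "(" opens one.
                if (c == '(') depth = 1;
            }
            else if (escaped)
            {
                // Keep the escaped character as-is, e.g. "\(" becomes "(".
                text.Append(c);
                escaped = false;
            }
            else if (c == '\\')
            {
                escaped = true;
            }
            else if (c == '(')
            {
                // Balanced, unescaped parentheses are part of the string.
                depth++;
                text.Append(c);
            }
            else if (c == ')')
            {
                depth--;
                // A nested ")" belongs to the text; the outermost one ends the string.
                text.Append(depth > 0 ? ')' : ' ');
            }
            else
            {
                text.Append(c);
            }
        }
        return text.ToString().TrimEnd();
    }
}
```

Fed the hand-written stream "BT /F1 12 Tf 72 720 Td (Hello) Tj (World) Tj ET", this returns "Hello World". A stream that shows its text only as hex strings, for example "<0048> Tj", returns an empty string with this sketch.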


 Post subject: Re: Search text in a PDF
Posted: Fri Jan 28, 2011 4:24 pm

Joined: Fri Jan 28, 2011 4:20 pm
Posts: 1
I created a PDF file with OpenOffice and then tried your class, but it returns only newlines or carriage returns.
Do you know why?
Thanks in advance


 Post subject: Re: Search text in a PDF
Posted: Fri Feb 18, 2011 6:19 pm

Joined: Fri Feb 18, 2011 6:16 pm
Posts: 1
I've had the same problem here - I created a PDF in OpenOffice with some sample text, but only get empty strings back. Were you able to make any progress with this?

Thanks!


 Post subject: Re: Search text in a PDF
Posted: Sun Mar 20, 2011 4:58 pm

Joined: Sun Mar 20, 2011 4:43 pm
Posts: 1
Here's what I did. With this solution, all I was interested in was removing pages that did not contain a user-specified string. I did not care where the string was placed, only whether it existed.

Code:
        private void RemoveUnReferencedPages(PdfDocument document, string referenceString)
        {
            // this procedure removes any pages from the pdf document that do not contain
            // the reference string

            int pageNo = -1;
            string strStreamValue;
            byte[] streamValue;
            bool[] keepPageArray;
            keepPageArray = new bool[document.Pages.Count];

            // iterate through the pages
            foreach (PdfPage page in document.Pages)
            {
                pageNo++;
                strStreamValue = "";

                // put the stream value for every element on the page in a string variable.
                for (int i = 0; i < page.Contents.Elements.Count; i++)
                {
                    PdfDictionary.PdfStream stream = page.Contents.Elements.GetDictionary(i).Stream;
                    streamValue = stream.Value;
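                    // Decode the bytes one-to-one (Latin-1), which is equivalent to
                    // casting each byte to char but avoids per-character concatenation.
                    strStreamValue += System.Text.Encoding.GetEncoding("ISO-8859-1").GetString(streamValue);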
                }
                // flag those pages that contain the reference value
                keepPageArray[pageNo] = strStreamValue.Contains(referenceString);
            }

            // Now, remove the pages we identified.  We're doing this in reverse order
            // because the deletion of an earlier page moves the rest of the pages up
            // one page.  This keeps us from deleting the wrong pages.
            for (int i = keepPageArray.Length - 1; i > -1; i--)
            {
                if (!keepPageArray[i])
                {
                    PdfPage deletePage = document.Pages[i];
                    document.Pages.Remove(deletePage);
                }
            }
        }
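
The reverse-order removal is the part that is easy to get wrong, so here is a small stand-alone sketch of just that step. It works on any list rather than on a PdfDocument, and the names PageFilterSketch and RemoveUnflagged are made up for illustration.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of PDFsharp: shows index-based removal from the back.
public static class PageFilterSketch
{
    // Removes every item whose keep flag is false. Walking from the last index
    // to the first means a removal never shifts an index that is still to be visited.
    public static void RemoveUnflagged<T>(IList<T> items, bool[] keep)
    {
        if (items == null) throw new ArgumentNullException(nameof(items));
        if (keep == null) throw new ArgumentNullException(nameof(keep));
        if (items.Count != keep.Length)
            throw new ArgumentException("One keep flag is required per item.", nameof(keep));

        for (int i = keep.Length - 1; i >= 0; i--)
        {
            if (!keep[i]) items.RemoveAt(i);
        }
    }
}
```

With the items p1, p2, p3, p4 and the flags true, false, false, true, this leaves p1 and p4. Running the same loop forwards would remove p2, then remove whatever had shifted into index 2 (p4), and end up with p1 and p3.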

