PDFsharp & MigraDoc Foundation

PDFsharp - A .NET library for processing PDF & MigraDoc Foundation - Creating documents on the fly
All times are UTC

 Post subject: PDF to Text
Posted: Mon Feb 17, 2014 7:38 pm

Joined: Fri Feb 14, 2014 3:28 pm
Posts: 9
I wrote the following function to extract the text from a PDF file. It is pretty close, but I'm not familiar enough with the content-stream operators to get the line breaks right. For example, I currently insert a new line whenever I see "ET" (end of a text object), but that doesn't seem quite right, since it may just be the end of a text run in the middle of a line.

Code:
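    ' Imports go at the top of the source file, outside the class or module.
    Imports System
    Imports System.Text
    Imports PdfSharp.Pdf
    Imports PdfSharp.Pdf.IO
    Imports PdfSharp.Pdf.Content
    Imports PdfSharp.Pdf.Content.Objects

    ' Extracts the text of every page; maxLength = 0 means no length limit.
    Public Function ReadPDFFile(filePath As String,
                                Optional maxLength As Integer = 0) As String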

        Dim sbContents As New StringBuilder

        Dim cArrayType As Type = GetType(CArray)
        Dim cCommentType As Type = GetType(CComment)
        Dim cIntegerType As Type = GetType(CInteger)
        Dim cNameType As Type = GetType(CName)
        Dim cNumberType As Type = GetType(CNumber)
        Dim cOperatorType As Type = GetType(COperator)
        Dim cRealType As Type = GetType(CReal)
        Dim cSequenceType As Type = GetType(CSequence)
        Dim cStringType As Type = GetType(CString)
        Dim opCodeNameType As Type = GetType(OpCodeName)

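        ' Declared before it is assigned so the lambda can call itself recursively
        ' without a "used before it has been assigned" warning.
        Dim ReadObject As Action(Of CObject) = Nothing
        ReadObject = Sub(obj As CObject)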

                                                   Dim objType As Type = obj.GetType

                                                   Select Case objType
                                                       Case cArrayType
                                                           Dim arrObj As CArray = DirectCast(obj, CArray)
                                                           For Each member As CObject In arrObj
                                                               ReadObject(member)
                                                           Next
                                                       Case cOperatorType
                                                           Dim opObj As COperator = DirectCast(obj, COperator)
                                                           Select Case System.Enum.GetName(opCodeNameType, opObj.OpCode.OpCodeName)
                                                               Case "ET", "Tx"
                                                                   ' ET ends a text object; Tx is the enum name for the T* (next line) operator.
                                                                   sbContents.Append(vbNewLine)
                                                               Case "Tj", "TJ"
                                                                   For Each operand As CObject In opObj.Operands
                                                                       ReadObject(operand)
                                                                   Next
                                                               Case "QuoteSingle", "QuoteDbl"
                                                                   ' The ' and " operators move to the next line and then show their string operand.
                                                                   sbContents.Append(vbNewLine)
                                                                   For Each operand As CObject In opObj.Operands
                                                                       ReadObject(operand)
                                                                   Next
                                                               Case Else
                                                                   'Do Nothing
                                                           End Select
                                                       Case cSequenceType
                                                           Dim seqObj As CSequence = DirectCast(obj, CSequence)
                                                           For Each member As CObject In seqObj
                                                               ReadObject(member)
                                                           Next
                                                       Case cStringType
                                                           sbContents.Append(DirectCast(obj, CString).Value)
                                                       Case cCommentType, cIntegerType, cNameType, cNumberType, cRealType
                                                           'Do Nothing
                                                       Case Else
                                                           Throw New NotImplementedException(obj.GetType().AssemblyQualifiedName)
                                                   End Select

                                               End Sub

        Using pd As PdfDocument = PdfReader.Open(filePath, PdfDocumentOpenMode.ReadOnly)

            For Each page As PdfPage In pd.Pages

                ReadObject(ContentReader.ReadContent(page))

                ' Stop once the limit is reached and cut off anything beyond maxLength.
                If maxLength > 0 AndAlso sbContents.Length >= maxLength Then
                    sbContents.Length = maxLength
                    Exit For
                End If

                sbContents.Append(vbNewLine)

            Next

        End Using

        Return sbContents.ToString

    End Function
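
On the line-break concern: in the PDF text model, "ET" only closes a text object, while the operators that actually move to another line are T*, ' and ", plus Td/TD when their vertical offset is non-zero. The following is a minimal sketch of that rule, not a tested replacement for the function above. StartsNewLine is a hypothetical helper, it assumes the operator names arrive as the strings System.Enum.GetName produces ("Tx", "QuoteSingle", "QuoteDbl", and presumably "Td"/"TD"), and it assumes the numeric operands have already been converted to Double. It ignores Tm, which sets an absolute position and would need the previous position for comparison.

```vbnet
Option Strict On
Option Compare Binary

Imports System

Module LineBreakSketch

    ' Hypothetical helper (not part of PDFsharp): reports whether a text
    ' operator moves the text position to a different line.
    ' opName is the enum member name as a string; operands holds the
    ' operator's numeric operands in content-stream order.
    Function StartsNewLine(opName As String, operands As Double()) As Boolean
        Select Case opName
            Case "Tx", "QuoteSingle", "QuoteDbl"
                ' T*, ' and " always move to the start of the next line.
                Return True
            Case "Td", "TD"
                ' Operands are tx ty; a zero ty stays on the current line.
                Return operands.Length = 2 AndAlso operands(1) <> 0.0
            Case Else
                ' BT, ET, Tj, TJ and the rest do not imply a line break here.
                Return False
        End Select
    End Function

    Sub Main()
        Console.WriteLine(StartsNewLine("Td", {120.0, 0.0}))    ' False: horizontal move only
        Console.WriteLine(StartsNewLine("Td", {0.0, -14.0}))    ' True: vertical move
        Console.WriteLine(StartsNewLine("ET", New Double() {})) ' False: end of text object
    End Sub

End Module
```

With a check like this, a new line would be appended only when StartsNewLine returns True, and "ET" would no longer force one.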



My 2 questions:

1. Would you mind taking a moment to make any suggestions?
2. Is there some kind of PDF content viewer I can download? It is very difficult for me to understand the rules that make up a PDF without some kind of tag visualization.


Thanks very much!


 Post subject: Re: PDF to Text
Posted: Mon Feb 17, 2014 7:40 pm

Joined: Fri Feb 14, 2014 3:28 pm
Posts: 9
Also, if you are curious, the reason I am doing:

Code:
Select Case System.Enum.GetName(opCodeNameType, opObj.OpCode.OpCodeName)


instead of just:

Code:
Select Case opObj.OpCode.OpCodeName


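is because VB identifiers are case-insensitive, so the enum members Tj and TJ look like the same thing to the compiler. Comparing their names as strings keeps them apart.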
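
A small self-contained check of that point, as a sketch: string comparisons in Select Case are case-sensitive as long as Option Compare Binary is in effect (the default unless the project sets Option Compare Text), so "Tj" and "TJ" land in different branches. Classify and its return strings are hypothetical names used only for this illustration.

```vbnet
Option Strict On
Option Compare Binary

Imports System

Module CaseSensitivitySketch

    ' Hypothetical helper: maps an operator name to a short description.
    Function Classify(opName As String) As String
        Select Case opName
            Case "Tj"
                Return "show text string"
            Case "TJ"
                Return "show text array"
            Case Else
                Return "other"
        End Select
    End Function

    Sub Main()
        Console.WriteLine(Classify("Tj")) ' show text string
        Console.WriteLine(Classify("TJ")) ' show text array
    End Sub

End Module
```

If the project were compiled with Option Compare Text, both calls would hit the first branch, so it may be worth stating Option Compare Binary explicitly in the file that contains the Select Case.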

