I wrote the following function to read the text out of a PDF file using PDFsharp. It is pretty close, but I'm not familiar enough with the content-stream operators (op codes) to get the line breaks right. For example, I currently insert a new line whenever I see "ET", but that doesn't seem quite right, since "ET" may just mark the end of a text run in the middle of a line.
Code:
Imports System.Text
Imports PdfSharp.Pdf
Imports PdfSharp.Pdf.Content
Imports PdfSharp.Pdf.Content.Objects
Imports PdfSharp.Pdf.IO

Public Function ReadPDFFile(filePath As String,
                            Optional maxLength As Integer = 0) As String
    Dim sbContents As New StringBuilder
    ' Cache the PDFsharp content object types that ReadObject dispatches on.
    Dim cArrayType As Type = GetType(CArray)
    Dim cCommentType As Type = GetType(CComment)
    Dim cIntegerType As Type = GetType(CInteger)
    Dim cNameType As Type = GetType(CName)
    Dim cNumberType As Type = GetType(CNumber)
    Dim cOperatorType As Type = GetType(COperator)
    Dim cRealType As Type = GetType(CReal)
    Dim cSequenceType As Type = GetType(CSequence)
    Dim cStringType As Type = GetType(CString)
    Dim opCodeNameType As Type = GetType(OpCodeName)
    ' Declare the delegate first so the lambda can call itself recursively
    ' without a "used before it has been assigned" warning.
    Dim ReadObject As Action(Of CObject) = Nothing
    ReadObject = Sub(obj As CObject)
        Dim objType As Type = obj.GetType
        Select Case objType
            Case cArrayType
                ' Arrays (e.g. the TJ operand) mix strings and numbers.
                Dim arrObj As CArray = DirectCast(obj, CArray)
                For Each member As CObject In arrObj
                    ReadObject(member)
                Next
            Case cOperatorType
                Dim opObj As COperator = DirectCast(obj, COperator)
                ' Compare by name: some operators differ only by case (Td/TD).
                Select Case System.Enum.GetName(opCodeNameType, opObj.OpCode.OpCodeName)
                    Case "ET", "Tx"
                        ' ET ends a text object; "Tx" is the enum name for T*.
                        sbContents.Append(vbNewLine)
                    Case "Tj", "TJ"
                        ' Show text.
                        For Each operand As CObject In opObj.Operands
                            ReadObject(operand)
                        Next
                    Case "QuoteSingle", "QuoteDbl"
                        ' The ' and " operators move to the next line, then show text.
                        sbContents.Append(vbNewLine)
                        For Each operand As CObject In opObj.Operands
                            ReadObject(operand)
                        Next
                    Case Else
                        'Do Nothing
                End Select
            Case cSequenceType
                Dim seqObj As CSequence = DirectCast(obj, CSequence)
                For Each member As CObject In seqObj
                    ReadObject(member)
                Next
            Case cStringType
                sbContents.Append(DirectCast(obj, CString).Value)
            Case cCommentType, cIntegerType, cNameType, cNumberType, cRealType
                'Do Nothing
            Case Else
                Throw New NotImplementedException(obj.GetType().AssemblyQualifiedName)
        End Select
    End Sub
    Using pd As PdfDocument = PdfReader.Open(filePath, PdfDocumentOpenMode.ReadOnly)
        For Each page As PdfPage In pd.Pages
            ReadObject(ContentReader.ReadContent(page))
            If maxLength > 0 AndAlso sbContents.Length >= maxLength Then
                ' Truncate to exactly maxLength characters, keeping the first ones.
                sbContents.Length = maxLength
                Exit For
            End If
            sbContents.Append(vbNewLine)
        Next
    End Using
    Return sbContents.ToString
End Function
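
A minimal sketch of one possible direction for the line-break problem, not a tested solution: instead of breaking on "ET", track the vertical text position and break only when it has changed by the time text is shown. The LineTracker class, its Process method, and the Tolerance value are hypothetical names and numbers invented for illustration (they are not part of PDFsharp). The sketch assumes an unrotated, unscaled text matrix, ignores "cm" transforms, and relies on the default case-sensitive string comparison (Option Compare Binary) to tell "Td" from "TD".

```vbnet
Imports System

' Hypothetical helper, not part of PDFsharp.
Public Class LineTracker
    Private Const Tolerance As Double = 0.5  ' assumed threshold, in text-space units
    Private currentY As Double = 0           ' vertical position of the current text line
    Private leading As Double = 0            ' line leading set by TL or TD
    Private lastShownY As Double? = Nothing  ' position of the last text that was shown

    ' Feed every operator in stream order, with its numeric operands.
    ' Returns True when a line break should be emitted before the text
    ' of this operator is appended.
    Public Function Process(opName As String, ParamArray nums As Double()) As Boolean
        Select Case opName
            Case "BT"
                ' A new text object resets the text matrix.
                currentY = 0
            Case "TL"
                If nums.Length >= 1 Then leading = nums(0)
            Case "Td"
                ' Operands are tx ty.
                If nums.Length >= 2 Then currentY += nums(1)
            Case "TD"
                ' Like Td, but also sets the leading to -ty.
                If nums.Length >= 2 Then
                    leading = -nums(1)
                    currentY += nums(1)
                End If
            Case "Tm"
                ' Operands are a b c d e f; f is the vertical translation.
                If nums.Length >= 6 Then currentY = nums(5)
            Case "Tx"
                ' T*: move to the next line.
                currentY -= leading
            Case "QuoteSingle", "QuoteDbl"
                ' Move to the next line, then show text.
                currentY -= leading
                Return ShowText()
            Case "Tj", "TJ"
                Return ShowText()
        End Select
        ' ET and all other operators never trigger a break by themselves.
        Return False
    End Function

    Private Function ShowText() As Boolean
        Dim isBreak As Boolean = lastShownY.HasValue AndAlso
                                 Math.Abs(currentY - lastShownY.Value) > Tolerance
        lastShownY = currentY
        Return isBreak
    End Function
End Class

Module Demo
    Sub Main()
        Dim t As New LineTracker()
        t.Process("BT") : t.Process("Td", 72, 700)
        Console.WriteLine(t.Process("Tj"))  ' False: first text on the page
        t.Process("ET")
        t.Process("BT") : t.Process("Td", 130, 700)
        Console.WriteLine(t.Process("Tj"))  ' False: same height, so same line
        t.Process("ET")
        t.Process("BT") : t.Process("Td", 72, 686)
        Console.WriteLine(t.Process("Tj"))  ' True: the text moved down
    End Sub
End Module
```

To wire something like this into ReadPDFFile, the numeric operands would have to be pulled out of opObj.Operands (the CInteger and CReal values) and passed along with the operator name. Real documents with scaled or rotated text would need the full text matrix rather than a single y value.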
My two questions:
1. Do you have any suggestions for improving this function, especially for deciding where line breaks belong?
2. Is there some kind of PDF structure viewer I can download? It is very difficult for me to understand the rules that make up a PDF without some way to visualize its objects and operators.
Thanks very much!