PDFsharp & MigraDoc Foundation https://forum.pdfsharp.net/ |
PDF to Text https://forum.pdfsharp.net/viewtopic.php?f=2&t=2740 |
Author: | maxmoore14 [ Mon Feb 17, 2014 7:38 pm ] |
Post subject: | PDF to Text |
I wrote the following function to read the text out of a PDF file. It is pretty close, but I'm not familiar enough with all the op codes to get the line spacing right. For example, I'm currently inserting a new line when I see "ET", but that doesn't seem quite right, since it may just be the end of a text run in the middle of a line.

Code:
Imports System.Text
Imports PdfSharp.Pdf
Imports PdfSharp.Pdf.Content
Imports PdfSharp.Pdf.Content.Objects
Imports PdfSharp.Pdf.IO

Public Function ReadPDFFile(filePath As String, Optional maxLength As Integer = 0) As String
    Dim sbContents As New StringBuilder

    Dim cArrayType As Type = GetType(CArray)
    Dim cCommentType As Type = GetType(CComment)
    Dim cIntegerType As Type = GetType(CInteger)
    Dim cNameType As Type = GetType(CName)
    Dim cNumberType As Type = GetType(CNumber)
    Dim cOperatorType As Type = GetType(COperator)
    Dim cRealType As Type = GetType(CReal)
    Dim cSequenceType As Type = GetType(CSequence)
    Dim cStringType As Type = GetType(CString)
    Dim opCodeNameType As Type = GetType(OpCodeName)

    ' Declared before it is assigned so the lambda can call itself recursively.
    Dim ReadObject As Action(Of CObject) = Nothing
    ReadObject =
        Sub(obj As CObject)
            Dim objType As Type = obj.GetType
            Select Case objType
                Case cArrayType
                    ' Arrays (e.g. the TJ operand) are walked member by member.
                    Dim arrObj As CArray = DirectCast(obj, CArray)
                    For Each member As CObject In arrObj
                        ReadObject(member)
                    Next
                Case cOperatorType
                    Dim opObj As COperator = DirectCast(obj, COperator)
                    ' Compare operator names as strings so that Tj and TJ stay distinct.
                    Select Case System.Enum.GetName(opCodeNameType, opObj.OpCode.OpCodeName)
                        Case "ET", "Tx"
                            sbContents.Append(vbNewLine)
                        Case "Tj", "TJ"
                            For Each operand As CObject In opObj.Operands
                                ReadObject(operand)
                            Next
                        Case "QuoteSingle", "QuoteDbl"
                            sbContents.Append(vbNewLine)
                            For Each operand As CObject In opObj.Operands
                                ReadObject(operand)
                            Next
                        Case Else
                            'Do Nothing
                    End Select
                Case cSequenceType
                    Dim seqObj As CSequence = DirectCast(obj, CSequence)
                    For Each member As CObject In seqObj
                        ReadObject(member)
                    Next
                Case cStringType
                    sbContents.Append(DirectCast(obj, CString).Value)
                Case cCommentType, cIntegerType, cNameType, cNumberType, cRealType
                    'Do Nothing
                Case Else
                    Throw New NotImplementedException(obj.GetType().AssemblyQualifiedName)
            End Select
        End Sub

    Using pd As PdfDocument = PdfReader.Open(filePath, PdfDocumentOpenMode.ReadOnly)
        For Each page As PdfPage In pd.Pages
            ReadObject(ContentReader.ReadContent(page))
            If maxLength > 0 AndAlso sbContents.Length >= maxLength Then
                ' Truncate to exactly maxLength characters, keeping the leading text.
                sbContents.Length = maxLength
                Exit For
            End If
            sbContents.Append(vbNewLine)
        Next
    End Using

    Return sbContents.ToString
End Function

My 2 questions:
1. Would you mind taking a second to make any suggestions?
2. Is there some type of PDF encoding viewer I can download? It is very difficult for me to understand the rules that make up a PDF without some type of tag visualization.

Thanks very much!
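On the line-spacing question: in the PDF content-stream model, ET only closes a text object, while the operators that move to a new line are Td, TD, T*, ' and ". The following is a minimal sketch of a line-break heuristic built on that, not a definitive rule. It assumes operator names arrive as strings in the same form the function above gets from System.Enum.GetName (Tx for T*, QuoteSingle, QuoteDbl). StartsNewLine and its parameters are hypothetical names for illustration and are not part of PDFsharp. Tm (set text matrix) is left out because handling it would mean tracking the previous vertical position.

```vbnet
Imports System

Module LineBreakSketch
    ' Hypothetical helper (not a PDFsharp API): returns True when the operator
    ' is likely to start a new line of text.
    ' opName   : enum-style operator name ("Tx" stands for T*).
    ' operands : numeric operands, if any (for Td/TD these are tx, ty).
    Function StartsNewLine(opName As String, ParamArray operands As Double()) As Boolean
        Select Case opName
            Case "Tx", "QuoteSingle", "QuoteDbl"
                ' T*, ' and " always move to the next line.
                Return True
            Case "Td", "TD"
                ' Treat a purely horizontal offset (ty = 0) as staying on the same line.
                Return operands.Length >= 2 AndAlso operands(1) <> 0.0
            Case Else
                ' ET and everything else: no line break in this sketch.
                Return False
        End Select
    End Function

    Sub Main()
        Console.WriteLine(StartsNewLine("ET"))          ' False
        Console.WriteLine(StartsNewLine("Tx"))          ' True
        Console.WriteLine(StartsNewLine("Td", 0, -14))  ' True
        Console.WriteLine(StartsNewLine("Td", 120, 0))  ' False
    End Sub
End Module
```

With a helper like this, the "ET" case could stop appending vbNewLine, and the Td/TD operands could be read from opObj.Operands before deciding. Documents that position every line with Tm would still need extra handling.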
Author: | maxmoore14 [ Mon Feb 17, 2014 7:40 pm ] |
Post subject: | Re: PDF to Text |
Also, in case you are curious, the reason I am doing:

Code:
Select Case System.Enum.GetName(opCodeNameType, opObj.OpCode.OpCodeName)

instead of just:

Code:
Select Case opObj.OpCode.OpCodeName

is that VB is case-insensitive, so the enum members Tj and TJ would be the same identifier. Comparing the names as strings keeps them apart.
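A small sketch of why the string comparison works, assuming the default Option Compare Binary setting: VB identifiers ignore case, but string values in a Select Case do not. Describe is a hypothetical helper used only for this illustration.

```vbnet
Option Compare Binary

Imports System

Module CaseSensitivitySketch
    ' Hypothetical helper: with Option Compare Binary (the default),
    ' "Tj" and "TJ" are different string values and hit different cases.
    Function Describe(opName As String) As String
        Select Case opName
            Case "Tj"
                Return "show text string"
            Case "TJ"
                Return "show text array with positioning"
            Case Else
                Return "other"
        End Select
    End Function

    Sub Main()
        Console.WriteLine(Describe("Tj"))  ' show text string
        Console.WriteLine(Describe("TJ"))  ' show text array with positioning
    End Sub
End Module
```

If a project sets Option Compare Text, the two cases would compare as equal, so the workaround depends on binary comparison being in effect.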