PDFsharp & MigraDoc Foundation https://forum.pdfsharp.net/ |
How to modify resource stream and write back to document https://forum.pdfsharp.net/viewtopic.php?f=2&t=4536 |
Page 1 of 1 |
Author: | Husky [ Wed Jan 17, 2024 2:19 pm ] |
Post subject: | How to modify resource stream and write back to document |
Hello,

I would like to parse and modify some PDF contents. What exactly needs to be modified does not matter; my code is generic. Some contents are in the page stream, and some are inside nested /Form objects in the page resource dictionary. I am very familiar and comfortable with the PDF Specification and all its operators, and I have almost-working code listed below.

The only part I have trouble with is marked in my source code and summarized in the 3 questions below. I have reduced the source code to the essential part, replacing the streams of resource objects in the resource dictionary, but it is still complete code.

With the provided sample file that shows the problem, I get a blank page because the created file is missing the resource stream content on object /Xf1, which is object 2 0 R.

I am using PDFSharp version 1.50.5147 on .NET Framework 4.6.2 and 4.8.

My questions are marked in the source below, which is the complete code that can reproduce the problem:
1-How do I turn the modified content object back into a byte array?
2-How do I attach that byte array back to the Stream object? My code sets a new stream, but it is not written into the output file.
3-How do I persist the modified Stream in the Resource dictionary, so pdfDoc.Save will save changes?

What am I doing wrong? Any help appreciated.

Thank you and brilliant work on this library. The processing speed is amazing and possibilities thanks to low level functions are endless.

Sample PDF File 552kb : https://www.dropbox.com/scl/fi/d77iyy9x ... u1w9g&dl=0

To call the main function use Call ParseAndModifyContent("input.pdf", "output.pdf")

Code:
Imports PdfSharp.Pdf
Imports PdfSharp.Pdf.Advanced
Imports PdfSharp.Pdf.Content
Imports PdfSharp.Pdf.Content.Objects
Imports PdfSharp.Pdf.IO

Private Sub ParseAndModifyContent(input As String, output As String)
    Using pdfdoc As PdfDocument = PdfReader.Open(input, PdfDocumentOpenMode.Modify)
        Dim page As PdfPage
        Dim pagecount As Integer = pdfdoc.PageCount
        'loop all pages
        For ipage As Integer = 0 To pagecount - 1
            page = pdfdoc.Pages(ipage)
            Dim contents As CSequence = ContentReader.ReadContent(page)
            Call ProcessContentObjects(contents)
            page.Contents.ReplaceContent(contents)
            Dim pdfres As PdfResources = page.Resources
            Call ProcessResources(pdfres)
        Next
        'save modified file
        pdfdoc.Save(output)
    End Using
End Sub

Private Sub ProcessContentObjects(contents As Objects.CSequence)
    Dim cOp As COperator
    'loop all content objects
    For i As Integer = 0 To contents.Count - 1
        If contents(i).GetType Is GetType(COperator) Then
            cOp = contents(i)
            Debug.WriteLine(cOp.OpCode.Name & " - " & cOp.OpCode.Postscript & " - " & cOp.OpCode.Description)
            'do anything needed with this Operator and its Operands
            'this part works fine
        End If
    Next
End Sub

Private Sub ProcessResources(res As PdfDictionary)
    'check if XObjects exist
    If res IsNot Nothing AndAlso res.Elements.ContainsKey(PdfResources.Keys.XObject) Then
        Dim xObj As PdfDictionary = res.Elements.GetDictionary(PdfResources.Keys.XObject)
        If xObj IsNot Nothing Then
            Dim items As ICollection(Of PdfItem) = xObj.Elements.Values
            For Each item As PdfItem In items
                If item.GetType Is GetType(PdfReference) Then
                    Dim ref As PdfReference = DirectCast(item, PdfReference)
                    Debug.WriteLine("ObjectNumber = " & ref.ObjectNumber) 'avoid processing endless recursions using this number if needed
                    Dim xObj2 As PdfDictionary = ref.Value
                    'check if Subtype is /Form
                    If xObj2.Elements.GetString("/Subtype") = "/Form" Then
                        'get content bytes of stream
                        Dim stream As PdfDictionary.PdfStream = xObj2.Stream
                        'check if a content stream exists
                        If stream IsNot Nothing Then
                            'get unfiltered/uncompressed bytes
                            Dim contentbytes() As Byte = stream.UnfilteredValue
                            Dim encoder As Internal.RawEncoding = New PdfSharp.Pdf.Internal.RawEncoding()
                            'get stream content as string to check visually
                            Dim content_string As String = encoder.GetString(contentbytes)
                            Debug.WriteLine(content_string)
                            'get content objects
                            Dim contents As CSequence = ContentReader.ReadContent(contentbytes)
                            'process content objects same as page contents
                            Call ProcessContentObjects(contents)
                            '--------PROBLEM STARTS HERE
                            '1-How do I turn the modified content object back into a byte array
                            '2-How do I attach that byte array back to the Stream object
                            '3-How do I persist the modified Stream in the Resource dictionary, so pdfDoc.Save will save changes ?
                            'testing with unmodified content, just writing same bytes back
                            Dim modifiedcontentbytes() As Byte = contentbytes.Clone
                            'write modified content back to stream and compress
                            '-------THIS PART FAILS-------Output PDF has no Stream, but no error in code
                            xObj2.Stream = Nothing
                            xObj2.Stream = xObj2.CreateStream(modifiedcontentbytes)
                            xObj2.Stream.Zip()
                        End If
                        'get nested resources if they exist
                        Dim res2 As PdfDictionary = xObj2.Elements.GetDictionary("/Resources")
                        'recursive call
                        If res2 IsNot Nothing Then Call ProcessResources(res2)
                    ElseIf xObj2.Elements.GetString("/Subtype") = "/Image" Then
                        'process anything for /Image
                    Else
                        'process anything for other Subtypes
                        Debug.WriteLine(xObj2.Elements.GetString("/Subtype"))
                    End If
                End If
            Next
        End If
    End If
End Sub
Author: | (void) [ Sat Jan 20, 2024 4:51 pm ] |
Post subject: | Re: How to modify resource stream and write back to document |
Short answer to all 3 questions: You already did everything that is needed. There is just a little detail missing: when overwriting stream data, make sure you remove all filters from the parent dictionary beforehand (the /Filter entry).

In the linked document the stream already has a filter applied (with the value /FlateDecode). When you attach a new stream with

Code:
xObj2.Stream = xObj2.CreateStream(modifiedcontentbytes)

the stream data is no longer flate-encoded as the filter states. The following call to

Code:
xObj2.Stream.Zip()

does nothing, as there is already a filter specified in the parent dictionary. The stream data is still saved, but is no longer flate-encoded. A PDF reader trying to open the document may fail silently trying to decode the stream, resulting in an "empty" document.

This should work (untested):

Code:
'Remove existing filter
xObj2.Elements.Remove("/Filter")
' Set new value for stream (has the same effect as creating a new stream)
xObj2.Stream.Value = modifiedcontentbytes
' Zip it, this creates a new /Filter entry in the parent dictionary
xObj2.Stream.Zip()
Author: | Husky [ Sat Jan 20, 2024 7:09 pm ] |
Post subject: | Re: How to modify resource stream and write back to document |
Your suggestion is a giant step in the right direction - very much appreciated, but one piece of the puzzle is still missing.

First the good news: The missing stream data and its filter are now properly created. A valid PDF with no errors is created as output. So replacing the stream data in the resource dictionary with unmodified bytes was successful. I would have never discovered that one needs to remove the existing /Filter first. Thank you very much for that guidance.

The last part of the puzzle is still unsolved in my code: my sample code wrote back the unmodified byte array, and that works. But how do I convert modified objects of type CSequence back to a byte array so they can be written back to the stream?

Code:
Dim contents As CSequence = ContentReader.ReadContent(contentbytes) 'works
Call ProcessContentObjects(contents) 'works
'I cannot find a method to convert modified object of type CSequence back to a byte array.
Dim modifiedcontentbytes() As Byte = ContentWriter.WriteContentFromObjects(contents) 'looking for something like this

In the part where page content is directly accessible, modification is done over the reference to the page content object, and the page contents are replaced with one additional line of code:

Code:
Dim contents As CSequence = ContentReader.ReadContent(page)
Call ProcessContentObjects(contents)
page.Contents.ReplaceContent(contents)

This is not possible for nested /Form objects because there does not seem to be a corresponding method:

Code:
Dim pdfres As PdfResources = page.Resources 'works
Call ProcessResources(pdfres) 'works
page.Resources.ReplaceResources(pdfres) 'there is no such method

There has to be a simple way to convert an object of type CSequence to a byte array that can be written to the stream, or am I missing something?
Author: | Husky [ Sat Jan 20, 2024 7:35 pm ] |
Post subject: | Re: How to modify resource stream and write back to document |
Solved it with a one-liner:

Code:
Dim modifiedcontentbytes() As Byte = contents.ToContent

Works brilliantly.

In case the Development Team reads this: viewtopic.php?f=3&t=3468 describes the same issue about removing /Filter first, which is not intuitive to the average user. Any of the following would make this great library even better:

1- Update the existing summary of the function CreateStream

Code:
' Summary:
' Creates the stream of this dictionary and initializes it with the specified byte
' array. The function must not be called if the dictionary already has a stream.
' You may need to remove the existing /Filter from the parent element object first
Public Function CreateStream(value() As Byte) As PdfStream

2- Update the existing summary of the property Stream

Code:
' Summary:
' Gets or sets the PDF stream belonging to this dictionary. Returns null if the
' dictionary has no stream. To create the stream, call the CreateStream function.
' You may need to remove the existing /Filter from the parent element object first
Public Property Stream As PdfStream

3- Implement a method .CreateStreamWithFilter, or allow an additional parameter in

Code:
Public Function CreateStream(value() As Byte, updateFilter as Boolean) As PdfStream

and let PDFSharp set/clear the existing /Filter on the parent stream object.

4- Extend Zip, whose current summary reads

Code:
' Summary:
' Compresses the stream with the FlateDecode filter. If a filter is already defined,
' the function has no effect.
Public Sub Zip()

Add an optional parameter or an overload Public Sub Zip(ResetFilter as boolean) which deletes the existing filter, does the zipping and sets its own proper filter.

Thank you
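Putting the two findings of this thread together (contents.ToContent for serialization, and removing /Filter before writing), the write-back step could be sketched roughly as below. This is only a sketch under assumptions: FormStreamSketch and ReplaceFormContent are hypothetical names, not part of PDFsharp, and removing /DecodeParms is an extra precaution that was not discussed above. The PDFsharp calls themselves are the ones used in the posts.

```vbnet
Imports System
Imports PdfSharp.Pdf
Imports PdfSharp.Pdf.Content.Objects

Module FormStreamSketch

    ' Hypothetical helper (not part of PDFsharp): serializes a modified CSequence
    ' and stores it as the stream of an existing dictionary, e.g. a /Form XObject.
    Public Sub ReplaceFormContent(form As PdfDictionary, contents As CSequence)
        If form Is Nothing OrElse form.Stream Is Nothing Then
            Throw New ArgumentException("Dictionary has no stream to replace.", "form")
        End If

        ' CSequence -> uncompressed content bytes (the one-liner from this thread)
        Dim bytes() As Byte = contents.ToContent()

        ' Remove the old filter so it no longer describes data that is about to change
        form.Elements.Remove("/Filter")
        ' Assumption: parameters belonging to the old filter should be dropped as well
        form.Elements.Remove("/DecodeParms")

        ' Replace the stream data with the new, unfiltered bytes
        form.Stream.Value = bytes
        ' With no /Filter entry present, Zip() compresses and sets /FlateDecode
        form.Stream.Zip()
    End Sub

End Module
```

In the ProcessResources routine from the first post, this would be called as ReplaceFormContent(xObj2, contents) right after ProcessContentObjects(contents), in place of the CreateStream block marked as failing.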
All times are UTC |