PDFsharp & MigraDoc Foundation

PDFsharp - A .NET library for processing PDF & MigraDoc Foundation - Creating documents on the fly
All times are UTC


Posted: Wed May 19, 2010 2:52 pm
Joined: Wed May 19, 2010 2:40 pm
Posts: 5
Apologies if this has been dealt with before. I searched the forum for 'large image', 'split image', and other keywords describing what I think I am dealing with, but did not find an answer.

I need to extract images from a PDF file. I used the PDFsharp Extract Image example as a starting point and even added the ability to extract TIFF files using libtiff.

Things were working great until a large image was encountered. It is packed differently in the PDF.

The smaller image is packed in the PDF (by scanner software) like this...

<</Type/XObject
/Subtype/Image
/Width 2344
/Height 1654
/BitsPerComponent 1
/ColorSpace/DeviceGray
/Filter /CCITTFaxDecode
/DecodeParms <</Columns 2344 /Rows 1654>>
/Length 22493


The larger image is packed in the PDF (by scanner software) like this...

<</Type/XObject
/Subtype/Image
/Width 2344
/Height 1654
/BitsPerComponent 1
/ColorSpace/DeviceGray
/Decode[1 0]
/Length 484622

Because the /Filter entry is missing, the PDFsharp Extract Image code of course fails to decode the image. I also noticed that /DecodeParms is replaced with /Decode[1 0], which I take to mean this large image has been broken into two smaller objects in positions 1 and 0 of some sub-part. Can anyone lend a hand here?

It's as if, after parsing the /Subtype/Image token, another step is needed to check whether /DecodeParms or /Decode[1 0] is present and, if so, to drop down one more loop to collect the data. But I don't know how to piece the sub-parts together.

Thanks!


Posted: Wed May 19, 2010 3:01 pm
Joined: Wed May 19, 2010 2:40 pm
Posts: 5
Wait a minute... looking at some other posts, please do not tell me that if the /Filter is missing, one must look at the /ColorSpace and the /Decode entries and decode this oneself?

Should I take this to mean...

<</Type/XObject
/Subtype/Image
/Width 2344
/Height 1654
/BitsPerComponent 1
/ColorSpace/DeviceGray
/Decode[1 0]
/Length 484622

This is a black and white image, 1 bit per pixel, and the values are 0 = black, 1 = white in one large contiguous block of data?
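
A quick sanity check on that reading: if the stream really is unfiltered, /Length should equal the byte-padded row size times the height. The sketch below only does that arithmetic. The class and method names are made up for illustration and are not part of PDFsharp, and it assumes one color component per pixel, as in /DeviceGray.

```csharp
using System;

static class PdfRowMath
{
    // Bytes per row of unfiltered PDF image samples.
    // Each row is padded up to a whole number of bytes.
    public static int RowBytes(int width, int bitsPerComponent, int components)
    {
        return (width * bitsPerComponent * components + 7) / 8;
    }

    // Total size in bytes of the unfiltered sample data for the whole image.
    public static long ExpectedLength(int width, int height, int bitsPerComponent, int components)
    {
        return (long)RowBytes(width, bitsPerComponent, components) * height;
    }
}
```

For the dictionary above, RowBytes(2344, 1, 1) is 293 and ExpectedLength(2344, 1654, 1, 1) is 484622, which matches the /Length value exactly. That is consistent with one contiguous block of uncompressed 1-bit samples.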


Posted: Thu May 20, 2010 7:22 am
PDFsharp Guru
Joined: Mon Oct 16, 2006 8:16 am
Posts: 3101
Location: Cologne, Germany
Hi!

"/Decode[1 0]" simply inverts the image.
There is only one part of the image.

_________________
Regards
Thomas Hoevel
PDFsharp Team
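
To spell out the inversion: for a 1-bit /DeviceGray image, /Decode[1 0] maps sample value 0 to 1.0 (white) and sample value 1 to 0.0 (black), the reverse of the default [0 1]. If the consumer expects the default sense, one option is to flip every bit of the sample data. This is only a sketch; the class and method names are hypothetical and not a PDFsharp API.

```csharp
using System;

static class DecodeArray
{
    // Returns a copy of 1-bit sample data with every bit flipped, turning data
    // stored under /Decode [1 0] into the default /Decode [0 1] sense.
    // Padding bits at the end of each row are flipped too; readers ignore them.
    public static byte[] Invert(byte[] samples)
    {
        byte[] result = new byte[samples.Length];
        for (int i = 0; i < samples.Length; i++)
        {
            result[i] = (byte)(samples[i] ^ 0xFF);
        }
        return result;
    }
}
```

When the target is a Format1bppIndexed Bitmap, an alternative that avoids touching the data is to swap the two palette entries instead.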


Posted: Thu May 20, 2010 7:23 pm
Joined: Wed May 19, 2010 2:40 pm
Posts: 5
OK, so the problem now is purely a graphical one: pulling pixels and stuffing them into a Bitmap, which is beyond my skill set. And yes, it is off topic, but I'm hoping someone can help. Thanks!

This gets close, but each successive row in the image appears offset. So I know it is word-alignment padding or stride or some such with Bitmaps; I'm just not sure how to calculate it.

Code:
static void ExportBitmapImage(PdfDictionary image)
{
   int bitsPerComponent = image.Elements.GetInteger(PdfImage.Keys.BitsPerComponent);
   int width = image.Elements.GetInteger(PdfImage.Keys.Width);
   int height = image.Elements.GetInteger(PdfImage.Keys.Height);
   byte[] pixels = image.Stream.Value;   // raw sample data (this image has no /Filter)

   if (bitsPerComponent == 1) {

      Bitmap bmp = new Bitmap(width, height, PixelFormat.Format1bppIndexed);

      BitmapData bmpData = bmp.LockBits(new Rectangle(0, 0, width, height), ImageLockMode.ReadWrite, bmp.PixelFormat);

      // Please help here: how do I convert the contiguous byte array into what Bitmap wants or requires?

      // Copies the whole stream in one block, without regard to bmpData.Stride
      Marshal.Copy(pixels, 0, bmpData.Scan0, pixels.Length);

      bmp.UnlockBits(bmpData);

      bmp.Save("not-quite-right.bmp");

   }

}


Posted: Fri May 21, 2010 2:19 pm
Joined: Thu Feb 25, 2010 2:44 pm
Posts: 14
My co-worker found a solution on StackOverflow, although the code posted there was not perfect because it assumed it would work for all images when in fact it only works for monochrome (hence my comment in the switch block):

Code:
// Assume you've already obtained the image dictionary
// in the variable 'xObject'
string filter = xObject.Elements.GetName(PdfImage.Keys.Filter);

switch (filter)
{
   // ...
   // Other cases omitted for clarity
   // ...
   case "/FlateDecode":
      byte[] raw = Filtering.FlateDecode.Decode(xObject.Stream.Value);
      int width = xObject.Elements.GetInteger(PdfImage.Keys.Width);
      int height = xObject.Elements.GetInteger(PdfImage.Keys.Height);
      int bitsPerComponent = xObject.Elements.GetInteger(PdfImage.Keys.BitsPerComponent);
      PixelFormat pixelFormat;
      
      switch (bitsPerComponent)
      {
         case 1:
            pixelFormat = PixelFormat.Format1bppIndexed;
            break;
         case 8:
            // TODO: The Marshal.Copy code below will only work with monochrome
            // bitmaps, so color bitmaps need to be handled differently
            // (By the way, PDFsharp forum, I have written code to handle this,
            // at least for non-transparent color images. I'll post it once it
            // handles transparency too.)
            pixelFormat = PixelFormat.Format24bppRgb;
            break;
         default:
            throw new Exception(String.Format("Unknown pixel format {0}.", bitsPerComponent));
      }
      
      Bitmap bitmap = new Bitmap(width, height, pixelFormat);
      BitmapData bitmapData = bitmap.LockBits(new Rectangle(0, 0, width, height), ImageLockMode.WriteOnly, pixelFormat);
      Marshal.Copy(raw, 0, bitmapData.Scan0, raw.Length);
      bitmap.UnlockBits(bitmapData);
      using (MemoryStream imageStream = new MemoryStream())
      {
         bitmap.Save(imageStream, ImageFormat.Jpeg);
         // Do something useful with imageStream
      }
      break;
}

It worked on the one or two tests I put it through. Let me know if it works for you.


Posted: Fri May 21, 2010 3:02 pm
Joined: Wed May 19, 2010 2:40 pm
Posts: 5
For the critical parts, that is the same code as I posted. Any other ideas?

Was this code tested on arbitrary PDFs, or by any chance just against PDFs where the streams in question contained raw pixels from Bitmaps that were Format1bppIndexed?

The reason I ask is that the following code produces the correct image using the same concept, only with libtiff writing the raw data rather than the C# Bitmap class. So I'm wondering whether this is a problem with Bitmap and Format1bppIndexed not supporting this operation?

Code:

byte[] pixels = xobject.Stream.Value;

int tif = TIFFOpen("c:\\example.tif", "w");

TIFFSetField(tif, (uint)BitMiracle.LibTiff.Classic.TiffTag.IMAGEWIDTH, (uint)width);

// TIFF calls the image height "ImageLength"
TIFFSetField(tif, (uint)BitMiracle.LibTiff.Classic.TiffTag.IMAGELENGTH, (uint)height);

TIFFSetField(tif, (uint)BitMiracle.LibTiff.Classic.TiffTag.COMPRESSION,
                      (uint)BitMiracle.LibTiff.Classic.Compression.NONE);

// MINISWHITE: sample value 0 is white, matching /Decode[1 0]
TIFFSetField(tif, (uint)BitMiracle.LibTiff.Classic.TiffTag.PHOTOMETRIC,
                      (uint)BitMiracle.LibTiff.Classic.Photometric.MINISWHITE);

TIFFSetField(tif, (uint)BitMiracle.LibTiff.Classic.TiffTag.BITSPERSAMPLE,
                      (uint)bitsPerComponent);

TIFFSetField(tif, (uint)BitMiracle.LibTiff.Classic.TiffTag.SAMPLESPERPIXEL, 1);

// Copy the managed bytes to unmanaged memory for the native call
IntPtr pointer = Marshal.AllocHGlobal(pixels.Length);

Marshal.Copy(pixels, 0, pointer, pixels.Length);

TIFFWriteRawStrip(tif, 0, pointer, pixels.Length);

TIFFClose(tif);

// Release the unmanaged buffer
Marshal.FreeHGlobal(pointer);





Again I'm new to this but here are the tags for the stream in question. And there is no /Filter tag so this is just raw pixels right? 1 bit per pixel inverting black and white?

<</Type/XObject
/Subtype/Image
/Width 2344
/Height 1654
/BitsPerComponent 1
/ColorSpace/DeviceGray
/Decode[1 0]
/Length 484622


Posted: Fri May 21, 2010 3:09 pm
Joined: Wed May 19, 2010 2:40 pm
Posts: 5
Attached is a screenshot of the output when using the Bitmap, LockBits, Marshal.Copy approach. Sorry, I cannot send the original; it contains information of a sensitive nature. You can see it looks like a stride/padding issue.


Attachment: screenshot-small.png (example output with Bitmap LockBits Marshal.Copy method)

Posted: Fri May 21, 2010 3:55 pm
Joined: Tue Oct 14, 2008 6:15 pm
Posts: 32
Location: USA
If the image is a bitmap then the pixel data would be stored from the bottom to the top, left to right...which could explain the funky output.

http://en.wikipedia.org/wiki/BMP_file_format


Posted: Fri May 21, 2010 6:27 pm
Joined: Thu Feb 25, 2010 2:44 pm
Posts: 14
Aha! The image I was testing just happened to have the right width to cause each scan line to naturally end on a 32-bit boundary (i.e., each line had a multiple of 4 bytes). When I added 4 pixels to the width to change this, the resulting extracted bitmap was shifted, just like yours. This happened regardless of whether the image was FlateDecoded or stored uncompressed. (I tested this because I was pretty sure it made no difference whether the image was compressed or not.)

So it looks like the direct copy method is a bust 75% of the time and you need to write code to pad out each scan line to a 4-byte boundary. I guess I really should get around to implementing my ExtractIndexedImageGrayscale() method, huh?

And no, Soldier-B, it has nothing to do with bottom-to-top; it's all about the 4-byte boundary padding.
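
The boundary effect described here can be checked with a little arithmetic. The helper below is a sketch with made-up names (nothing here comes from PDFsharp or System.Drawing). It assumes the usual Windows bitmap rule that each row occupies a multiple of 4 bytes, while PDF sample rows are padded only to a whole byte.

```csharp
using System;

static class StrideMath
{
    // Row size in PDF sample data: padded to a multiple of 8 bits.
    public static int PdfRowBytes(int width, int bitsPerPixel)
    {
        return (width * bitsPerPixel + 7) / 8;
    }

    // Row size in a Windows bitmap: padded to a multiple of 32 bits.
    public static int WindowsStride(int width, int bitsPerPixel)
    {
        return ((width * bitsPerPixel + 31) / 32) * 4;
    }

    // A single block copy of the whole stream only lines up when both row sizes agree.
    public static bool DirectCopyIsSafe(int width, int bitsPerPixel)
    {
        return PdfRowBytes(width, bitsPerPixel) == WindowsStride(width, bitsPerPixel);
    }
}
```

For the 2344-pixel-wide, 1-bit image from the first post, the PDF row is 293 bytes and the Windows row is 296 bytes, so every row of a direct copy starts 3 bytes (24 pixels) too early relative to the previous one, and the error accumulates down the image. A width of 2336 gives 292 bytes in both layouts, which is the lucky case.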


Posted: Tue May 25, 2010 8:07 am
PDFsharp Guru
Joined: Mon Oct 16, 2006 8:16 am
Posts: 3101
Location: Cologne, Germany
With PDF bitmaps, each row is padded to a BYTE boundary (multiple of 8 bits).
With Windows bitmaps, each row is padded to a DWORD boundary (multiple of 32 bits).

_________________
Regards
Thomas Hoevel
PDFsharp Team
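
Given those two padding rules, one possible fix is to copy the data row by row into a buffer that uses the destination stride. The following is a minimal sketch, not code from PDFsharp. The class, method, and parameter names are invented for illustration, and the caller is assumed to supply both row sizes (for example, the destination stride from BitmapData.Stride).

```csharp
using System;

static class RowRepacker
{
    // Copies rows of srcRowBytes bytes each into a new buffer whose rows are
    // dstStride bytes long. The extra bytes at the end of each row stay zero.
    public static byte[] PadRows(byte[] source, int srcRowBytes, int dstStride, int height)
    {
        if (dstStride < srcRowBytes)
            throw new ArgumentException("Destination stride is smaller than the source row.");
        if (source.Length < (long)srcRowBytes * height)
            throw new ArgumentException("Source buffer is too short for the given dimensions.");

        byte[] padded = new byte[dstStride * height];
        for (int y = 0; y < height; y++)
        {
            // Each source row starts right after the previous one;
            // each destination row starts on a stride boundary.
            Buffer.BlockCopy(source, y * srcRowBytes, padded, y * dstStride, srcRowBytes);
        }
        return padded;
    }
}
```

With the image from this thread, PadRows(pixels, 293, bmpData.Stride, 1654) should yield a buffer that can be handed to Marshal.Copy in place of the raw stream. It is worth confirming at run time that bmpData.Stride is positive and at least as large as the PDF row size before copying.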

