I am looking to simply extract all images from a PDF. I found some code that looks like it is exactly what I need:
Private Sub getAllImages(ByVal dict As pdf.Pdf
The way this person wrote this method can seem weird if you don't understand the internals of PDFs and/or iTextSharp. The method takes three parameters. The first is a PdfDictionary, which you obtain by calling GetPageN(Integer) on each of your pages. The second is a generic list, which you need to initialize yourself before calling the method. The method is intended to be called in a loop, once for each page in the PDF, and each call appends images to this list. The last parameter you understand already.
So here's the code to call this method:
''//Source file to read images from
Dim InputFile = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "FileWithImages.pdf")
''//List to dump images into
Dim Images As New List(Of Byte())
''//Main PDF reader
Dim Reader As New PdfReader(InputFile)
''//Total number of pages in the PDF
Dim PageCount = Reader.NumberOfPages
''//Loop through each page (first page is one, not zero)
For I = 1 To PageCount
    getAllImages(Reader.GetPageN(I), Images, Reader)
Next
VERY, VERY IMPORTANT - iTextSharp is NOT a PDF renderer, it is a PDF composer. What this means is that it knows it has image-like objects but it doesn't necessarily know much about them. To say it another way, iTextSharp knows that a given byte array represents something that the PDF standard says is an image, but it doesn't know or care if it's a JPEG, TIFF, BMP or something else. All iTextSharp cares about is that this object has a few standard properties it can manipulate, like X, Y and effective width and height. PDF renderers handle the job of converting the bytes to an actual image. In this case, you are the PDF renderer, so it's your job to figure out how to process the byte array as an image.
Specifically, you'll see in that method that there's a line that reads:
If filter = "/FlateDecode" Then
This is often written as a Select Case or switch statement to process the various values of filter. The method you are referencing only handles FlateDecode, which is pretty common, although there are actually 10 standard filters, such as CCITTFaxDecode, JBIG2Decode and DCTDecode (PDF Spec 7.4 - Filters). You should modify the method to include a catch-all of some sort (an Else or Default case) so that you are at least aware of images you aren't set up to process.
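As a rough illustration of that catch-all, here is a minimal sketch that only classifies the filter name and does not touch iTextSharp at all. The module name FilterCheck and the helper CanDecode are hypothetical and are not part of the method you found; the filter names are the standard ones mentioned above.

```vbnet
Imports System

Module FilterCheck
    ''//Hypothetical helper: reports whether a decoder is wired up for a /Filter name
    Function CanDecode(ByVal filter As String) As Boolean
        Select Case filter
            Case "/FlateDecode"
                ''//The only filter the referenced method handles
                Return True
            Case "/DCTDecode", "/CCITTFaxDecode", "/JBIG2Decode"
                ''//Standard filters that have no decoder in this sketch
                Console.WriteLine("Skipping image, no decoder for " & filter)
                Return False
            Case Else
                ''//Catch-all so unknown or missing filters are at least reported
                Console.WriteLine("Skipping image, unrecognized filter " & filter)
                Return False
        End Select
    End Function

    Sub Main()
        Console.WriteLine(CanDecode("/FlateDecode"))
        Console.WriteLine(CanDecode("/DCTDecode"))
    End Sub
End Module
```

In the real method you would probably put the existing FlateDecode logic in the first case and log or collect the skipped images in the others, so you can see which filters your PDFs actually use.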
Additionally, within the /FlateDecode section you'll see this line:
Select Case Integer.Parse(bpp)
This is reading an attribute associated with the image object that tells the renderer how many bits should be used for each color when parsing. Once again, you are the PDF renderer in this case, so it's up to you to figure out what to do. The code that you referenced only accounts for monochrome (1 bpp) or truecolor (24 bpp) images, but others should definitely be accounted for, especially 8 bpp.
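One possible way to extend that Select Case is sketched below. The helper ToPixelFormat and the module name are hypothetical, and mapping 8 bpp to a palette-indexed format is an assumption: depending on the image's color space, an 8-bit image may need a grayscale palette or different handling altogether.

```vbnet
Imports System
Imports System.Drawing.Imaging

Module PixelFormatCheck
    ''//Hypothetical helper: maps the parsed bit-depth value to a GDI+ pixel format
    Function ToPixelFormat(ByVal bpp As Integer) As PixelFormat
        Select Case bpp
            Case 1
                ''//Monochrome, as in the referenced code
                Return PixelFormat.Format1bppIndexed
            Case 8
                ''//Assumption: palette-indexed; the Bitmap still needs a palette assigned
                Return PixelFormat.Format8bppIndexed
            Case 24
                ''//Truecolor, as in the referenced code
                Return PixelFormat.Format24bppRgb
            Case Else
                ''//Fail loudly so unsupported images are not silently dropped
                Throw New NotSupportedException("Unhandled bit depth: " & bpp.ToString())
        End Select
    End Function

    Sub Main()
        Console.WriteLine(ToPixelFormat(8).ToString())
    End Sub
End Module
```

Throwing in the Case Else branch is a design choice for the sketch; if you would rather keep going and just skip the image, catch the exception in the calling loop and log it.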
So, summing this up: hopefully the code works for you as is, but don't be surprised if it complains a lot and/or misses images. Extracting images can actually be very frustrating at times. If you do run into problems, start a new question here referencing this one and hopefully we can help you more!