iTextSharp in .NET: extract-image code, I can't get this to work

眼角桃花 2021-01-23 07:17

I simply want to extract all images from a PDF. I found some code that looks like exactly what I need:

Private Sub getAllImages(ByVal dict As pdf.Pdf         


        
1 answer
  •  野趣味 (thread starter)
    2021-01-23 07:47

    The way this person wrote this method can seem weird if you don't understand the internals of PDFs and/or iTextSharp. The method takes three parameters. The first is a PdfDictionary, which you obtain by calling GetPageN(Integer) on the reader for each of your pages. The second is a generic list that you need to initialize yourself before calling the method. The method is intended to be called in a loop, once per page, and each call appends that page's images to the list. The last parameter, the PdfReader itself, you understand already.

    So here's the code to call this method:

    ' Requires: Imports System.IO, System.Collections.Generic, iTextSharp.text.pdf

    ' Source file to read images from
    Dim InputFile = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "FileWithImages.pdf")
    
    ' List to dump images into
    Dim Images As New List(Of Byte())
    
    ' Main PDF reader
    Dim Reader As New PdfReader(InputFile)
    
    ' Total number of pages in the PDF
    Dim PageCount = Reader.NumberOfPages
    
    ' Loop through each page (first page is one, not zero)
    For I = 1 To PageCount
        getAllImages(Reader.GetPageN(I), Images, Reader)
    Next

    ' Close the reader once all pages have been processed
    Reader.Close()
    

    VERY, VERY IMPORTANT: iTextSharp is NOT a PDF renderer, it is a PDF composer. This means it knows it has image-like objects, but it doesn't necessarily know much about them. To say it another way, iTextSharp knows that a given byte array represents something that the PDF standard says is an image, but it doesn't know or care whether it's a JPEG, TIFF, BMP or something else. All iTextSharp cares about is that the object has a few standard properties it can manipulate, like X, Y and effective width and height. PDF renderers handle the job of converting the bytes to an actual image. In this case you are the PDF renderer, so it's your job to figure out how to process the byte array as an image.
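
    Since working out what a byte array holds is left to you, one cheap first check is to look at its leading "magic" bytes. The sketch below is only an illustration: GuessExtension and StartsWith are hypothetical helpers, not part of iTextSharp, and they only help when a stream happens to embed a complete file such as a JPEG. Raw sample data (for example, what is left after inflating a /FlateDecode stream) normally has no header at all, and short signatures like the two-byte BMP one can match by accident.

    ```vbnet
    Imports System

    Module ImageSniffer
        ' Hypothetical helper (not an iTextSharp API): guesses a file extension
        ' from the first few bytes of the data. Returns ".bin" when unsure.
        Public Function GuessExtension(ByVal data As Byte()) As String
            If StartsWith(data, &HFF, &HD8, &HFF) Then Return ".jpg"        ' JPEG
            If StartsWith(data, &H89, &H50, &H4E, &H47) Then Return ".png"  ' PNG
            If StartsWith(data, &H47, &H49, &H46, &H38) Then Return ".gif"  ' "GIF8"
            If StartsWith(data, &H49, &H49, &H2A, &H0) Then Return ".tif"   ' TIFF, little-endian
            If StartsWith(data, &H4D, &H4D, &H0, &H2A) Then Return ".tif"   ' TIFF, big-endian
            If StartsWith(data, &H42, &H4D) Then Return ".bmp"              ' "BM" (weak signature)
            Return ".bin"
        End Function

        ' True when data begins with exactly the given signature bytes.
        Private Function StartsWith(ByVal data As Byte(), ByVal ParamArray signature As Byte()) As Boolean
            If data Is Nothing OrElse data.Length < signature.Length Then Return False
            For i As Integer = 0 To signature.Length - 1
                If data(i) <> signature(i) Then Return False
            Next
            Return True
        End Function

        Sub Main()
            Dim jpegHeader As Byte() = {&HFF, &HD8, &HFF, &HE0}
            Console.WriteLine(GuessExtension(jpegHeader))   ' prints .jpg
        End Sub
    End Module
    ```

    Something like File.WriteAllBytes("image" & index & GuessExtension(bytes), bytes) inside a loop over the list would then at least let you open each result and see what you actually got.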

    Specifically, you'll see in that method that there's a line that reads:

    If filter = "/FlateDecode" Then
    

    This is often written as a Select Case (or switch) statement that handles the various values of filter. The method you are referencing only handles FlateDecode, which is pretty common, although there are actually 10 standard filters, including CCITTFaxDecode, JBIG2Decode and DCTDecode (PDF Spec 7.4 - Filters). You should modify the method to include a catch-all of some sort (an Else or Default case) so that you are at least aware of images you aren't set up to process.
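
    As a rough sketch of that Select Case shape, the helper below takes the filter string in the same form the quoted line compares against (with the leading slash) and returns a note on what kind of data to expect. DescribeFilter is a hypothetical name used for illustration, and the notes are my own summary rather than anything the original method does. A /Filter entry can also be an array of names or be missing altogether; both would land in Case Else here.

    ```vbnet
    Imports System

    Module FilterDispatch
        ' Hypothetical helper: maps a /Filter name to a short handling note.
        ' The Case Else branch is the "catch" that reports anything unexpected.
        Public Function DescribeFilter(ByVal filter As String) As String
            Select Case filter
                Case "/FlateDecode"
                    Return "zlib-compressed; inflate, then interpret the raw samples"
                Case "/DCTDecode"
                    Return "JPEG data; the raw stream bytes can usually be saved as .jpg"
                Case "/JPXDecode"
                    Return "JPEG 2000 data"
                Case "/CCITTFaxDecode"
                    Return "CCITT fax data; needs a fax-capable decoder"
                Case "/JBIG2Decode"
                    Return "JBIG2 data; needs a JBIG2 decoder"
                Case "/LZWDecode", "/RunLengthDecode", "/ASCIIHexDecode", "/ASCII85Decode", "/Crypt"
                    Return "standard filter, not handled in this sketch"
                Case Else
                    Return "unrecognized filter: " & filter
            End Select
        End Function

        Sub Main()
            Console.WriteLine(DescribeFilter("/DCTDecode"))
            Console.WriteLine(DescribeFilter("/SomethingElse"))
        End Sub
    End Module
    ```

    In the real method you would replace the returned notes with actual handling code, and log or collect whatever reaches Case Else so you know which images were skipped.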

    Additionally, within the /FlateDecode section you'll see this line:

    Select Case Integer.Parse(bpp)
    

    This reads an attribute of the image object that tells the renderer how many bits should be used for each color when parsing. Once again, you are the PDF renderer in this case, so it's up to you to figure out what to do. The code that you referenced only accounts for monochrome (1 bpp) or truecolor (24 bpp) images, but others should definitely be accounted for, especially 8 bpp.
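
    One thing worth checking when you extend that Select Case: as I read the PDF spec, /BitsPerComponent is counted per color component (1, 2, 4, 8 or 16), so a truecolor RGB image would typically report 8 with three components rather than 24. The sketch below works out bits per pixel and bytes per row under that reading. The function names are hypothetical, and componentCount is something you would have to derive yourself from the image's /ColorSpace entry.

    ```vbnet
    Imports System

    Module SampleLayout
        ' Bits per pixel = bits per component x number of color components.
        ' componentCount is supplied by the caller (an assumption of this sketch):
        ' 1 for DeviceGray or Indexed, 3 for DeviceRGB, 4 for DeviceCMYK.
        Public Function BitsPerPixel(ByVal bitsPerComponent As Integer, ByVal componentCount As Integer) As Integer
            Select Case bitsPerComponent
                Case 1, 2, 4, 8, 16
                    Return bitsPerComponent * componentCount
                Case Else
                    Throw New ArgumentException("Unexpected BitsPerComponent: " & bitsPerComponent.ToString())
            End Select
        End Function

        ' Each row of PDF sample data starts on a byte boundary, so round up
        ' to a whole number of bytes.
        Public Function BytesPerRow(ByVal width As Integer, ByVal pixelBits As Integer) As Integer
            Return (width * pixelBits + 7) \ 8
        End Function

        Sub Main()
            Console.WriteLine(BitsPerPixel(8, 3))    ' 24
            Console.WriteLine(BytesPerRow(10, 1))    ' 2
            Console.WriteLine(BytesPerRow(10, 24))   ' 30
        End Sub
    End Module
    ```

    If the decoded byte count doesn't equal BytesPerRow times the image height, that is usually a sign the color space or bit depth guess is wrong.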

    To sum up: hopefully the code works for you as is, but don't be surprised if it complains a lot and/or misses images. Extracting images can actually be very frustrating at times. If you do run into problems, start a new question here referencing this one and hopefully we can help you more!
