iTextSharp in .NET: extract image code, I can't get this to work


I am looking to simply extract all images from a pdf. I found some code that looks like it is exactly what I need

Private Sub getAllImages(ByVal dict As pdf.Pdf


                      
1 answer

野趣味
2021-01-23 07:47

The way this method is written can seem weird if you don't understand the internals of PDFs and/or iTextSharp. It takes three parameters. The first is a PdfDictionary, which you obtain by calling GetPageN(Integer) on the reader for each of your pages. The second is a generic list that you need to initialize yourself before calling the method. The method is intended to be called in a loop, once per page of the PDF, and each call appends that page's images to the list. The last parameter you understand already.

So here's the code to call this method:

' Imports needed at the top of the file
Imports System.IO
Imports iTextSharp.text.pdf

' Source file to read images from
Dim InputFile = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "FileWithImages.pdf")

' List to dump images into
Dim Images As New List(Of Byte())

' Main PDF reader
Dim Reader As New PdfReader(InputFile)

' Total number of pages in the PDF
Dim PageCount = Reader.NumberOfPages

' Loop through each page (first page is one, not zero)
For I = 1 To PageCount
    getAllImages(Reader.GetPageN(I), Images, Reader)
Next

' Release the file once all pages have been processed
Reader.Close()


VERY, VERY IMPORTANT - iTextSharp is NOT a PDF renderer, it is a PDF composer. This means it knows that it has image-like objects but doesn't necessarily know much about them. To put it another way, iTextSharp knows that a given byte array represents something the PDF standard says is an image, but it doesn't know or care whether it's a JPEG, TIFF, BMP or something else. All iTextSharp cares about is that the object has a few standard properties it can manipulate, like X, Y and effective width and height. PDF renderers handle the job of converting the bytes to an actual image. In this case, you are the PDF renderer, so it's your job to figure out how to process the byte array as an image.
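
As a rough illustration of "being the renderer", one possible first step is to look at the leading bytes of each extracted array and guess the container format. This is only a sketch: GuessImageFormat is a hypothetical helper, not part of iTextSharp or of the method you found, and it assumes the bytes are a complete encoded file. Raw decompressed pixel data will not match any signature and falls through to "unknown".

```vbnet
Imports System

Module ImageSniffSketch
    ' Hypothetical helper (not an iTextSharp API): guesses a file format
    ' from well-known leading "magic" bytes.
    Function GuessImageFormat(ByVal data As Byte()) As String
        If data Is Nothing OrElse data.Length < 4 Then Return "unknown"
        ' JPEG: FF D8 FF
        If data(0) = &HFF AndAlso data(1) = &HD8 AndAlso data(2) = &HFF Then Return "jpeg"
        ' PNG: 89 'P' 'N' 'G'
        If data(0) = &H89 AndAlso data(1) = &H50 AndAlso data(2) = &H4E AndAlso data(3) = &H47 Then Return "png"
        ' BMP: 'B' 'M'
        If data(0) = &H42 AndAlso data(1) = &H4D Then Return "bmp"
        ' TIFF, little-endian: 'I' 'I' 2A 00
        If data(0) = &H49 AndAlso data(1) = &H49 AndAlso data(2) = &H2A AndAlso data(3) = &H0 Then Return "tiff"
        ' TIFF, big-endian: 'M' 'M' 00 2A
        If data(0) = &H4D AndAlso data(1) = &H4D AndAlso data(2) = &H0 AndAlso data(3) = &H2A Then Return "tiff"
        Return "unknown"
    End Function
End Module
```

You could call this on each entry of the Images list after the loop to decide how, or whether, to save it.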

Specifically, you'll see in that method that there's a line that reads:

If filter = "/FlateDecode" Then


This is often written as a Select Case or switch statement to process the various values of filter. The method you are referencing only handles FlateDecode, which is pretty common, although there are actually 10 standard filters, including CCITTFaxDecode, JBIG2Decode and DCTDecode (PDF Spec 7.4 - Filters). You should modify the method to include a catch-all of some sort (a Case Else or default case) so that you are at least aware of images you aren't set up to process.
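
A minimal sketch of that Select Case shape, with a catch-all branch, might look like the following. DescribeFilter is a hypothetical helper used only for illustration. In the real method each branch would decode the stream rather than return a description, and the filter string is assumed to arrive in the same "/Name" form that the comparison above uses.

```vbnet
Imports System

Module FilterDispatchSketch
    ' Hypothetical helper: reports what kind of data a given /Filter value implies.
    Function DescribeFilter(ByVal filter As String) As String
        Select Case filter
            Case "/FlateDecode"
                Return "zlib-compressed raw pixel data"
            Case "/DCTDecode"
                Return "JPEG data"
            Case "/JPXDecode"
                Return "JPEG 2000 data"
            Case "/CCITTFaxDecode"
                Return "CCITT fax-encoded data"
            Case "/JBIG2Decode"
                Return "JBIG2 data"
            Case Else
                ' Catch-all so unsupported images are reported instead of silently skipped
                Return "unhandled filter: " & filter
        End Select
    End Function

    Sub Main()
        Console.WriteLine(DescribeFilter("/FlateDecode"))
        Console.WriteLine(DescribeFilter("/LZWDecode"))
    End Sub
End Module
```

Logging or collecting the Case Else results gives you a list of the images the method skipped and why.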

Additionally, within the /FlateDecode section you'll see this line:

Select Case Integer.Parse(bpp)


This reads an attribute of the image object that tells the renderer how many bits should be used for each color when parsing. Once again, you are the PDF renderer in this case, so it's up to you to figure out what to do. The code that you referenced only accounts for monochrome (1 bpp) and truecolor (24 bpp) images, but others should definitely be accounted for, especially 8 bpp.
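
One way the bit-depth switch could be extended is sketched below. It assumes the method maps the depth to a System.Drawing.Imaging.PixelFormat, which the excerpt here does not show, and PixelFormatForBpp is a hypothetical name. Treating 8 bpp as an indexed format is itself an assumption: the result only looks right if the bitmap is then given a suitable palette (for example a grayscale ramp).

```vbnet
Imports System
Imports System.Drawing.Imaging

Module BitDepthSketch
    ' Hypothetical helper: picks a GDI+ pixel format for a given bit depth.
    Function PixelFormatForBpp(ByVal bpp As String) As PixelFormat
        Select Case Integer.Parse(bpp)
            Case 1
                Return PixelFormat.Format1bppIndexed   ' monochrome
            Case 8
                Return PixelFormat.Format8bppIndexed   ' assumption: palette-based or grayscale
            Case 24
                Return PixelFormat.Format24bppRgb      ' truecolor
            Case Else
                ' Fail loudly so unsupported depths are noticed
                Throw New NotSupportedException("Unhandled bit depth: " & bpp)
        End Select
    End Function
End Module
```

Throwing in the Case Else branch is a deliberate choice here: it surfaces unsupported images immediately instead of producing corrupted output.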

To sum up, hopefully the code works for you as is, but don't be surprised if it complains a lot and/or misses images. Extracting images can be very frustrating at times. If you do run into problems, start a new question here referencing this one and hopefully we can help you more!
                                                                        
                                                        
            
            
              
                