pypdf Merging multiple pdf files into one pdf


If I have 1000+ PDF files that need to be merged into one PDF,

input = PdfFileReader()
output = PdfFileWriter()
filename0000 ----- filename 1000
    input = PdfFi


                      


      
      
        
5 Answers

        
                         				            
            
           
            
                              
                
              
              
                
                  滥情空心        
                
              
                            
                2020-12-01 03:24
              
            
            
                                                                       
I recently came across this exact same problem, so I dug into PyPDF2 to see what's going on, and how to resolve it.

Note: I am assuming that filename is a well-formed file path string. Assume the same for all of my code.

The Short Answer

Use the PdfFileMerger() class instead of the PdfFileWriter() class. I've tried to make the following resemble your code as closely as I could:

from PyPDF2 import PdfFileMerger, PdfFileReader

[...]

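merger = PdfFileMerger()
for filename in filenames:
    # open() replaces the Python 2-only file() builtin; it works on Python 2 and 3
    merger.append(PdfFileReader(open(filename, 'rb')))

# Write the merged result to a single output file
merger.write("document-output.pdf")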


The Long Answer

The way you're using PdfFileReader and PdfFileWriter keeps every file open, which eventually makes Python raise IOError: [Errno 24] Too many open files. More specifically, when you add a page to the PdfFileWriter, you are adding a reference to the page in the still-open PdfFileReader (hence the IO error you noted if you close the file). Python sees that the file is still referenced, so it does no garbage collection or automatic closing, even though you re-use the variable holding the file handle. The files remain open until PdfFileWriter no longer needs access to them, which is at output.write(outputStream) in your code.

To solve this, create copies of the content in memory and allow the file to be closed. While digging through the PyPDF2 code I noticed that the PdfFileMerger() class already has this functionality, so instead of re-inventing the wheel I opted to use it. I learned, though, that my initial look at PdfFileMerger wasn't close enough: it only creates copies under certain conditions.
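The "copy into memory, then close" idea can be sketched with the standard library alone. This is only an illustration under assumptions: load_into_memory is a hypothetical helper, not a PyPDF2 function, and io.BytesIO is used as the Python 3 counterpart of the StringIO seen in Python 2 code.

```python
import io
import os
import tempfile


def load_into_memory(path):
    """Return an in-memory copy of the file; the file handle is closed on return."""
    with open(path, "rb") as f:
        return io.BytesIO(f.read())


# Demo with a throwaway file standing in for a real PDF.
with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
    tmp.write(b"%PDF-1.4 placeholder bytes")

buffer = load_into_memory(tmp.name)
os.remove(tmp.name)  # the on-disk file is no longer needed once copied
print(buffer.getvalue())
```

The trade-off is memory: every copy stays in RAM until the output is written, which may matter with 1000+ large files.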

My initial attempts looked like the following and resulted in the same I/O problems:

merger = PdfFileMerger()
for filename in filenames:
    merger.append(filename)

merger.write(output_file_path)


Looking at the PyPDF2 source code, we see that append() requires fileobj to be passed, and then calls merge(), passing in its last page as the new file's position. merge() does the following with fileobj (before opening it with PdfFileReader(fileobj)):

    if type(fileobj) in (str, unicode):
        fileobj = file(fileobj, 'rb')
        my_file = True
    elif type(fileobj) == file:
        fileobj.seek(0)
        filecontent = fileobj.read()
        fileobj = StringIO(filecontent)
        my_file = True
    elif type(fileobj) == PdfFileReader:
        orig_tell = fileobj.stream.tell()   
        fileobj.stream.seek(0)
        filecontent = StringIO(fileobj.stream.read())
        fileobj.stream.seek(orig_tell)
        fileobj = filecontent
        my_file = True


We can see that append() does accept a string and, when given one, assumes it's a file path and creates a file object at that location. The end result is exactly what we're trying to avoid: a PdfFileReader() object holding a file open until the output is eventually written!

However, if we make either a file object or a PdfFileReader object (see Edit 2) from the path string before it gets passed into append(), it will automatically create a copy for us as a StringIO object, allowing Python to close the file.

I would recommend the simpler merger.append(file(filename, 'rb')), as others have reported that a PdfFileReader object may stay open in memory, even after calling writer.close().

Hope this helped!

EDIT: I assumed you were using PyPDF2, not PyPDF. If you aren't, I highly recommend switching, as PyPDF is no longer maintained and its author has given his official blessing to Phaseit to develop PyPDF2.

If for some reason you cannot switch to PyPDF2 (licensing, system restrictions, etc.), then PdfFileMerger won't be available to you. In that situation you can re-use the code from PyPDF2's merge function (provided above) to create a copy of the file as a StringIO object, and use that in your code in place of the file object.

EDIT 2: Previous recommendation of using merger.append(PdfFileReader(file(filename, 'rb'))) changed based on comments (Thanks @Agostino).
                                                                        
                                                        
            
            
              
                
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  梦毁少年i        
                
              
                            
                2020-12-01 03:27
              
            
            
                                                                       
The pdfrw package reads each file all in one go, so it will not suffer from the problem of too many open files. Here is an example concatenation script.

The relevant part -- assumes inputs is a list of input filenames, and outfn is an output file name:

from pdfrw import PdfReader, PdfWriter

writer = PdfWriter()
for inpfn in inputs:
    writer.addpages(PdfReader(inpfn).pages)
writer.write(outfn)


Disclaimer:  I am the primary pdfrw author.
                                                                        
                                                        
            
            
              
                
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  轮回少年        
                
              
                            
                2020-12-01 03:28
              
            
            
                                                                       
I wrote this code to help with the answer:

import sys
import os
import PyPDF2

merger = PyPDF2.PdfFileMerger()

# Get the working directory and the PDF file names from the command line

path = sys.argv[1]
pdfs = sys.argv[2:]
os.chdir(path)


# Iterate over the documents
for pdf in pdfs:
    # Skip missing files so they are not reported as merged
    if not os.path.exists(pdf):
        print(f"problem with file {pdf}")
        continue
    try:
        reader = PyPDF2.PdfFileReader(open(pdf, 'rb'))
        merger.append(reader)
    except Exception as exc:
        # Report the failing file and carry on with the rest
        print(f"can't merge {pdf}: {exc}")
    else:
        print(f"{pdf} merged")

merger.write("Merged_doc.pdf")


In this, I have used PyPDF2.PdfFileMerger and PyPDF2.PdfFileReader instead of explicitly converting the file name to a file object.
                                                                        
                                                        
            
            
              
                
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  别跟我提以往        
                
              
                            
                2020-12-01 03:43
              
            
            
                                                                       
The problem is that you are only allowed to have a certain number of files open at any given time. There are ways to change this (http://docs.python.org/3/library/resource.html#resource.getrlimit), but I don't think you need this.
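For reference, the per-process limit can be inspected from Python. This is a minimal sketch assuming a Unix-like system (the resource module is not available on Windows); open_file_limits is a hypothetical helper name used only for illustration.

```python
import resource  # Unix-only standard library module


def open_file_limits():
    """Return the (soft, hard) limits on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


soft, hard = open_file_limits()
# The soft limit is the one that triggers "Too many open files" when exceeded.
print(f"soft limit: {soft}, hard limit: {hard}")
```

If the soft limit is not comfortably above the number of input files, keeping them all open at once will fail no matter which PDF library is used.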

What you could try is closing the files in the for loop:

output = PdfFileWriter()
for filename in filenames:
    f = open(filename, 'rb')
    reader = PdfFileReader(f)  # PdfFileReader needs an open stream, so create it here
    # Some code
    f.close()  # release the file handle before opening the next file

                                                                        
                                                        
            
            
              
                
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  萌比男神i        
                
              
                            
                2020-12-01 03:47
              
            
            
                                                                       
It may be just what it says: you are opening too many files.
You can explicitly use f = file(filename) ... f.close() in the loop, or use the with statement, so that each opened file is properly closed.
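A small sketch of the with-statement variant, using only the standard library. The file names and contents are placeholders created for the demo, and the handles list exists only to show that each file is closed once its block ends.

```python
import os
import tempfile

# Create a few throwaway files to stand in for the real inputs.
tmpdir = tempfile.mkdtemp()
filenames = []
for i in range(3):
    name = os.path.join(tmpdir, f"filename{i:04d}.pdf")
    with open(name, "wb") as out:
        out.write(b"placeholder")
    filenames.append(name)

handles = []
for filename in filenames:
    with open(filename, "rb") as f:
        data = f.read()  # work with the file only inside the block
    handles.append(f)    # kept only to inspect the handle state below

print(all(h.closed for h in handles))  # True
```

This only helps if everything needed from the file is read inside the block; as the first answer notes, a reader that still references the file will fail once it has been closed.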
                                                                        
                                                        
            
            
              
                
                        
                    
                
          
          	          
                             
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         