问题
I'm processing multiple pdf files using PyPDF2 but my script hangs somewhere. All I can see in my console is some "startxref on same line as offset" which I'm correct is a warning so by right it should still go to the finally block and return an empty string.
Am I doing something wrong?
import PyPDF2
import sys
import os
def decode_pdf(src_filename):
out_str=""
try:
f = open(str(src_filename), "rb")
read_pdf = PyPDF2.PdfFileReader(f)
number_of_pages = read_pdf.getNumPages()
for i in range(0,number_of_pages):
page = read_pdf.getPage(i)
out_str = out_str + " " + page.extractText()
out_str = ''.join(out_str.splitlines())
f.close()
except:
print("Exception on pdf")
print(sys.exc_info())
out_str = ""
finally:
return out_str
回答1:
I was facing this issue too and couldn't solve it using PyPDF2. I solved mine with pdfminer using the example from here
Copying the relevant code here below
from cStringIO import StringIO
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
def convert(fname, pages=None):
if not pages:
pagenums = set()
else:
pagenums = set(pages)
output = StringIO()
manager = PDFResourceManager()
converter = TextConverter(manager, output, laparams=LAParams())
interpreter = PDFPageInterpreter(manager, converter)
infile = file(fname, 'rb')
for page in PDFPage.get_pages(infile, pagenums):
interpreter.process_page(page)
infile.close()
converter.close()
text = output.getvalue()
output.close
return text
call the function convert() as below
convert('myfile.pdf', pages=[5,7])
来源:https://stackoverflow.com/questions/45575468/pypdf2-hangs-on-processing