PyPDF2 ignores content, gets watermark only

Posted by 时间秒杀一切 on 2019-12-11 07:29:33

Question


I have thousands of PDF files like this one.

I'm trying to use PyPDF2 to convert them to plain text (code is below). But PyPDF2 apparently only "sees" the watermarks, not the content itself. What could I do here?

import os
import PyPDF2

path_to_pdfs = '/path/to/pdf/files/'
for filename in os.listdir(path_to_pdfs):
    # Match on the extension only, not on '.pdf' anywhere in the name
    if filename.lower().endswith('.pdf'):
        with open(os.path.join(path_to_pdfs, filename), mode='rb') as f:
            pdf_reader = PyPDF2.PdfFileReader(f)
            # Collect the text of every page (PyPDF2 1.26.0 API)
            pages = []
            for page_num in range(pdf_reader.numPages):
                page_obj = pdf_reader.getPage(page_num)
                pages.append(page_obj.extractText())
            txt = '\n'.join(pages)
            print(txt)

I'm using Python 3.5.1 and PyPDF2 1.26.0 on macOS 10.13.14.


Answer 1:


Sometimes pdfminer3k gives better results. Please check out "How to read pdf file using pdfminer3k?"

I've tested the following code and it extracts text. However, the extraction is not 100% accurate...

from pdfminer.pdfparser import PDFParser, PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTTextBox, LTTextLine

# Layout analysis parameters
laparams = LAParams()
laparams.char_margin = 1.0
laparams.word_margin = 1.0

rsrcmgr = PDFResourceManager()
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
interpreter = PDFPageInterpreter(rsrcmgr, device)
extracted_text = ''

# Open the example file
with open('Decisao_10166720039201098.pdf', 'rb') as fp:
    # Link the parser and the document to each other
    parser = PDFParser(fp)
    doc = PDFDocument()
    parser.set_document(doc)
    doc.set_parser(parser)
    doc.initialize('')  # empty password

    for page in doc.get_pages():
        interpreter.process_page(page)
        layout = device.get_result()
        # Keep only the text-bearing layout objects
        for lt_obj in layout:
            if isinstance(lt_obj, (LTTextBox, LTTextLine)):
                extracted_text += lt_obj.get_text()

print(extracted_text)
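
The code above handles a single file, while the question is about thousands of them. Below is a minimal sketch of a batch loop around any per-file extractor. The names `convert_folder` and `extract_text` are hypothetical and not part of PyPDF2 or pdfminer3k. It assumes you have wrapped the extraction code in a function that takes a file path and returns a string.

```python
import os


def convert_folder(pdf_dir, extract_text):
    """Run extract_text(path) on every .pdf in pdf_dir and save the result as .txt.

    extract_text is a hypothetical callable: it takes a PDF path and returns str.
    Returns a dict mapping each PDF filename to None on success, or to the
    exception that was raised for that file.
    """
    results = {}
    for filename in sorted(os.listdir(pdf_dir)):
        if not filename.lower().endswith('.pdf'):
            continue
        pdf_path = os.path.join(pdf_dir, filename)
        txt_path = os.path.splitext(pdf_path)[0] + '.txt'
        try:
            text = extract_text(pdf_path)
            with open(txt_path, 'w', encoding='utf-8') as out:
                out.write(text)
            results[filename] = None
        except Exception as exc:
            # Record the failure and keep going, so that one bad PDF
            # does not stop the whole batch
            results[filename] = exc
    return results
```

Collecting the failures rather than raising on the first one seems a reasonable choice here, since the extraction is not fully reliable and the failed files can be retried with a different extractor afterwards.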


Source: https://stackoverflow.com/questions/50858615/pypdf2-ignores-content-gets-watermark-only
