Question
I am trying to parse a PDF book, but I only need the main body text, WITHOUT headers, footers, or footnotes.
I looked through the pdfminer documentation but have not found a way to do this. Here is the code I use to get the text:
import io

from pdfminer.converter import TextConverter
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfpage import PDFPage

# pdfname: path to the PDF file, defined elsewhere
with open(pdfname, 'rb') as fh:
    for page in PDFPage.get_pages(fh, caching=True, check_extractable=True):
        # A fresh buffer and converter per page, so `text` holds one page at a time
        resource_manager = PDFResourceManager()
        fake_file_handle = io.StringIO()
        converter = TextConverter(resource_manager, fake_file_handle)
        page_interpreter = PDFPageInterpreter(resource_manager, converter)
        page_interpreter.process_page(page)
        text = fake_file_handle.getvalue()
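One direction that might work is to use layout analysis and keep only the text boxes that fall between a header band and a footer band on each page. The following is a rough sketch, not a confirmed solution. It assumes the pdfminer.six fork (for `extract_pages` and `LTTextContainer`); the names `is_in_body` and `extract_body_text` and the `header_frac` / `footer_frac` values are hypothetical and would need tuning for a given book. Footnotes sit above the page footer, so a position band alone may not remove them unless the footer band is widened.

```python
def is_in_body(y0, y1, page_height, header_frac=0.08, footer_frac=0.10):
    """Return True if a box spanning y0..y1 lies between the footer and header bands.

    PDF coordinates start at the bottom-left corner, so a larger y is higher on the page.
    The fractions are illustrative guesses, not values taken from pdfminer.
    """
    footer_top = page_height * footer_frac
    header_bottom = page_height * (1.0 - header_frac)
    return y0 >= footer_top and y1 <= header_bottom


def extract_body_text(pdf_path, header_frac=0.08, footer_frac=0.10):
    """Yield one string per page, skipping text boxes in the header/footer bands."""
    # Assumption: pdfminer.six is installed. Imported here so that the
    # geometry helper above can be used and tested without it.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    for page_layout in extract_pages(pdf_path):
        kept = [
            element.get_text()
            for element in page_layout
            if isinstance(element, LTTextContainer)
            and is_in_body(element.y0, element.y1, page_layout.height,
                           header_frac, footer_frac)
        ]
        yield "".join(kept)
```

Usage would look like `for page_text in extract_body_text(pdfname): ...`. Working with fractions of the page height rather than fixed point values keeps the same settings usable across different page sizes.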
Source: https://stackoverflow.com/questions/56008028/extracting-text-from-pdf-pages-certain-areas