Extracting text from certain areas of a PDF page?

好久不见. Submitted on 2019-12-13 03:10:19

Question


I am trying to parse a PDF book, but I only need the main body, WITHOUT footers, headers or footnotes.

I looked through the pdfminer documentation but haven't succeeded yet. Here is the code I use to get the text:

import io

from pdfminer.converter import TextConverter
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfpage import PDFPage

# pdfname is the path to the PDF file
with open(pdfname, 'rb') as fh:
    for page in PDFPage.get_pages(fh, caching=True, check_extractable=True):
        resource_manager = PDFResourceManager()
        fake_file_handle = io.StringIO()  # in-memory buffer for this page
        converter = TextConverter(resource_manager, fake_file_handle)
        page_interpreter = PDFPageInterpreter(resource_manager, converter)
        page_interpreter.process_page(page)
        # Full text of the current page, headers and footers included
        text = fake_file_handle.getvalue()
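
One possible direction, sketched here under assumptions rather than taken from the original post: instead of TextConverter, pdfminer's layout analysis (PDFPageAggregator with LAParams) returns text boxes together with their coordinates, so boxes that fall into a header or footer band can be skipped. The helper names in_body and extract_body_text and the 60-point margins are hypothetical and would need tuning for the actual book.

```python
def in_body(bbox, page_height, top_margin=60.0, bottom_margin=60.0):
    """Return True if a box lies fully between the header and footer bands.

    bbox is (x0, y0, x1, y1) in PDF points, origin at the bottom-left corner.
    The margin sizes are assumptions; tune them for the actual book.
    """
    _, y0, _, y1 = bbox
    return y0 >= bottom_margin and y1 <= page_height - top_margin


def extract_body_text(pdf_path, top_margin=60.0, bottom_margin=60.0):
    """Yield the text of each page, keeping only boxes inside the body band."""
    # Imported here so in_body() can be used without pdfminer installed.
    from pdfminer.converter import PDFPageAggregator
    from pdfminer.layout import LAParams, LTTextBox
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    resource_manager = PDFResourceManager()
    device = PDFPageAggregator(resource_manager, laparams=LAParams())
    interpreter = PDFPageInterpreter(resource_manager, device)
    with open(pdf_path, 'rb') as fh:
        for page in PDFPage.get_pages(fh, caching=True, check_extractable=True):
            interpreter.process_page(page)
            layout = device.get_result()  # LTPage with positioned elements
            boxes = [
                el for el in layout
                if isinstance(el, LTTextBox)
                and in_body(el.bbox, layout.height, top_margin, bottom_margin)
            ]
            # Sort top to bottom (a larger y1 is higher on the page).
            boxes.sort(key=lambda el: -el.y1)
            yield ''.join(box.get_text() for box in boxes)
```

Calling `for text in extract_body_text(pdfname): ...` would then give one string per page. A fixed band like this probably handles running headers and page numbers, but footnotes vary in height, so they would likely need an extra rule, for example comparing character font sizes inside each box.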

Source: https://stackoverflow.com/questions/56008028/extracting-text-from-pdf-pages-certain-areas
