Is there any way to extract header and footer and title page of a PDF document?

梦想的初衷 提交于 2019-12-13 03:01:20

问题


I want to know if there is any package to detect and extrac the header and footer or title page from PDF document ? I am new in text mining using python and I want to know for example pdfminer.layout could help to find any text block in pdfs?


回答1:


Apache Tika also does metadata extraction. You can also extract names, title/multiple-titles, date, number of pages, modified dates, and many more.

import tika
from tika import parser

filename = "your file name here"
parsedPDF = parser.from_file(file_name)
print(parsedPDF['content'])
print(parsedPDF['metadata']) # its in a dictionary format. 



回答2:


I'm using this utility function to extract all text elements from PDF:

from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTTextBox, LTTextLine
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage, PDFTextExtractionNotAllowed
from pdfminer.pdfparser import PDFParser


def pdf2text(stream):
    parser = PDFParser(stream)
    document = PDFDocument(parser)
    if not document.is_extractable:
        raise PDFTextExtractionNotAllowed

    resmgr = PDFResourceManager()
    laparams = LAParams()
    device = PDFPageAggregator(resmgr, laparams=laparams)
    interpreter = PDFPageInterpreter(resmgr, device)
    for page in PDFPage.create_pages(document):
        interpreter.process_page(page)
        for obj in device.get_result():
            if isinstance(obj, (LTTextBox, LTTextLine)):
                yield obj.get_text()

stream parameter is a file-like object (e.g. file opened for reading or an instance of io.BytesIO or such).

This example basically follows official example.



来源:https://stackoverflow.com/questions/48306295/is-there-any-way-to-extract-header-and-footer-and-title-page-of-a-pdf-document

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!