Is there any way to extract header and footer and title page of a PDF document?

Question

I want to know whether there is any package that can detect and extract the header, footer, or title page from a PDF document. I am new to text mining with Python, and I would like to know whether, for example, pdfminer.layout could help to find the text blocks in a PDF.

Answer 1:

Apache Tika also does metadata extraction. Besides the text content, it can extract names, the title (or multiple titles), dates, the number of pages, modification dates, and many more fields.

import tika
from tika import parser

filename = "your file name here"
parsedPDF = parser.from_file(filename)
print(parsedPDF['content'])   # the extracted text
print(parsedPDF['metadata'])  # the metadata, as a dictionary
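
To pull a single field such as the title out of that metadata dictionary, a small lookup helper can be sketched as follows. The key names in TITLE_KEYS are assumptions: the keys Tika emits vary with the Tika version and the file, so print the dictionary first and adjust them. The helper name first_metadata_value is hypothetical and not part of the tika package.

```python
# Assumed candidate keys for the document title; check your own
# parsedPDF['metadata'] output and adjust this tuple accordingly.
TITLE_KEYS = ("dc:title", "pdf:docinfo:title", "title")


def first_metadata_value(metadata, candidate_keys):
    """Return the first non-empty value stored under any candidate key."""
    for key in candidate_keys:
        value = metadata.get(key)
        if isinstance(value, list):
            # A field that occurs several times may come back as a list;
            # take its first entry.
            value = value[0] if value else None
        if value:
            return value
    return None


# Stand-in for parsedPDF['metadata'], used here only for illustration.
sample_metadata = {"Content-Type": "application/pdf", "dc:title": ["A", "B"]}
print(first_metadata_value(sample_metadata, TITLE_KEYS))
```

Note that this only reads the title recorded in the file's metadata, which may be missing or may differ from the text printed on the title page.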

Answer 2:

I'm using this utility function to extract all text elements from a PDF:

from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTTextBox, LTTextLine
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage, PDFTextExtractionNotAllowed
from pdfminer.pdfparser import PDFParser


def pdf2text(stream):
    """Yield the text of every text box or text line, page by page."""
    parser = PDFParser(stream)
    document = PDFDocument(parser)
    if not document.is_extractable:
        raise PDFTextExtractionNotAllowed

    resmgr = PDFResourceManager()
    laparams = LAParams()  # default layout-analysis parameters
    device = PDFPageAggregator(resmgr, laparams=laparams)
    interpreter = PDFPageInterpreter(resmgr, device)
    for page in PDFPage.create_pages(document):
        interpreter.process_page(page)
        # get_result() returns the analysed layout of the page just processed
        for obj in device.get_result():
            if isinstance(obj, (LTTextBox, LTTextLine)):
                yield obj.get_text()

The stream parameter is a file-like object (e.g. a file opened for reading in binary mode, or an instance of io.BytesIO).

This example basically follows the official example.
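
The function above returns all text and does not separate headers or footers. One possible heuristic, sketched here under stated assumptions, is to classify each text block by its vertical position on the page. TextBlock, classify_blocks, and the 10% margin are hypothetical names and values chosen for illustration, not part of pdfminer. PDF coordinates have their origin at the bottom-left corner, so a large y value means "near the top of the page".

```python
from typing import NamedTuple


class TextBlock(NamedTuple):
    """Hypothetical stand-in for a layout object: text plus vertical extent."""
    text: str
    y0: float  # bottom edge in PDF points (origin at the bottom-left corner)
    y1: float  # top edge in PDF points


def classify_blocks(blocks, page_height, margin=0.1):
    """Split blocks into header, body and footer by vertical position.

    margin is an assumed fraction of the page height that counts as the
    header band (top) and the footer band (bottom); tune it per document.
    """
    header_limit = page_height * (1 - margin)
    footer_limit = page_height * margin
    result = {"header": [], "body": [], "footer": []}
    for block in blocks:
        if block.y0 >= header_limit:
            # The whole block lies inside the top band.
            result["header"].append(block.text)
        elif block.y1 <= footer_limit:
            # The whole block lies inside the bottom band.
            result["footer"].append(block.text)
        else:
            result["body"].append(block.text)
    return result


# Illustrative values for a US Letter page (792 points high).
blocks = [
    TextBlock("Report title", 760, 780),
    TextBlock("Body text", 300, 600),
    TextBlock("Page 1", 20, 35),
]
print(classify_blocks(blocks, page_height=792))
```

With pdfminer, the layout object returned by device.get_result() has a height attribute, and each LTTextBox has y0 and y1 attributes, so the blocks could be built from those instead of the hard-coded values. A purely positional rule will misclassify documents with unusual margins; checking which lines repeat across pages is a common complement.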

Source: https://stackoverflow.com/questions/48306295/is-there-any-way-to-extract-header-and-footer-and-title-page-of-a-pdf-document

Tags

python

pdf

text-mining