Question
Is there any package that can detect and extract the header, footer, or title page from a PDF document? I am new to text mining with Python, and I want to know whether, for example, pdfminer.layout could help find the text blocks in a PDF.
Answer 1:
Apache Tika extracts metadata as well as text. You can get names, one or more titles, dates, the number of pages, modification dates, and many other fields.
import tika
from tika import parser
filename = "your file name here"
parsedPDF = parser.from_file(filename)  # the variable is filename, not file_name
print(parsedPDF['content'])   # extracted text
print(parsedPDF['metadata'])  # metadata, returned as a dictionary
Answer 2:
I use this utility function to extract all text elements from a PDF:
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTTextBox, LTTextLine
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage, PDFTextExtractionNotAllowed
from pdfminer.pdfparser import PDFParser


def pdf2text(stream):
    """Yield the text of every text box or text line, page by page."""
    parser = PDFParser(stream)
    document = PDFDocument(parser)
    if not document.is_extractable:
        raise PDFTextExtractionNotAllowed
    resmgr = PDFResourceManager()
    laparams = LAParams()
    # The aggregator collects the layout objects of each processed page.
    device = PDFPageAggregator(resmgr, laparams=laparams)
    interpreter = PDFPageInterpreter(resmgr, device)
    for page in PDFPage.create_pages(document):
        interpreter.process_page(page)
        # get_result() returns the layout of the page just processed.
        for obj in device.get_result():
            if isinstance(obj, (LTTextBox, LTTextLine)):
                yield obj.get_text()
The stream parameter is a file-like object (e.g. a file opened for reading in binary mode, or an instance of io.BytesIO or similar).
This example basically follows the official example.
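To go from text blocks to headers and footers, one common heuristic is to look at where each block sits on the page. In pdfminer, layout objects such as LTTextBox carry a bbox of (x0, y0, x1, y1), with the origin at the bottom-left corner of the page. The following is only a minimal sketch of that idea, assuming you have already collected the text and vertical extent of each block along with the page height. TextBlock, classify_blocks, and the 10% margin are hypothetical names and values chosen for illustration, not part of pdfminer.

```python
from typing import Iterable, List, NamedTuple, Tuple


class TextBlock(NamedTuple):
    """Hypothetical container: a block's text and its vertical extent in points."""
    text: str
    y0: float  # bottom edge (PDF coordinates start at the bottom-left corner)
    y1: float  # top edge


def classify_blocks(
    blocks: Iterable[TextBlock],
    page_height: float,
    margin: float = 0.1,  # assumed fraction of the page treated as header/footer zone
) -> List[Tuple[str, str]]:
    """Label each block 'header', 'footer', or 'body' by its vertical position."""
    header_limit = page_height * (1 - margin)
    footer_limit = page_height * margin
    result = []
    for block in blocks:
        if block.y0 >= header_limit:
            # The whole block lies inside the top margin zone.
            label = "header"
        elif block.y1 <= footer_limit:
            # The whole block lies inside the bottom margin zone.
            label = "footer"
        else:
            label = "body"
        result.append((label, block.text.strip()))
    return result


# Made-up coordinates for a US Letter page (792 points tall).
page = [
    TextBlock("Annual Report 2017\n", 760.0, 775.0),
    TextBlock("Main paragraph text.\n", 300.0, 700.0),
    TextBlock("Page 3\n", 20.0, 32.0),
]
labels = classify_blocks(page, page_height=792.0)
for label, text in labels:
    print(label, "|", text)
```

A fixed margin will misclassify pages with unusual layouts, so a more robust variant would also check whether the same text recurs at the same position across many pages. A title page usually needs a separate rule, such as treating the first page on its own.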
Source: https://stackoverflow.com/questions/48306295/is-there-any-way-to-extract-header-and-footer-and-title-page-of-a-pdf-document