How to extract only specific text from PDF file using python

问题

How to extract some of the specific text only from PDF files using python and store the output data into particular columns of Excel.

Here is the sample input PDF file (File.pdf)

Link to the full PDF file File.pdf

We need to extract the value of Invoice Number, Due Date and Total Due from the whole PDF file.

Script i have used so far:

    from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser

output_string = StringIO()
with open('file.pdf', 'rb') as in_file:
    parser = PDFParser(in_file)
    doc = PDFDocument(parser)
    rsrcmgr = PDFResourceManager()
    device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.create_pages(doc):
        interpreter.process_page(page)

print(output_string.getvalue())

But not getting the specific output value from the PDF file .

回答1:

If you want to find the data in in your way (pdfminer), you can search for a pattern to extract the data like the following (new is the regex at the end, based on your given data):

from io import StringIO
import re

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser

output_string = StringIO()
with open('testfile.pdf', 'rb') as in_file:
    parser = PDFParser(in_file)
    doc = PDFDocument(parser)
    rsrcmgr = PDFResourceManager()
    device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.create_pages(doc):
        interpreter.process_page(page)

finding = re.search(r"INV-\d+\n\d+\n.+\n.+\n\$\d+\.\d+", output_string.getvalue())

invoice_no, order_no, _, due_date, total_due = finding.group(0).split("\n")

print(invoice_no, order_no, due_date, total_due)

If you want to store the data in excel, you may have to be more specific (or open a new question) or look on these pages:

Writing to an Excel spreadsheet

https://www.geeksforgeeks.org/writing-excel-sheet-using-python/

https://xlsxwriter.readthedocs.io/

PS: the other answer looks like a good solution, you only have to filter the data

EDIT: Second solution. Here I use another package PyPDF2, because there you get the data in an other order (maybe this is possible with PDFMiner, too). If the text before the values are always the same, you can find the data like this:

import re
import PyPDF2

def parse_pdf() -> list:
    with open("testfile.pdf", "rb") as file:
        fr = PyPDF2.PdfFileReader(file)
        data = fr.getPage(0).extractText()

    regex_invoice_no = re.compile(r"Invoice Number\s*(INV-\d+)")
    regex_order_no = re.compile(r"Order Number(\d+)")
    regex_invoice_date = re.compile(r"Invoice Date(\S+ \d{1,2}, \d{4})")
    regex_due_date = re.compile(r"Due Date(\S+ \d{1,2}, \d{4})")
    regex_total_due = re.compile(r"Total Due(\$\d+\.\d{1,2})")

    invoice_no = re.search(regex_invoice_no, data).group(1)
    order_no = re.search(regex_order_no, data).group(1)
    invoice_date = re.search(regex_invoice_date, data).group(1)
    due_date = re.search(regex_due_date, data).group(1)
    total_due = re.search(regex_total_due, data).group(1)

    return [invoice_no, due_date, total_due]


if __name__ == '__main__':
    print(parse_pdf())

Maybe you have to change the regexes, because they are only based on the given example. The regexes are only working if they find the regex, so you have to work with try: except per regex ;)
If this does not answer your question, you have to provide more information/example pdfs.

回答2:

You can extract data using tabula and using that data you can create an excel file using python:

df = ("./Downloads/folder/myfile.pdf")
output = "./Downloads/folder/test.csv"
tabula.convert_into(df, output, output_format="csv", stream=True)

excel file creation: https://www.geeksforgeeks.org/python-create-and-write-on-excel-file-using-xlsxwriter-module/

来源：https://stackoverflow.com/questions/64142307/how-to-extract-only-specific-text-from-pdf-file-using-python

标签

python

dataframe

tesseract

python-tesseract