Question
I am trying to follow this blog post to extract text from an invoice PDF file. I need to extract specific fields from the invoice.
https://kaijento.github.io/2017/03/27/pdf-scraping-gwinnetttaxcommissioner.publicaccessnow.com/#pdftotext
I have tried pdfminer and textract, but both produce jumbled text, which makes it difficult to pull out the fields afterwards.
I came across the Poppler package, which can be downloaded here:
https://poppler.freedesktop.org/releases.html
It looks like it's a .tar file and not a Python package.
I am not sure how to unpack this .tar file and use the package from Python.
Any suggestions on how to install this on my Mac and then use it programmatically in Python to run a batch of PDF files through it and extract data?
Answer 1:
Use subprocess to call the pdftotext program from the Xpdf tools. You can find MS Windows versions of those tools at https://www.xpdfreader.com/download.html. Get the "Xpdf command line tools".
I use it like this (Python 3.7):
    import subprocess as sp

    def pdftotext(path):
        """
        Generate a text rendering of a PDF file and return it as a single string.
        """
        # '-layout' keeps the physical page layout; the trailing '-' writes to stdout.
        args = ['pdftotext', '-layout', path, '-']
        cp = sp.run(
            args, stdout=sp.PIPE, stderr=sp.DEVNULL,
            check=True, text=True
        )
        return cp.stdout
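To run a whole folder of PDFs through this, something along the following lines might work. It is only a sketch: it assumes a pdftotext binary is already on your PATH (from the Xpdf tools or from a Poppler install), and the names pdftotext_available, extract_text and extract_folder are made up for this example, not part of any library.

```python
import shutil
import subprocess as sp
from pathlib import Path


def pdftotext_available(exe="pdftotext"):
    """Return True if the given executable can be found on PATH."""
    return shutil.which(exe) is not None


def extract_text(path):
    """Run `pdftotext -layout` on one file and return its text as a string."""
    cp = sp.run(
        ["pdftotext", "-layout", str(path), "-"],
        stdout=sp.PIPE, stderr=sp.DEVNULL, check=True, text=True,
    )
    return cp.stdout


def extract_folder(folder, extractor=extract_text):
    """Map each *.pdf file name in `folder` to its extracted text.

    `extractor` is a hypothetical hook so another function can be
    swapped in (for example, for testing without pdftotext installed).
    """
    results = {}
    for pdf in sorted(Path(folder).glob("*.pdf")):
        results[pdf.name] = extractor(pdf)
    return results
```

You would call it as extract_folder("invoices"), where "invoices" is a placeholder folder name, after checking pdftotext_available(). Field extraction (regexes and so on) would then work on the returned strings; because of check=True, a file that pdftotext cannot read raises subprocess.CalledProcessError instead of being skipped silently.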
Answer 2:
You can try the Poppler bindings for Python here: https://pypi.org/project/python-poppler-qt5/
Source: https://stackoverflow.com/questions/61392015/installing-poppler-for-pdf-text-extraction