Installing Poppler for PDF text extraction

Submitted by 馋奶兔 on 2020-12-13 03:47:30

Question


I am following the blog post below to extract text from an invoice PDF file. I need to extract specific fields of the invoice, not just the raw text.

https://kaijento.github.io/2017/03/27/pdf-scraping-gwinnetttaxcommissioner.publicaccessnow.com/#pdftotext

I have tried pdfminer and textract, but both return the text jumbled, which makes it difficult to pull individual fields out afterwards.

I came across the Poppler package, which can be downloaded here:

https://poppler.freedesktop.org/releases.html

It looks like it is a .tar file and not a Python package.

I am not sure how to unpack this .tar file and use the package from Python.

Any suggestions on how to install this on my Mac and then use it programmatically in Python to run a batch of PDF files through it and extract data?


Answer 1:


Use subprocess to call the pdftotext program from the Xpdf tools. You can find MS Windows versions of those tools at https://www.xpdfreader.com/download.html. Get the "Xpdf command line tools".
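
Before calling the tool from Python, it can help to confirm that a pdftotext executable is actually on the PATH. Below is a minimal sketch using only the standard library. The helper name find_pdftotext is hypothetical, and the Homebrew hint in the comment is an assumption about a typical Mac setup, not something the tools require.

```python
import shutil


def find_pdftotext(name="pdftotext"):
    """Return the full path of the executable, or raise if it is not on PATH."""
    # shutil.which searches PATH the same way the shell would.
    exe = shutil.which(name)
    if exe is None:
        # Assumption: on a Mac, Poppler's pdftotext is commonly installed
        # with Homebrew via `brew install poppler`.
        raise FileNotFoundError(
            f"{name!r} was not found on PATH; install the Xpdf tools or Poppler first"
        )
    return exe
```

If this raises FileNotFoundError, the subprocess call would fail as well, so it is a cheap check to run once before processing a batch of files.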

I use it like this (Python 3.7):

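import subprocess as sp

def pdftotext(path):
    """
    Generate a text rendering of a PDF file and return it as a single string.
    Call .splitlines() on the result to get a list of lines.
    """
    # '-layout' keeps the physical page layout; '-' writes the text to stdout.
    args = ['pdftotext', '-layout', path, '-']
    # check=True raises CalledProcessError on failure; text=True needs Python 3.7+.
    cp = sp.run(
      args, stdout=sp.PIPE, stderr=sp.DEVNULL,
      check=True, text=True
    )
    return cp.stdout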
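
To run a folder of PDFs through the function above and pick out specific invoice fields, something along these lines might work. This is only a sketch: the field names, the regular expressions in FIELD_PATTERNS, and the helpers extract_fields and process_folder are all hypothetical and would need adjusting to the wording on the actual invoices.

```python
import re
import subprocess as sp
from pathlib import Path

# Hypothetical patterns; adapt them to the labels printed on your invoices.
FIELD_PATTERNS = {
    "invoice_number": re.compile(r"Invoice\s+(?:No\.?|Number)\s*:?\s*(\S+)", re.I),
    # \b keeps "Total" from matching inside "Subtotal".
    "total": re.compile(r"\bTotal\s*:?\s*\$?\s*([\d,]+\.\d{2})", re.I),
}


def pdftotext(path):
    """Same call as above: return the layout-preserving text of one PDF."""
    args = ['pdftotext', '-layout', str(path), '-']
    cp = sp.run(args, stdout=sp.PIPE, stderr=sp.DEVNULL, check=True, text=True)
    return cp.stdout


def extract_fields(text):
    """Return a dict mapping each field name to its first match, or None."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        fields[name] = match.group(1) if match else None
    return fields


def process_folder(folder):
    """Extract fields from every *.pdf directly inside the given folder."""
    results = {}
    for pdf in sorted(Path(folder).glob("*.pdf")):
        results[pdf.name] = extract_fields(pdftotext(pdf))
    return results
```

Because -layout keeps labels and values on the same line, simple line-oriented patterns like these tend to be enough. Fields that come back as None usually mean the pattern needs tuning for that invoice format.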



Answer 2:


You can try python-poppler-qt5, a Python binding for Poppler: https://pypi.org/project/python-poppler-qt5/



Source: https://stackoverflow.com/questions/61392015/installing-poppler-for-pdf-text-extraction
