Read PDF in Python and convert to text in PDF

Submitted by 荒凉一梦 on 2019-12-11 03:26:22

Question


I have used this code to convert a PDF to text:

import os

input1 = '//Home//Sai Krishna Dubagunta.pdf'
output = '//Home//Me.txt'
os.system(("pdftotext %s %s") %( input1, output))

I have created the Home directory and pasted the source file in it.

The output I get is

1

and no .txt file was created. Where is the problem?


Answer 1:


Your expression

("pdftotext %s %s") %( input1, output)

will translate to

pdftotext //Home//Sai Krishna Dubagunta.pdf //Home//Me.txt

The shell splits the command on spaces, which means that the first parameter passed to pdftotext is //Home//Sai, and the second parameter is Krishna. That obviously won't work.
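
The split can be checked from Python itself. The sketch below uses the standard shlex module, which tokenizes a string roughly the way a POSIX shell does; the variable names unquoted and unquoted_args are only for illustration.

```python
import shlex

input1 = '//Home//Sai Krishna Dubagunta.pdf'
output = '//Home//Me.txt'

# Same interpolation as in the question, without any quoting.
unquoted = "pdftotext %s %s" % (input1, output)

# Split the command line the way a POSIX shell would.
unquoted_args = shlex.split(unquoted)
print(unquoted_args)
# ['pdftotext', '//Home//Sai', 'Krishna', 'Dubagunta.pdf', '//Home//Me.txt']
```

The single file name ends up as three separate arguments, so pdftotext never sees the real path.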

Enclose the parameters in quotes:

os.system("pdftotext '%s' '%s'" % (input1, output))
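
Quoting inside the string still breaks if a file name itself contains a quote character. A possible alternative, sketched here under the assumption that the pdftotext command-line tool is installed and on the PATH, is to skip the shell and pass the arguments as a list to subprocess.run. The helper names build_command and convert are made up for this example.

```python
import subprocess


def build_command(pdf_path, txt_path):
    # Each list element reaches pdftotext as exactly one argument,
    # so spaces in the paths need no quoting.
    return ["pdftotext", pdf_path, txt_path]


def convert(pdf_path, txt_path):
    # No shell is involved when a list is passed. check=True raises
    # CalledProcessError if pdftotext exits with a non-zero status.
    subprocess.run(build_command(pdf_path, txt_path), check=True)
```

Calling convert('//Home//Sai Krishna Dubagunta.pdf', '//Home//Me.txt') would then either write the text file or raise an exception, rather than only returning a status code.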



Answer 2:


Several Python packages can extract the text from a PDF.

pdftotext

The pdftotext package seems to work pretty well, but it has no options for, e.g., extracting bounding boxes.

Installation

For Ubuntu:

sudo apt-get install build-essential libpoppler-cpp-dev pkg-config python3-dev

Then install the package itself with pip install pdftotext.

Minimal Working Example

import pdftotext

with open("lorem_ipsum.pdf", "rb") as f:
    pdf = pdftotext.PDF(f)

# Iterate over all the pages
for page in pdf:
    print(page)

# Just read the second page (pages are indexed from 0)
print(pdf[1])

# Or read all the text at once
print("\n\n".join(pdf))

PDF miner

Install it with pip install pdfminer.six. The original answer links to a minimal working example.
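
As a rough sketch of what such an example can look like: recent releases of pdfminer.six provide an extract_text function in pdfminer.high_level. The helper name pdf_to_text_file and the UTF-8 output encoding are assumptions made for this illustration, not part of the original answer.

```python
import os


def pdf_to_text_file(pdf_path, txt_path):
    # Fail early with a clear error if the input file is missing.
    if not os.path.isfile(pdf_path):
        raise FileNotFoundError(pdf_path)

    # Imported here so the missing-file check above works even
    # before pdfminer.six is installed.
    from pdfminer.high_level import extract_text

    # Extract the text of all pages as one string.
    text = extract_text(pdf_path)

    # Write the result to the requested text file.
    with open(txt_path, "w", encoding="utf-8") as out:
        out.write(text)
    return text
```

Because the path is passed as a Python string and no shell is involved, a file name with spaces such as 'Sai Krishna Dubagunta.pdf' needs no special handling here.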




Answer 3:


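I think the pdftotext command also works with a single argument: if the output file is omitted, it writes the text to a .txt file with the same name next to the PDF. Try using:

os.system("pdftotext '%s'" % input1)

and see what happens. Hope this helps.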



Source: https://stackoverflow.com/questions/23821204/read-pdf-in-python-and-convert-to-text-in-pdf
