Question
I used this code to convert a PDF to text:
input1 = '//Home//Sai Krishna Dubagunta.pdf'
output = '//Home//Me.txt'
os.system(("pdftotext %s %s") %( input1, output))
I created the Home directory and copied the source file into it.
The output I get is:
1
and no .txt file was created. Where is the problem?
Answer 1:
Your expression
("pdftotext %s %s") %( input1, output)
will translate to
pdftotext //Home//Sai Krishna Dubagunta.pdf //Home//Me.txt
which means that the first parameter passed to pdftotext is //Home//Sai, and the second parameter is Krishna. That obviously won't work.
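To see the splitting for yourself, the short sketch below uses shlex.split from the standard library, which approximates how a POSIX shell breaks a command string into words. The variable names mirror the question; nothing here calls pdftotext itself.

```python
import shlex

input1 = "//Home//Sai Krishna Dubagunta.pdf"
output = "//Home//Me.txt"

# Same formatting expression as in the question, without any quoting.
command = "pdftotext %s %s" % (input1, output)

# shlex.split applies POSIX shell word-splitting rules to the string.
args = shlex.split(command)
for position, arg in enumerate(args):
    print(position, arg)
```

The PDF path ends up spread over three separate arguments, so pdftotext never receives the file name you intended.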
Enclose the parameters in quotes:
os.system("pdftotext '%s' '%s'" % (input1, output))
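Hand-written quotes still break if a file name itself contains a single quote. As a hedged alternative, the sketch below either builds the shell string with shlex.quote or skips the shell entirely by passing an argument list to subprocess.run. The function names shell_command and convert are illustrative and not part of the original answer.

```python
import shlex
import subprocess

input1 = "//Home//Sai Krishna Dubagunta.pdf"
output = "//Home//Me.txt"


def shell_command(pdf_path, txt_path):
    """Build a shell-safe command string; shlex.quote handles spaces and quotes."""
    return "pdftotext %s %s" % (shlex.quote(pdf_path), shlex.quote(txt_path))


def convert(pdf_path, txt_path):
    """Run pdftotext without a shell: each list item is exactly one argument."""
    result = subprocess.run(
        ["pdftotext", pdf_path, txt_path],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        # Surface the tool's own error message instead of a bare exit code.
        raise RuntimeError(result.stderr.strip() or "pdftotext failed")


print(shell_command(input1, output))
```

Calling convert(input1, output) performs the conversion. If the pdftotext executable is not on the PATH, subprocess.run raises FileNotFoundError, which is easier to diagnose than the bare return value of os.system.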
Answer 2:
There are various Python packages for extracting the text from a PDF.
pdftotext
The pdftotext package seems to work pretty well, but it has no options for, e.g., extracting bounding boxes.
Installation
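For Ubuntu, install the build dependencies first (for Python 3, use python3-dev instead of python-dev):
sudo apt-get install build-essential libpoppler-cpp-dev pkg-config python-dev
Then install the package itself:
pip install pdftotext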
Minimal Working Example
import pdftotext

with open("lorem_ipsum.pdf", "rb") as f:
    pdf = pdftotext.PDF(f)

# Iterate over all the pages
for page in pdf:
    print(page)

# Just read the second page
print(pdf.read(2))

# Or read all the text at once
print(pdf.read_all())
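Depending on the installed version, the read() and read_all() calls above may not be available. To my knowledge, the 2.x releases of the pdftotext package make the PDF object behave like a sequence of page strings, so pages are reached by index and joined by hand. The sketch below assumes that behaviour; the helper names are illustrative only.

```python
def second_page(pages):
    """Return the second page; pages are indexed from 0."""
    return pages[1]


def all_text(pages):
    """Join every page into one string, separated by a blank line."""
    return "\n\n".join(pages)


def load_pdf(path):
    """Open a PDF with pdftotext (assumed 2.x sequence-style API)."""
    import pdftotext  # imported lazily so the helpers work without the package

    with open(path, "rb") as f:
        return pdftotext.PDF(f)


# Usage, assuming lorem_ipsum.pdf exists and has at least two pages:
# pdf = load_pdf("lorem_ipsum.pdf")
# print(second_page(pdf))
# print(all_text(pdf))
```

Because the helpers only rely on indexing and iteration, they work equally well on a plain list of strings, which makes them easy to try without a PDF at hand.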
PDFMiner
Install it with pip install pdfminer.six. A minimal working example is here.
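Since the linked example is not reproduced in this text, here is a minimal sketch of my own using the extract_text helper from pdfminer.high_level. The function names text_path_for and pdf_to_text_file are hypothetical, and writing the result next to the PDF is an assumption about what you want.

```python
from pathlib import Path


def text_path_for(pdf_path):
    """Return the .txt path that sits next to the given PDF."""
    return Path(pdf_path).with_suffix(".txt")


def pdf_to_text_file(pdf_path):
    """Extract all text with pdfminer.six and write it next to the PDF."""
    # Imported here so the module loads even without pdfminer.six installed.
    from pdfminer.high_level import extract_text

    text = extract_text(pdf_path)
    out_path = text_path_for(pdf_path)
    out_path.write_text(text, encoding="utf-8")
    return out_path


# Usage, assuming the file exists:
# pdf_to_text_file("//Home//Sai Krishna Dubagunta.pdf")
```

Because the path is passed as a Python string rather than through a shell, spaces in the file name need no quoting here.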
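Answer 3:
I think the output file argument of the pdftotext command is optional: with only the PDF given, it writes a .txt file with the same base name next to the PDF. Try using:
os.system("pdftotext '%s'" % input1)
and see what happens. Hope this helps.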
Source: https://stackoverflow.com/questions/23821204/read-pdf-in-python-and-convert-to-text-in-pdf