Question
I need to extract text from PDF files and have used pdfminer.six successfully to extract both text paragraphs and tables. Now I get an error on this line:
from pdfminer.pdfparser import PDFParser, PDFDocument
The error is:
ImportError: cannot import name 'PDFDocument' from 'pdfminer.pdfparser' (C:\Users[username]\Anaconda3\lib\site-packages\pdfminer\pdfparser.py)
I'm using Jupyter under Anaconda, with Python 3.7.3 and pdfminer.six 20181108.
The code I'm using is based on the question "How to read pdf file using pdfminer3k?".
Following the advice in the issue linked here, I have uninstalled and reinstalled Anaconda, pdfminer.six, and other packages several times: https://github.com/pdfminer/pdfminer.six/issues/196 A week ago it suddenly worked, but now I get the error again.
Since I'm working on Windows 10, I also tried Ubuntu under the Windows Subsystem for Linux, as described here: https://medium.com/hugo-ferreiras-blog/using-windows-subsystem-for-linux-for-data-science-9a8e68d7610c
I got the same error.
Then, based on the web page linked below, I thought it was worth trying to split the import of PDFParser and PDFDocument, from
from pdfminer.pdfparser import PDFParser, PDFDocument
to
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
https://loctv.wordpress.com/2017/02/07/fix-importerror-cannot-import-name-pdfdocument-when-using-slate/ However, that caused new errors later in the code.
The start of my code looks like this:
```
path = [name and path of file]
fp = open(path, 'rb')
from pdfminer.pdfparser import PDFParser, PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTTextBox, LTTextLine
```
I expect the code to run and extract the text from the PDF file, but it stops with the ImportError for PDFDocument from pdfminer.pdfparser.
Any advice on what I should do is much appreciated! Could it have something to do with how pdfminer.six is installed?
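One way to check the installation is to ask Python which module of the installed `pdfminer` package actually defines `PDFDocument`. The sketch below is only a diagnostic. The helper name `find_pdfdocument_module` is hypothetical, and the two candidate module names are an assumption taken from the two import styles shown in this question.

```python
import importlib


def find_pdfdocument_module():
    """Return the name of the pdfminer module that defines PDFDocument, or None."""
    # Candidate locations (assumption): the split layout first, then the
    # older layout where PDFDocument is imported from pdfminer.pdfparser.
    for name in ("pdfminer.pdfdocument", "pdfminer.pdfparser"):
        try:
            module = importlib.import_module(name)
        except ImportError:
            continue  # this module (or the whole package) is not installed
        if hasattr(module, "PDFDocument"):
            return name
    return None


print(find_pdfdocument_module())
```

If this prints `pdfminer.pdfdocument`, the combined import from `pdfminer.pdfparser` cannot work with the installed package and the split imports are the ones to use. If it prints `None`, no usable `pdfminer` package is visible to the interpreter that runs the notebook.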
Answer 1:
I got help from Notodden Serit. Change this:
from pdfminer.pdfparser import PDFParser, PDFDocument
to:
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
Then pass the parser to the PDFDocument constructor. Change:
doc = PDFDocument()
to:
doc = PDFDocument(parser)
Finally, change the page loop from:
for page in doc.get_pages():
to:
for page in PDFPage.create_pages(doc):
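Putting the three changes together, a minimal sketch of the extraction could look like the following. It assumes a pdfminer.six release that uses the split module layout described above. The function name `extract_text_boxes` is hypothetical, and the imports sit inside the function only so that a missing or mismatched package raises its ImportError when the function is called.

```python
def extract_text_boxes(path):
    """Return the text of every text box or text line in the PDF at `path`."""
    # Module locations as given in the answer (split imports).
    from pdfminer.converter import PDFPageAggregator
    from pdfminer.layout import LAParams, LTTextBox, LTTextLine
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage
    from pdfminer.pdfparser import PDFParser

    texts = []
    with open(path, "rb") as fp:
        parser = PDFParser(fp)
        doc = PDFDocument(parser)  # the parser is passed to the constructor

        rsrcmgr = PDFResourceManager()
        device = PDFPageAggregator(rsrcmgr, laparams=LAParams())
        interpreter = PDFPageInterpreter(rsrcmgr, device)

        # PDFPage.create_pages(doc) replaces the old doc.get_pages()
        for page in PDFPage.create_pages(doc):
            interpreter.process_page(page)
            layout = device.get_result()
            for element in layout:
                if isinstance(element, (LTTextBox, LTTextLine)):
                    texts.append(element.get_text())
    return texts
```

Calling `extract_text_boxes("example.pdf")` would return one string per text box, where `example.pdf` is a placeholder for your own file.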
Source: https://stackoverflow.com/questions/56023686/error-cannot-import-name-pdfdocument-from-pdfminer-pdfparser