Using PDFMiner (Python) with online pdf files. Encode the url?

Submitted by 风格不统一 on 2019-12-02 00:08:01

Question


I want to extract the content of PDF files available online using PDFMiner.

My code is based on the example in the documentation, which extracts the content of PDF files stored on the hard disk:

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument

# Open a PDF file in binary mode.
fp = open('mypdf.pdf', 'rb')
# Create a PDF parser object associated with the file object.
parser = PDFParser(fp)
# Create a PDF document object that stores the document structure.
document = PDFDocument(parser)

That works quite well with some small changes.

Now, I have tried urllib2.urlopen for online PDFs, but that doesn't work. I get an error message: coercing to Unicode: need string or buffer, instance found.

How can I get a string (or whatever) from urllib2.urlopen that behaves like what the open function returns when I give it a PDF file name (rather than a URL)?

Please tell me if my question is not clear.


Answer 1:


Well, I finally found a solution.

I resorted to Request and StringIO and got rid of the open('my_file', 'rb') call:

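import urllib2
from urllib2 import Request
from StringIO import StringIO

from pdfminer.pdfparser import PDFParser

url = 'my_url'

# Download the raw PDF bytes (not named `open`, to avoid shadowing the built-in).
pdf_bytes = urllib2.urlopen(Request(url)).read()
# Wrap the bytes in an in-memory file-like object.
memoryFile = StringIO(pdf_bytes)

parser = PDFParser(memoryFile)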

That way, Python treats the downloaded content as a file, so to speak.
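The snippet above is Python 2 code; urllib2 and the StringIO module do not exist on Python 3. A rough Python 3 equivalent of the same idea, assuming pdfminer.six is installed, might look like the sketch below. The function names fetch_pdf_stream and open_remote_pdf and the timeout default are illustrative choices, not part of the original answer.

```python
import io
import urllib.request


def fetch_pdf_stream(url, timeout=30):
    """Download the resource at `url` and return it as a seekable in-memory binary file."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        data = response.read()
    # PDF data is binary, so BytesIO (not StringIO) is the Python 3 counterpart.
    return io.BytesIO(data)


def open_remote_pdf(url):
    """Build a PDFDocument from a URL (assumes pdfminer.six is installed)."""
    # Imported here so fetch_pdf_stream stays usable without pdfminer installed.
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdfparser import PDFParser

    parser = PDFParser(fetch_pdf_stream(url))
    return PDFDocument(parser)
```

Reading the whole response into memory keeps the stream seekable, which the parser needs; for very large PDFs, downloading to a temporary file first may be the better trade-off.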



Source: https://stackoverflow.com/questions/22429193/using-pdfminer-python-with-online-pdf-files-encode-the-url
