Python: urlretrieve PDF downloading

Submitted 2020-01-01 15:35:14

Question


I am using urllib's urlretrieve() function in Python to try to grab some PDFs from websites. It has (at least for me) stopped working and is downloading damaged data (15 KB instead of 164 KB).

I have tested this with several PDFs, all with no success (e.g., random.pdf). I can't seem to get it to work, and I need to be able to download PDFs for the project I am working on.

Here is an example of the kind of code I am using to download the PDFs (and parse the text using pdftotext.exe):

def get_html(url): # gets html of page from Internet
    import os
    import urllib2
    import urllib
    from subprocess import call
    f_name = url.split('/')[-2] # get file name (url must end with '/')
    try:
        if f_name.split('.')[-1] == 'pdf': # file type
            urllib.urlretrieve(url, os.getcwd() + '\\' + f_name)
            call([os.getcwd() + '\\pdftotext.exe', os.getcwd() + '\\' + f_name]) # use xpdf to output .txt file
            return open(os.getcwd() + '\\' + f_name.split('.')[0] + '.txt').read()
        else:
            return urllib2.urlopen(url).read()
    except:
        print 'bad link: ' + url    
        return ""

I am a novice programmer, so any input would be great. Thanks!


Answer 1:


I would suggest trying out requests. It is a really nice library that hides all of the implementation details behind a simple API.

>>> import requests
>>> req = requests.get("http://www.mathworks.com/moler/random.pdf")
>>> len(req.content)
167633
>>> req.headers
{'content-length': '167633', 'accept-ranges': 'bytes', 'server': 'Apache/2.2.3 (Red Hat) mod_jk/1.2.31 PHP/5.3.13 Phusion_Passenger/3.0.9 mod_perl/2.0.4 Perl/v5.8.8', 'last-modified': 'Fri, 15 Feb 2008 17:11:12 GMT', 'connection': 'keep-alive', 'etag': '"30863b-28ed1-446357e3d4c00"', 'date': 'Sun, 03 Feb 2013 05:53:21 GMT', 'content-type': 'application/pdf'}

By the way, the reason you are only getting a 15 KB download is that your URL is wrong. It should be

http://www.mathworks.com/moler/random.pdf

But you are GETting

http://www.mathworks.com/moler/random.pdf/

>>> import requests
>>> c = requests.get("http://www.mathworks.com/moler/random.pdf/")
>>> len(c.content)
14390
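
A quick way to catch this kind of mistake is to check the response before trusting it: a real PDF reports application/pdf in its Content-Type header and starts with the %PDF magic bytes, whereas a broken URL usually returns an HTML error page. A minimal sketch, reusing the correct URL from above:

import requests

req = requests.get("http://www.mathworks.com/moler/random.pdf")

# A genuine PDF identifies itself both in the headers and in its first bytes;
# an HTML error page does neither.
is_pdf = (req.headers.get("content-type", "").startswith("application/pdf")
          and req.content[:4] == b"%PDF")
print(is_pdf)  # True for the real PDF, False for the error page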



Answer 2:


To write the file to disk, open it in binary mode (plain "w" can corrupt the PDF bytes, especially on Windows):

with open("out.pdf", "wb") as myfile:  # "wb": binary mode, so the bytes are written untouched
    myfile.write(req.content)
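
For larger files, it may also be worth streaming the download instead of holding the whole body in memory. A minimal sketch using requests' stream option (the URL is just the example from above):

import requests

url = "http://www.mathworks.com/moler/random.pdf"
with requests.get(url, stream=True) as resp:
    resp.raise_for_status()  # fail loudly on HTTP errors instead of saving an error page
    with open("out.pdf", "wb") as fh:
        # write the body in chunks rather than loading it all at once
        for chunk in resp.iter_content(chunk_size=8192):
            fh.write(chunk)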



Answer 3:


This may be a bit late, but you could try the following: write the content to a new file and read it with textract, since extracting the text without it gave me unwanted text containing '#$'.

import requests
import textract
url = "The url which downloads the file"
response = requests.get(url)
with open('./document.pdf', 'wb') as fw:
    fw.write(response.content)
text = textract.process("./document.pdf")
print('Result: ', text)
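
For completeness, here is one way the original helper could look in Python 3 with requests, folding in the fixes from the answers above. This is only a sketch, not the asker's code; the pdftotext.exe location and the trailing-slash URL convention are assumptions carried over from the question:

import os
import subprocess
import requests

def get_text(url):
    """Return the text of a page, or of a PDF via pdftotext (sketch)."""
    f_name = url.rstrip('/').split('/')[-1]  # file name, with or without a trailing slash
    resp = requests.get(url.rstrip('/'))     # drop the slash that broke the original download
    resp.raise_for_status()
    if f_name.endswith('.pdf'):
        pdf_path = os.path.join(os.getcwd(), f_name)
        with open(pdf_path, 'wb') as fh:     # binary mode, as in Answer 2
            fh.write(resp.content)
        # assumes pdftotext.exe sits in the working directory, as in the question
        subprocess.call([os.path.join(os.getcwd(), 'pdftotext.exe'), pdf_path])
        with open(os.path.splitext(pdf_path)[0] + '.txt') as fh:
            return fh.read()
    return resp.text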


Source: https://stackoverflow.com/questions/14669827/python-urlretrieve-pdf-downloading
