Python: urlretrieve PDF downloading

馋奶兔 提交于 2019-12-04 14:33:17

I would suggest trying out requests. It is a really nice library that hides all of the implementation behind a simple api.

>>> import requests
>>> req = requests.get("http://www.mathworks.com/moler/random.pdf")
>>> len(req.content)
167633
>>> req.headers
{'content-length': '167633', 'accept-ranges': 'bytes', 'server': 'Apache/2.2.3 (Red Hat) mod_jk/1.2.31 PHP/5.3.13 Phusion_Passenger/3.0.9 mod_perl/2.0.4 Perl/v5.8.8', 'last-modified': 'Fri, 15 Feb 2008 17:11:12 GMT', 'connection': 'keep-alive', 'etag': '"30863b-28ed1-446357e3d4c00"', 'date': 'Sun, 03 Feb 2013 05:53:21 GMT', 'content-type': 'application/pdf'}

By the way, the reason you are only getting a 15kb download is because your url is wrong. It should be

http://www.mathworks.com/moler/random.pdf

But you are GETing

http://www.mathworks.com/moler/random.pdf/

>>> import requests
>>> c = requests.get("http://www.mathworks.com/moler/random.pdf/")
>>> len(c.content)
14390

to write the file to disc:

myfile = open("out.pdf", "w")
myfile.write(req.content)

Maybe its a bit late, but you could try this: Just writing the contents to a new file and reading it using textract as doing so without it gave me unwanted text containing '#$'.

import requests
import textract
url = "The url which downloads the file"
response = requests.get(url)
with open('./document.pdf', 'wb') as fw:
    fw.write(response.content)
text = textract.process("./document.pdf")
print('Result: ', text)
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!