How can I retrieve files with User-Agent headers in Python 3?

Submitted by 喜欢而已 on 2019-12-11 12:12:47

Question


I'm trying to write a (simple) piece of code to download files off the internet. The problem is that some of these files are on websites that block Python's default User-Agent header. For example:

import urllib.request as html
html.urlretrieve('http://stackoverflow.com', 'index.html')

returns

urllib.error.HTTPError: HTTP Error 403: Forbidden

Normally, I would set the headers in the request, such as:

import urllib.request as html
request = html.Request('http://stackoverflow.com', headers={"User-Agent":"Firefox"})
response = html.urlopen(request)

However, urlretrieve only accepts a URL string, not a Request object, so this isn't an option.

Are there any simple-ish solutions to this (that don't involve importing a library such as requests)? I've noticed that urlretrieve is part of the legacy interface ported over from Python 2. Is there anything I should be using instead?

I tried creating a custom FancyURLopener class to handle retrieving files, but that caused more problems than it solved, such as creating empty files for links that 404.


Answer 1:


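You can subclass URLopener, set its version class variable to a different User-Agent string, and then download through that opener's retrieve method, which works like urlretrieve. Note that URLopener has been deprecated since Python 3.3.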
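
A related sketch that also keeps urlretrieve, though it is not the URLopener subclass described above: in Python 3, urlretrieve goes through urlopen, so installing a global opener whose addheaders list carries a custom User-Agent should affect it too. The helper name install_user_agent and the "Firefox" value are assumptions chosen for illustration.

```python
import urllib.request


def install_user_agent(user_agent="Firefox"):
    """Install a global opener that sends the given User-Agent.

    Hypothetical helper, not part of urllib. It changes process-wide state:
    every later urlopen() or urlretrieve() call goes through this opener.
    """
    opener = urllib.request.build_opener()
    # Replace the default ("User-agent", "Python-urllib/x.y") entry.
    opener.addheaders = [("User-Agent", user_agent)]
    urllib.request.install_opener(opener)
    return opener


# Usage (performs a network request, so it is left commented out):
# install_user_agent("Firefox")
# urllib.request.urlretrieve("http://stackoverflow.com", "index.html")
```

Because the opener is global, this is probably best suited to small scripts. In larger programs the per-request header approach may be easier to reason about.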

Or you can simply use your second method and save the response to a file only after checking that the status code is 200.
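
A minimal sketch of that second approach, assuming a plain HTTP(S) download. The function names make_request and download are hypothetical, and "Firefox" is simply the value used in the question. The output file is opened only after urlopen has returned, so a request that fails (urlopen raises HTTPError for responses such as 403 or 404) leaves no empty file behind.

```python
import shutil
import urllib.request


def make_request(url, user_agent="Firefox"):
    """Build a Request that carries a custom User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def download(url, filename, user_agent="Firefox"):
    """Save the body of url to filename, but only on a 200 response.

    Hypothetical helper for illustration, not part of urllib.
    """
    request = make_request(url, user_agent)
    # urlopen raises urllib.error.HTTPError on 4xx/5xx responses,
    # so the file below is never created for a failed request.
    with urllib.request.urlopen(request) as response:
        if response.status != 200:
            raise RuntimeError(f"unexpected status {response.status} for {url}")
        with open(filename, "wb") as out_file:
            # Stream the body to disk in chunks instead of reading it all at once.
            shutil.copyfileobj(response, out_file)


# Usage (performs a network request, so it is left commented out):
# download("http://stackoverflow.com", "index.html")
```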



Source: https://stackoverflow.com/questions/32115815/how-can-i-retrieve-files-with-user-agent-headers-in-python-3
