Question
I am trying to download some content using Python's urllib.request. The following command yields an exception:
import urllib.request
print(urllib.request.urlopen("https://fpgroup.foreignpolicy.com/foreign-policy-releases-mayjune-spy-issue/").code)
result:
...
HTTPError: HTTP Error 403: Forbidden
If I use Firefox or links (a command-line browser) I get the content and a status code of 200. If I use lynx, strangely enough, I also get 403.
I expect all methods to work:
- the same way
- successfully
Why is that not the case?
Answer 1:
Most likely the site is blocking scrapers. You can often get past a basic block by sending browser-like request headers, most importantly a User-Agent. See the quoted documentation below for more info.
Quoting from: https://docs.python.org/3/howto/urllib2.html#headers
import urllib.parse
import urllib.request

url = 'http://www.someserver.com/cgi-bin/register.cgi'
user_agent = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'
values = {'name': 'Michael Foord',
          'location': 'Northampton',
          'language': 'Python'}
headers = {'User-Agent': user_agent}

data = urllib.parse.urlencode(values)
data = data.encode('ascii')
req = urllib.request.Request(url, data, headers)
with urllib.request.urlopen(req) as response:
    the_page = response.read()
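Note that the quoted example sends form data with a POST, which the original question doesn't need. For the URL from the question, a plain GET with just a browser-like User-Agent header should be enough. A minimal sketch (the exact User-Agent string is arbitrary; whether the server accepts it depends on its filtering rules):

import urllib.request

url = "https://fpgroup.foreignpolicy.com/foreign-policy-releases-mayjune-spy-issue/"
# Any browser-like User-Agent may do; urllib's default ("Python-urllib/3.x")
# is what such servers typically reject.
req = urllib.request.Request(
    url,
    headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'},
)
with urllib.request.urlopen(req) as response:
    print(response.status)  # expect 200 rather than an HTTP 403 error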
There are many reasons why site owners don't want scripts scraping their sites. For one, scraping consumes their bandwidth. They may not want people to profit by running a scrape bot, or to copy their site's content. You can also think of it like a book: authors want people to read their books, but some of them might not want a robot to scan their books, create a copy, or summarize them.
The second part of your question in the comments is too vague and broad to answer here, as it would invite too many opinion-based answers.
Answer 2:
I tried with this code and everything worked fine. I just added headers to the request. See the example below:
from urllib.request import Request, urlopen
from urllib.error import HTTPError
from time import sleep

def get_url_data(url=""):
    try:
        # Send a browser-like User-Agent so the server does not reject us.
        request = Request(url, headers={'User-Agent':
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36"})
        response = urlopen(request)
        data = response.read().decode("utf8")
        return data
    except HTTPError:
        return None

url = "https://fpgroup.foreignpolicy.com/foreign-policy-releases-mayjune-spy-issue/"

for i in range(50):
    d = get_url_data(url)
    if d is not None:
        print("Attempt %d was a Success" % i)
    else:
        print("Attempt %d was a Failure" % i)
    sleep(1)
Output:
Attempt 0 was a Success
Attempt 1 was a Success
Attempt 2 was a Success
Attempt 3 was a Success
Attempt 4 was a Success
Attempt 5 was a Success
Attempt 6 was a Success
Attempt 7 was a Success
Attempt 8 was a Success
Attempt 9 was a Success
...
Attempt 42 was a Success
Attempt 43 was a Success
Attempt 44 was a Success
Attempt 45 was a Success
Attempt 46 was a Success
Attempt 47 was a Success
Attempt 48 was a Success
Attempt 49 was a Success
Source: https://stackoverflow.com/questions/41469938/why-does-urllib-request-urlopen-sometimes-does-not-work-but-browsers-work