Python's `urllib2`: Why do I get error 403 when I `urlopen` a Wikipedia page?

轻奢々 2020-12-07 11:07

I have a strange bug when trying to urlopen a certain page from Wikipedia. This is the page:

http://en.wikipedia.org/wiki/OpenCola_(drink)

This

6 Answers
  • 2020-12-07 11:26

    I made a workaround for this using a PHP script, which is not blocked by the site I needed.

    It can be accessed like this:

    import urllib2

    # Pass the target URL to the PHP script in the "file" query parameter.
    path = 'http://phillippowers.com/redirects/get.php?file=http://website_you_need_to_load.com'
    req = urllib2.Request(path)
    response = urllib2.urlopen(req)
    vdata = response.read()
    

    This returns the HTML source of the page.

  • 2020-12-07 11:29

    To debug this, you'll need to trap that exception.

    import urllib2

    try:
        f = urllib2.urlopen('http://en.wikipedia.org/wiki/OpenCola_(drink)')
    except urllib2.HTTPError as e:
        # The error object wraps the response body the server sent back.
        print e.fp.read()
    

    When I print the resulting message, it includes the following:

    "English

    Our servers are currently experiencing a technical problem. This is probably temporary and should be fixed soon. Please try again in a few minutes. "
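
The snippet in this answer uses Python 2 syntax. A minimal Python 3 sketch of the same idea follows, assuming the standard-library `urllib.request` and `urllib.error` modules; the helper names `describe_http_error` and `fetch` are made up for illustration.

```python
import urllib.error
import urllib.request


def describe_http_error(err):
    """Return the status line and decoded body of an HTTPError (hypothetical helper)."""
    body = err.read().decode("utf-8", errors="replace")
    return "%d %s\n%s" % (err.code, err.reason, body)


def fetch(url):
    """Return the page bytes, or print the server's error page and return None."""
    try:
        with urllib.request.urlopen(url) as response:
            return response.read()
    except urllib.error.HTTPError as err:
        # A 403 still carries a body, which usually explains the refusal.
        print(describe_http_error(err))
        return None
```

Calling `fetch('http://en.wikipedia.org/wiki/OpenCola_(drink)')` should either return the page or print whatever error page the server sent back.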

  • 2020-12-07 11:30

    Some websites read the headers urllib sends and block access from scripts to avoid 'unnecessary' usage of their servers. I don't know, and can't imagine, why Wikipedia would do this, but have you tried spoofing your headers?

  • 2020-12-07 11:35

    As Jochen Ritzel mentioned, Wikipedia blocks bots.

    However, bots will not get blocked if they use the PHP API (`api.php`). To get the Wikipedia page titled "love":

    http://en.wikipedia.org/w/api.php?format=json&action=query&titles=love&prop=revisions&rvprop=content
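
As a rough Python 3 sketch, the query URL in this answer can be assembled with `urllib.parse.urlencode` instead of by hand, so that titles with spaces or parentheses get escaped. The function names, the `API_ENDPOINT` constant, and the `ExampleScript/0.1` User-Agent string are placeholders, not anything the API prescribes.

```python
import json
import urllib.parse
import urllib.request

API_ENDPOINT = "http://en.wikipedia.org/w/api.php"


def build_query_url(title):
    """Build the api.php URL asking for the revision content of one page."""
    params = [
        ("format", "json"),
        ("action", "query"),
        ("titles", title),
        ("prop", "revisions"),
        ("rvprop", "content"),
    ]
    # urlencode percent-escapes each value and joins the pairs with "&".
    return API_ENDPOINT + "?" + urllib.parse.urlencode(params)


def fetch_page_json(title, user_agent="ExampleScript/0.1"):
    """Fetch and decode the JSON reply; the User-Agent value is a placeholder."""
    req = urllib.request.Request(
        build_query_url(title), headers={"User-Agent": user_agent}
    )
    with urllib.request.urlopen(req) as response:
        return json.loads(response.read().decode("utf-8"))
```

`build_query_url("love")` reproduces the URL given in this answer, and `fetch_page_json("love")` would return the decoded reply as a Python dictionary.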

  • 2020-12-07 11:41

    Wikipedia's stance is:

    Data retrieval: Bots may not be used to retrieve bulk content for any use not directly related to an approved bot task. This includes dynamically loading pages from another website, which may result in the website being blacklisted and permanently denied access. If you would like to download bulk content or mirror a project, please do so by downloading or hosting your own copy of our database.

    That is why Python is blocked. You're supposed to download data dumps.

    Anyway, you can read pages like this in Python 2:

    import urllib2

    url = 'http://en.wikipedia.org/wiki/OpenCola_(drink)'
    req = urllib2.Request(url, headers={'User-Agent': 'Magic Browser'})
    con = urllib2.urlopen(req)
    print con.read()
    

    Or in Python 3:

    import urllib.request

    url = 'http://en.wikipedia.org/wiki/OpenCola_(drink)'
    req = urllib.request.Request(url, headers={'User-Agent': 'Magic Browser'})
    con = urllib.request.urlopen(req)
    print(con.read())
    
  • Websites often filter access by checking whether they are being accessed by a recognised user agent. Wikipedia is just treating your script as a bot and rejecting it. Try spoofing a browser. The following link takes you to an article that shows you how.

    http://wolfprojects.altervista.org/changeua.php
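
One way to change the user agent for every request, rather than per request, is to set it on an opener. This is only a sketch in Python 3 using the standard-library `urllib.request.build_opener`; the helper names `default_user_agent` and `make_opener` are hypothetical.

```python
import urllib.request


def default_user_agent():
    """Return the User-Agent string a fresh opener would send."""
    # A new OpenerDirector starts with one default header: ("User-agent", "Python-urllib/x.y").
    return dict(urllib.request.build_opener().addheaders)["User-agent"]


def make_opener(user_agent):
    """Return an opener that sends the given User-Agent on every request."""
    opener = urllib.request.build_opener()
    # Assigning addheaders replaces the default header list entirely.
    opener.addheaders = [("User-Agent", user_agent)]
    return opener
```

With this in place, `make_opener("Magic Browser").open(url).read()` would fetch a page using the replaced header, while `default_user_agent()` shows the identifier that a server sees when nothing is overridden.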
