Fetch a Wikipedia article with Python

余生分开走 2020-11-27 15:37

I am trying to fetch a Wikipedia article with Python's urllib:

f = urllib.urlopen("http://en.wikipedia.org/w/index.php?title=Albert_Einstein&printable=yes")


        
10 answers
  • 2020-11-27 16:09

    If you only need Wikipedia content (and no specific information about the page itself), you don't need the API: just call index.php with 'action=raw' to get the wikitext, as in:

    'http://en.wikipedia.org/w/index.php?action=raw&title=Main_Page'

    Or, if you want the HTML code, use 'action=render' like in:

    'http://en.wikipedia.org/w/index.php?action=render&title=Main_Page'

    You can also request just one part of the content by adding a parameter such as 'section=3'.

    You can then fetch it using the urllib2 module (as suggested in the chosen answer). However, if you need information about the page itself (such as revisions), you are better off using mwclient, as suggested above.

    Refer to MediaWiki's FAQ if you need more information.
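
    As a minimal sketch of the URL patterns above, the query string can be assembled with the standard library instead of by hand. The helper name build_index_url and the constant INDEX_URL are hypothetical, and the https scheme is an assumption (the URLs quoted above use http). The code targets Python 3.

```python
from urllib.parse import urlencode

# Assumption: https is used here; the URLs quoted in the answer use http.
INDEX_URL = "https://en.wikipedia.org/w/index.php"


def build_index_url(title, action="raw", section=None):
    """Build an index.php URL (hypothetical helper, for illustration only).

    action="raw" asks for wikitext, action="render" asks for HTML.
    section, if given, limits the result to that section number.
    """
    params = {"title": title, "action": action}
    if section is not None:
        params["section"] = section
    # urlencode percent-encodes values and joins them with "&"
    return INDEX_URL + "?" + urlencode(params)
```

    For example, build_index_url("Main_Page", action="render", section=3) yields a URL that requests only the rendered HTML of section 3.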

  • 2020-11-27 16:11

    You need to use urllib2, which supersedes urllib in the Python standard library, in order to change the user agent.

    Straight from the examples:

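    import urllib2  # Python 2 only; Python 3 moved this into urllib.request

    # Build an opener and replace its default headers with a custom User-agent
    opener = urllib2.build_opener()
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]
    # Open the printable version of the article and read the response body
    infile = opener.open('http://en.wikipedia.org/w/index.php?title=Albert_Einstein&printable=yes')
    page = infile.read()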
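
    The snippet above is Python 2. A rough Python 3 equivalent, offered as a sketch rather than a definitive version, passes the header through urllib.request.Request. The names make_request, fetch, and USER_AGENT are hypothetical, the user-agent string is a placeholder, and the https scheme and 30-second timeout are assumptions.

```python
import urllib.request

# Assumption: https instead of the http URL used in the question.
URL = "https://en.wikipedia.org/w/index.php?title=Albert_Einstein&printable=yes"

# Placeholder value; any non-blank string identifying your client would do.
USER_AGENT = "ExampleFetcher/0.1 (contact: you@example.org)"


def make_request(url, user_agent=USER_AGENT):
    """Return a Request object that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, user_agent=USER_AGENT):
    """Download the URL and return the response body as bytes."""
    request = make_request(url, user_agent)
    # The timeout value is an arbitrary choice for this sketch
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read()
```

    Calling fetch(URL) performs the actual network request; make_request alone only builds the request object, which makes the header easy to inspect.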
    
  • 2020-11-27 16:11

    Try changing the user agent header you are sending in your request to something like:

    User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008072820 Ubuntu/8.04 (hardy) Firefox/3.0.1 (Linux Mint)

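  • 2020-11-27 16:13

    import urllib  # Python 2: urllib.urlopen does not exist in Python 3
    # action=raw returns the article's wikitext instead of HTML
    s = urllib.urlopen('http://en.wikipedia.org/w/index.php?action=raw&title=Albert_Einstein').read()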
    

    This seems to work for me without changing the user agent. Without the "action=raw" it does not work for me.

  • 2020-11-27 16:15

    You don't need to impersonate a browser user-agent; any user-agent at all will work, just not a blank one.

  • 2020-11-27 16:17

    Requesting the page with ?printable=yes gives you an entire, relatively clean HTML document. ?action=render gives you just the body HTML. Parsing the page through the MediaWiki action API with action=parse likewise gives you just the body HTML, but it is a good choice if you want finer control; see the parse API help.

    If you just want the page HTML so you can render it, it is faster and better to use the new RESTBase API, which returns a cached HTML representation of the page. In this case, that is https://en.wikipedia.org/api/rest_v1/page/html/Albert_Einstein.

    As of November 2015, you don't have to set your user-agent, but it's strongly encouraged. Also, nearly all Wikimedia wikis require HTTPS, so make https requests to avoid a 301 redirect.
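
    As an illustration of the REST URL pattern quoted above, the sketch below builds such a URL from a page title in Python 3. The helper name rest_html_url is hypothetical. Replacing spaces with underscores and percent-encoding the rest of the title (including "/") is an assumption about how titles should be passed, so check it against the API documentation before relying on it.

```python
from urllib.parse import quote

# Base path taken from the example URL in the answer above
REST_HTML_BASE = "https://en.wikipedia.org/api/rest_v1/page/html/"


def rest_html_url(title):
    """Build the REST HTML URL for a page title (hypothetical helper).

    Assumption: spaces become underscores and every other reserved
    character, including "/", is percent-encoded.
    """
    normalized = title.replace(" ", "_")
    # safe="" makes quote() encode "/" as well
    return REST_HTML_BASE + quote(normalized, safe="")
```

    The resulting URL can then be fetched over https like any other, ideally with a descriptive user-agent as recommended above.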
