Extract the first paragraph from a Wikipedia article (Python)

闹比i 2020-11-28 01:36

How can I extract the first paragraph from a Wikipedia article, using Python?

For example, for Albert Einstein, that would be:

10 answers
  • 2020-11-28 01:52

    First, I promise I am not being snarky.

    Here's a previous question that might be of use: Fetch a Wikipedia article with Python

    In it, someone suggests using the high-level Wikipedia API, which leads to this question:

    Is there a Wikipedia API?

  • Some time ago I wrote two classes for getting Wikipedia articles as plain text. I know they aren't the best solution, but you can adapt them to your needs:

        wikipedia.py
        wiki2plain.py

    You can use it like this:

    from wikipedia import Wikipedia
    from wiki2plain import Wiki2Plain
    
    lang = 'simple'
    wiki = Wikipedia(lang)
    
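    try:
        # Fetch the raw wikitext of the article
        raw = wiki.article('Uruguay')
    except Exception:
        # Treat any fetch failure as "no article"
        raw = None

    content = None
    if raw:
        # Strip the wiki markup to get plain text
        wiki2plain = Wiki2Plain(raw)
        content = wiki2plain.text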
    
  • 2020-11-28 01:59

    I wrote a Python library that aims to make this very easy. Check it out on GitHub.

    To install it, run

    $ pip install wikipedia
    

    Then to get the first paragraph of an article, just use the wikipedia.summary function.

    >>> import wikipedia
    >>> print(wikipedia.summary("Albert Einstein", sentences=2))
    

    prints

    Albert Einstein (/ˈælbərt ˈaɪnstaɪn/; German: [ˈalbɐt ˈaɪnʃtaɪn] ( listen); 14 March 1879 – 18 April 1955) was a German-born theoretical physicist who developed the general theory of relativity, one of the two pillars of modern physics (alongside quantum mechanics). While best known for his mass–energy equivalence formula E = mc2 (which has been dubbed "the world's most famous equation"), he received the 1921 Nobel Prize in Physics "for his services to theoretical physics, and especially for his discovery of the law of the photoelectric effect".

    As for how it works, the wikipedia library makes a request to the Mobile Frontend Extension of the MediaWiki API, which returns mobile-friendly versions of Wikipedia articles. Specifically, when you pass the parameters prop=extracts&exsectionformat=plain, the MediaWiki servers parse the wikitext and return a plain-text summary of the article you requested, up to and including the entire page text. The API also accepts the parameters exchars and exsentences, which limit the number of characters and sentences returned.

  • 2020-11-28 01:59

    The relatively new REST API has a summary method that is perfect for this use, and does a lot of the things mentioned in the other answers here (e.g. removing wikicode). It even includes an image and geocoordinates if applicable.

    Using the lovely requests module and Python 3:

    import requests
    r = requests.get("https://en.wikipedia.org/api/rest_v1/page/summary/Amsterdam")
    page = r.json()
    print(page["extract"]) # Returns 'Amsterdam is the capital and...'
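The image and coordinates mentioned above can be read from the same JSON. Here is a minimal sketch, assuming the summary response carries the image URL under `thumbnail.source` and the position under `coordinates.lat` / `coordinates.lon`; the helper names `summary_url` and `pick_summary_fields` are hypothetical and not part of any library.

```python
from urllib.parse import quote


def summary_url(title, lang="en"):
    """Build the REST summary URL for a title (spaces become underscores)."""
    slug = quote(title.replace(" ", "_"), safe="")
    return "https://{}.wikipedia.org/api/rest_v1/page/summary/{}".format(lang, slug)


def pick_summary_fields(page):
    """Pull the extract plus the optional image and coordinates from a decoded summary.

    The key names below are assumptions about the response layout;
    missing keys simply yield None.
    """
    coords = page.get("coordinates") or {}
    thumb = page.get("thumbnail") or {}
    return {
        "extract": page.get("extract", ""),
        "image": thumb.get("source"),
        "lat": coords.get("lat"),
        "lon": coords.get("lon"),
    }
```

With requests, this would be used as `pick_summary_fields(requests.get(summary_url("Amsterdam")).json())`.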
    
  • 2020-11-28 02:01

    Wikipedia runs a MediaWiki extension that provides exactly this functionality as an API module. TextExtracts implements action=query&prop=extracts with options to return the first N sentences and/or just the introduction, as HTML or plain text.

    Here's the API call you want to make, try it: https://en.wikipedia.org/w/api.php?action=query&prop=extracts&titles=Albert%20Einstein&exintro=&exsentences=2&explaintext=&redirects=&formatversion=2

    • action=query&prop=extracts to request this info
    • exsentences=2, exintro= and explaintext= are parameters to the module (see the first link for its API doc) asking for two sentences from the intro as plain text; leave off explaintext for HTML.
    • redirects=(true) so if you ask for "titles=Einstein" you'll get the Albert Einstein page info
    • formatversion=2 for a cleaner format in UTF-8.
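The parameters listed above can be assembled and the reply unpacked roughly as follows. This is a sketch, not a definitive client: `build_extract_url` and `first_extract` are hypothetical helper names, `format=json` is an addition not present in the URL above (it asks for raw JSON rather than the browser-friendly view), and the parsing assumes that with formatversion=2 the `query.pages` value is a list.

```python
from urllib.parse import urlencode

API_URL = "https://en.wikipedia.org/w/api.php"


def build_extract_url(title, sentences=2):
    """Build the action=query&prop=extracts URL for one title."""
    params = {
        "action": "query",
        "prop": "extracts",
        "titles": title,
        "exintro": "",             # intro section only
        "exsentences": sentences,  # limit the number of sentences
        "explaintext": "",         # plain text instead of HTML
        "redirects": "",           # follow redirects such as "Einstein"
        "formatversion": 2,
        "format": "json",          # assumption: added so the reply is raw JSON
    }
    return API_URL + "?" + urlencode(params)


def first_extract(data):
    """Return the extract of the first page in a decoded reply, or None."""
    pages = data.get("query", {}).get("pages", [])
    if not pages:
        return None
    # A missing page has no "extract" key, so .get() yields None
    return pages[0].get("extract")
```

With requests, `first_extract(requests.get(build_extract_url("Albert Einstein")).json())` would return the two-sentence intro.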

    There are various libraries that wrap invoking the MediaWiki action API, such as the one in DGund's answer, but it's not too hard to make the API calls yourself.

    Page info in search results discusses getting this text extract, along with getting a description and lead image for articles.

  • 2020-11-28 02:05

    Try a combination of urllib to fetch the site and BeautifulSoup or lxml to parse the data.
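As a rough sketch of the parsing half of that approach: the code below swaps in the standard library's html.parser in place of BeautifulSoup or lxml, so it has no third-party dependency. The names `FirstParagraph` and `first_paragraph` are hypothetical, and the code assumes the article's lead is the first non-empty `<p>` element in the HTML, which may not hold for every page.

```python
from html.parser import HTMLParser


class FirstParagraph(HTMLParser):
    """Collect the text of the first non-empty <p> element."""

    def __init__(self):
        super().__init__()
        self.in_p = False   # True while inside a candidate <p>
        self.parts = []     # text fragments of the current <p>
        self.result = None  # set once a non-empty paragraph is found

    def handle_starttag(self, tag, attrs):
        if tag == "p" and self.result is None:
            self.in_p = True
            self.parts = []

    def handle_endtag(self, tag):
        if tag == "p" and self.in_p:
            self.in_p = False
            text = "".join(self.parts).strip()
            if text:  # skip empty placeholder paragraphs
                self.result = text

    def handle_data(self, data):
        # Text inside nested tags (<b>, <a>, ...) is kept; the tags are dropped
        if self.in_p:
            self.parts.append(data)


def first_paragraph(html):
    """Return the text of the first non-empty paragraph, or None."""
    parser = FirstParagraph()
    parser.feed(html)
    parser.close()
    return parser.result
```

The HTML itself could be fetched with `urllib.request.urlopen(url).read().decode("utf-8")`. Footnote markers such as "[1]" stay in the text, so the API-based answers are likely to give cleaner output.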
