Extract the first paragraph from a Wikipedia article (Python)

闹比i 2020-11-28 01:36

How can I extract the first paragraph from a Wikipedia article, using Python?

For example, for Albert Einstein, that would be:

<
10 Answers
  • 2020-11-28 02:11

    As others have said, one approach is to use the MediaWiki API with urllib or urllib2 (urllib.request in Python 3). The code fragments below are part of what I used to extract the "lead" section, which contains the article abstract and the infobox. The code checks whether the returned text is a redirect instead of actual content, and also lets you skip the infobox if present (in my case I used different code to pull out and format the infobox).

    import urllib.parse
    import urllib.request
    
    contentBaseURL='http://en.wikipedia.org/w/index.php?title='
    
    def getContent(title):
        # action=raw returns wiki markup; section=0 limits it to the lead section
        URL=contentBaseURL+urllib.parse.quote(title)+'&action=raw&section=0'
        # Send a browser-like User-Agent; the default urllib one may get a 403
        request=urllib.request.Request(URL, headers={'User-Agent': 'Mozilla/5.0'})
        with urllib.request.urlopen(request) as f:
            rawContent=f.read().decode('utf-8')
        return rawContent
    
    # Wrapper so the fragment below runs as a unit (the name is illustrative)
    def getLead(title):
        rawContent = getContent(title)
    
        # Check if a redirect was returned.  If so, go to the redirection target
        if rawContent.upper().startswith('#REDIRECT'):
            # Extract the redirection title from between "[[" and "]]"
            redirectStart = rawContent.find('[[') + 2
            redirectEnd = rawContent.find(']]', redirectStart)
            redirectTitle = rawContent[redirectStart:redirectEnd]
            print('redirectTitle is: ', redirectTitle)
            rawContent = getContent(redirectTitle)
    
        # Skip the Infobox by finding the "}}" that balances its opening "{{"
        infoboxStart = rawContent.find("{{Infobox")
        infoboxEnd = 0
        if infoboxStart != -1:
            count = 0
            for i, char in enumerate(rawContent[infoboxStart:]):
                if char == "{": count += 1
                if char == "}":
                    count -= 1
                    if count == 0:
                        infoboxEnd = i+infoboxStart+1
                        break
    
        if infoboxEnd != 0:
            rawContent = rawContent[infoboxEnd:]
        return rawContent
    

    You'll get back the raw text, including wiki markup, so you'll need to do some cleanup. If you just want the first paragraph rather than the whole first section, look for the first newline character.
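
    The "first newline" idea could be sketched roughly as follows. This is only an illustration, not the code I used: the function name first_paragraph and the list of markup prefixes to skip are assumptions, and real articles may need more cleanup than this.

```python
def first_paragraph(raw_content):
    """Return the first non-empty line that looks like prose, or '' if none."""
    for line in raw_content.splitlines():
        line = line.strip()
        # Assumption: lines opening with these prefixes are markup, not prose
        if not line or line.startswith(("{{", "}}", "|", "[[File:", "[[Image:", "<!--")):
            continue
        return line
    return ""
```

    The result still contains inline markup such as '''bold''' and [[links]], which would have to be stripped separately.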

  • 2020-11-28 02:12

    If you want library suggestions, BeautifulSoup and urllib2 come to mind. This has been answered on SO before: Web scraping with Python.

    I tried using urllib2 to fetch a page from Wikipedia, but got a 403 (Forbidden) response. MediaWiki provides an API for Wikipedia that supports various output formats. I haven't used python-wikitools, but it may be worth a try: http://code.google.com/p/python-wikitools/
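
    For the API route, a minimal sketch might look like the one below. It assumes the extracts query (prop=extracts with exintro and explaintext) is available on the target wiki, as it is on English Wikipedia; the function names and the User-Agent string are made up for illustration.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"


def build_extract_url(title):
    """Build a query URL asking for the plain-text intro of one article."""
    params = {
        "action": "query",
        "prop": "extracts",
        "exintro": 1,        # only the section before the first heading
        "explaintext": 1,    # plain text instead of HTML
        "redirects": 1,      # follow redirects on the server side
        "format": "json",
        "titles": title,
    }
    return API_URL + "?" + urllib.parse.urlencode(params)


def parse_extract(response_text):
    """Pull the 'extract' field out of the first page in the JSON response."""
    pages = json.loads(response_text)["query"]["pages"]
    for page in pages.values():
        return page.get("extract", "")
    return ""


def fetch_intro(title, user_agent="first-paragraph-example/0.1"):
    # Hypothetical User-Agent; sent because the default urllib one may get a 403
    request = urllib.request.Request(
        build_extract_url(title), headers={"User-Agent": user_agent}
    )
    with urllib.request.urlopen(request) as response:
        return parse_extract(response.read().decode("utf-8"))
```

    Calling fetch_intro("Albert Einstein") would return the whole intro; splitting that on "\n" and taking the first piece should give the first paragraph.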

  • 2020-11-28 02:12

    Try the pattern library.

    pip install pattern
    
    from pattern.web import Wikipedia
    article = Wikipedia(language="af").search('Kaapstad', throttle=10)
    print(article.string)
    
  • 2020-11-28 02:18

    What I did is this:

    import urllib.parse
    import urllib.request
    from bs4 import BeautifulSoup
    
    article = "Albert Einstein"
    article = urllib.parse.quote(article)  # escape spaces for use in the URL
    
    opener = urllib.request.build_opener()
    opener.addheaders = [('User-agent', 'Mozilla/5.0')] #wikipedia needs this
    
    resource = opener.open("http://en.wikipedia.org/wiki/" + article)
    data = resource.read()
    resource.close()
    soup = BeautifulSoup(data, 'html.parser')
    # First <p> inside the main content div
    print(soup.find('div', id="bodyContent").p)
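
    If BeautifulSoup is not available, a similar result can probably be had with the standard library's html.parser. The sketch below is an assumption-laden alternative, not what I ran: it takes the first paragraph that contains visible text anywhere in the page (it does not restrict itself to the bodyContent div), which also skips any empty paragraph the page may start with.

```python
from html.parser import HTMLParser


class FirstParagraphParser(HTMLParser):
    """Collects the text of the first <p> that contains visible text."""

    def __init__(self):
        super().__init__()
        self.in_paragraph = False  # True while inside a <p>
        self.chunks = []           # text pieces of the current paragraph
        self.result = None         # first non-empty paragraph, once found

    def handle_starttag(self, tag, attrs):
        if tag == "p" and self.result is None:
            self.in_paragraph = True
            self.chunks = []

    def handle_endtag(self, tag):
        if tag == "p" and self.in_paragraph and self.result is None:
            text = "".join(self.chunks).strip()
            if text:
                self.result = text
            self.in_paragraph = False

    def handle_data(self, data):
        if self.in_paragraph and self.result is None:
            self.chunks.append(data)


def first_paragraph_text(html):
    """Return the text of the first non-empty <p>, or None if there is none."""
    parser = FirstParagraphParser()
    parser.feed(html)
    return parser.result
```

    Passing the decoded page source (for example data.decode('utf-8')) to first_paragraph_text returns plain text with the inline tags dropped.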
    