How can I extract the first paragraph from a Wikipedia article, using Python?
For example, for Albert Einstein, that would be the opening paragraph of the article.
As others have said, one approach is to use the MediaWiki API together with urllib or urllib2. The code fragments below are part of what I used to extract the "lead" section, which contains the article abstract and the infobox. They check whether the returned text is a redirect rather than actual content, and also let you skip the infobox if one is present (in my case, I used separate code to pull out and format the infobox).
import urllib

contentBaseURL = 'http://en.wikipedia.org/w/index.php?title='

def getContent(title):
    URL = contentBaseURL + title + '&action=raw&section=0'
    f = urllib.urlopen(URL)
    rawContent = f.read()
    f.close()
    return rawContent

rawContent = getContent(title)
infoboxPresent = 0
# Check if a redirect was returned. If so, go to the redirection target
if rawContent.find('#REDIRECT') == 0:
    # extract the redirection title
    redirectStart = rawContent.find('#REDIRECT[[') + 11
    redirectEnd = rawContent.find(']]', redirectStart)
    redirectTitle = rawContent[redirectStart:redirectEnd]
    print 'redirectTitle is: ', redirectTitle
    rawContent = getContent(redirectTitle)
# Skip the Infobox
infoboxStart = rawContent.find("{{Infobox")  # actually starts at the double {'s before "Infobox"
count = 0
infoboxEnd = 0
if infoboxStart != -1:
    for i, char in enumerate(rawContent[infoboxStart:]):
        if char == "{":
            count += 1
        if char == "}":
            count -= 1
            if count == 0:
                infoboxEnd = i + infoboxStart + 1
                break
if infoboxEnd != 0:
    rawContent = rawContent[infoboxEnd:]
You'll be getting back raw text including wiki markup, so you'll need to do some cleanup. If you just want the first paragraph rather than the whole first section, look for the first newline character.
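As a rough sketch of that cleanup step (the function name and the blank-line heuristic are mine, not part of the answer above): paragraphs in wikitext are separated by blank lines, so after skipping the infobox you can take the first non-empty block. This doesn't strip templates or link markup, which you would still need to handle.

```python
def first_paragraph(wikitext):
    # Wikitext separates paragraphs with blank lines, so split on them
    # and return the first block that contains any text.
    for block in wikitext.split('\n\n'):
        block = block.strip()
        if block:
            return block
    return ''
```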
If you want library suggestions, BeautifulSoup and urllib2 come to mind. Answered on SO before: Web scraping with Python.
I tried urllib2 to get a page from Wikipedia, but it returned 403 (Forbidden). MediaWiki provides an API for Wikipedia that supports various output formats. I haven't used python-wikitools, but it may be worth a try: http://code.google.com/p/python-wikitools/
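For instance, the API's query action can return just the intro as plain text via the TextExtracts extension, which is enabled on Wikipedia. A minimal sketch of building the request URL (build_intro_url is a helper name of my own; the parameters prop=extracts, exintro, and explaintext are the API's documented ones):

```python
try:
    from urllib.parse import urlencode   # Python 3
except ImportError:
    from urllib import urlencode         # Python 2

API = 'http://en.wikipedia.org/w/api.php'

def build_intro_url(title):
    # exintro limits the extract to the text before the first heading;
    # explaintext asks for plain text instead of HTML.
    params = {
        'action': 'query',
        'prop': 'extracts',
        'exintro': '',
        'explaintext': '',
        'format': 'json',
        'titles': title,
    }
    return API + '?' + urlencode(params)
```

Fetching that URL (with a User-agent header set, to avoid the 403) returns JSON whose pages entries carry an "extract" field with the intro text.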
Try the pattern library.
pip install pattern
from pattern.web import Wikipedia
article = Wikipedia(language="af").search('Kaapstad', throttle=10)
print article.string
What I did is this:
import urllib
import urllib2
from BeautifulSoup import BeautifulSoup
article = "Albert Einstein"
article = urllib.quote(article)
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')] #wikipedia needs this
resource = opener.open("http://en.wikipedia.org/wiki/" + article)
data = resource.read()
resource.close()
soup = BeautifulSoup(data)
print soup.find('div',id="bodyContent").p
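One caveat with grabbing the first p tag: on some articles it is empty or holds only hatnote/coordinate text. A stdlib-only sketch of taking the first paragraph with real text instead, using Python 3's html.parser (the FirstParagraph class name and first_nonempty_p helper are my own, for illustration):

```python
from html.parser import HTMLParser

class FirstParagraph(HTMLParser):
    # Collects the text of the first <p> element that is non-empty
    # after stripping whitespace.
    def __init__(self):
        HTMLParser.__init__(self)
        self.in_p = False
        self.chunks = []
        self.result = None

    def handle_starttag(self, tag, attrs):
        if tag == 'p' and self.result is None:
            self.in_p = True
            self.chunks = []

    def handle_data(self, data):
        if self.in_p:
            self.chunks.append(data)

    def handle_endtag(self, tag):
        if tag == 'p' and self.in_p:
            self.in_p = False
            text = ''.join(self.chunks).strip()
            if text and self.result is None:
                self.result = text

def first_nonempty_p(html):
    parser = FirstParagraph()
    parser.feed(html)
    return parser.result
```

You would feed it the bodyContent div's HTML; BeautifulSoup users can get the same effect by iterating over the div's p tags and keeping the first one whose text is non-empty.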