Removing HTML tags when crawling Wikipedia with Python's urllib2 and BeautifulSoup

感动是毒 2021-01-14 10:03

I am trying to crawl Wikipedia to get some data for text mining. I am using Python's urllib2 and BeautifulSoup. My question is: is there an easy way of getting rid of the HTML tags so that only the plain text remains?

3 Answers
  • 2021-01-14 10:10
    Calling a tag with text=True is a BeautifulSoup shortcut for findAll(text=True); it returns only the text nodes, with the markup already stripped (see the fuller sketch below):

    paragraphs = soup.find('div', id='bodyContent')
    for p in paragraphs(text=True):
        print p
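
    A minimal self-contained version of that approach, assuming the printable index.php URL from the question and the BeautifulSoup 3 import path (use "from bs4 import BeautifulSoup" for version 4):

    import urllib2
    from BeautifulSoup import BeautifulSoup

    URL = "http://en.wikipedia.org/w/index.php?title=data_mining&printable=yes"
    request = urllib2.Request(URL, headers={'User-Agent': 'Mozilla/5.0'})
    soup = BeautifulSoup(urllib2.urlopen(request))

    # keep only the text nodes of the article body, dropping all tags
    content = soup.find('div', id='bodyContent')
    print u''.join(content(text=True)).encode('utf-8')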
    

    Additionally you could use api.php instead of index.php:

    #!/usr/bin/env python
    import sys
    import time
    import urllib, urllib2
    import xml.etree.cElementTree as etree
    
    # prepare request
    maxattempts = 5 # how many times to try the request before giving up
    maxlag = 5 # seconds http://www.mediawiki.org/wiki/Manual:Maxlag_parameter
    params = dict(action="query", format="xml", maxlag=maxlag,
                  prop="revisions", rvprop="content", rvsection=0,
                  titles="data_mining")
    request = urllib2.Request(
        "http://en.wikipedia.org/w/api.php?" + urllib.urlencode(params), 
        headers={"User-Agent": "WikiDownloader/1.2",
                 "Referer": "http://stackoverflow.com/q/8044814"})
    # make request
    for _ in range(maxattempts):
        response = urllib2.urlopen(request)
        if response.headers.get('MediaWiki-API-Error') == 'maxlag':
            t = response.headers.get('Retry-After', 5)
            print "retrying in %s seconds" % (t,)
            time.sleep(float(t))
        else:
            break # ready to read
    else: # exhausted all attempts
        sys.exit(1)
    
    # download & parse xml 
    tree = etree.parse(response)
    
    # find rev data 
    rev_data = tree.findtext('.//rev')
    if not rev_data:
        print 'MediaWiki-API-Error:', response.headers.get('MediaWiki-API-Error')
        tree.write(sys.stdout)
        print
        sys.exit(1)
    
    print rev_data
    

    Output

    {{Distinguish|analytics|information extraction|data analysis}}
    
    '''Data mining''' (the analysis step of the '''knowledge discovery in databases..
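
    The API returns raw wikitext rather than HTML, so template and quote markup such as {{Distinguish|...}} and '''...''' still has to be stripped. A sketch of one way to do that (mwparserfromhell is an assumption here, not part of the original answer):

    import mwparserfromhell  # pip install mwparserfromhell

    # rev_data is the wikitext string fetched above
    wikicode = mwparserfromhell.parse(rev_data)
    print wikicode.strip_code()  # drops templates, link syntax and quote markup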
    
  • 2021-01-14 10:14

    These work on BeautifulSoup tag nodes: the parent node is modified in place so that the relevant tags are removed, and the extracted tags are returned as a list to the caller.

    from bs4 import element  # bs4's Comment type lives in bs4.element

    @staticmethod
    def seperateCommentTags(parentNode):
        # collect first, then extract, so the tree is not mutated mid-iteration
        commentTags = []
        for descendant in parentNode.descendants:
            if isinstance(descendant, element.Comment):
                commentTags.append(descendant)
        for commentTag in commentTags:
            commentTag.extract()
        return commentTags

    @staticmethod
    def seperateScriptTags(parentNode):
        scripts = []
        for scripttag in parentNode.find_all('script'):
            script = scripttag.extract()  # removes the tag and returns it
            if script is not None:
                scripts.append(script)
        return scripts
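
    A usage sketch, assuming the two methods sit on a hypothetical helper class named Cleaner (the class name and the sample HTML are assumptions, not part of the original answer):

    from bs4 import BeautifulSoup

    html = "<div>text<!-- a comment --><script>var x = 1;</script></div>"
    soup = BeautifulSoup(html, "html.parser")
    comments = Cleaner.seperateCommentTags(soup)  # Cleaner is the assumed enclosing class
    scripts = Cleaner.seperateScriptTags(soup)
    print soup.get_text()  # prints: text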
    
  • 2021-01-14 10:25

    This is how you could do it with lxml (and the lovely requests):

    import requests
    import lxml.html as lh
    from BeautifulSoup import UnicodeDammit
    
    URL = "http://en.wikipedia.org/w/index.php?title=data_mining&printable=yes"
    HEADERS = {'User-agent': 'Mozilla/5.0'}
    
    def lhget(*args, **kwargs):
        r = requests.get(*args, **kwargs)
        html = UnicodeDammit(r.content).unicode
        tree = lh.fromstring(html)
        return tree
    
    def remove(el):
        el.getparent().remove(el)
    
    tree = lhget(URL, headers=HEADERS)
    
    el = tree.xpath("//div[@class='mw-content-ltr']/p")[0]
    
    # ".//" keeps the search relative to el; a bare "//" would scan the whole document
    for ref in el.xpath(".//sup[@class='reference']"):
        remove(ref)
    
    print lh.tostring(el, pretty_print=True)
    
    print el.text_content()
    