Removing HTML tags when crawling Wikipedia with Python's urllib2 and BeautifulSoup

感动是毒 2021-01-14 10:03

I am trying to crawl Wikipedia to get some data for text mining. I am using Python's urllib2 and BeautifulSoup. My question is: is there an easy way of getting rid of the HTML tags so that only the plain text remains?

3 Answers
  • 2021-01-14 10:10
    Calling a tag with text=True is a BeautifulSoup shortcut for findAll(text=True); it returns only the text nodes, with the markup already stripped (see the fuller sketch below):

    paragraphs = soup.find('div', id='bodyContent')
    for p in paragraphs(text=True):
        print p
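
    A minimal self-contained version of that approach, assuming the printable index.php URL from the question and the BeautifulSoup 3 import path (use "from bs4 import BeautifulSoup" for version 4):

    import urllib2
    from BeautifulSoup import BeautifulSoup

    URL = "http://en.wikipedia.org/w/index.php?title=data_mining&printable=yes"
    request = urllib2.Request(URL, headers={'User-Agent': 'Mozilla/5.0'})
    soup = BeautifulSoup(urllib2.urlopen(request))

    # keep only the text nodes of the article body, dropping all tags
    content = soup.find('div', id='bodyContent')
    print u''.join(content(text=True)).encode('utf-8')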
    

    Additionally you could use api.php instead of index.php:

    #!/usr/bin/env python
    import sys
    import time
    import urllib, urllib2
    import xml.etree.cElementTree as etree
    
    # prepare request
    maxattempts = 5 # how many times to try the request before giving up
    maxlag = 5 # seconds http://www.mediawiki.org/wiki/Manual:Maxlag_parameter
    params = dict(action="query", format="xml", maxlag=maxlag,
                  prop="revisions", rvprop="content", rvsection=0,
                  titles="data_mining")
    request = urllib2.Request(
        "http://en.wikipedia.org/w/api.php?" + urllib.urlencode(params), 
        headers={"User-Agent": "WikiDownloader/1.2",
                 "Referer": "http://stackoverflow.com/q/8044814"})
    # make request
    for _ in range(maxattempts):
        response = urllib2.urlopen(request)
        if response.headers.get('MediaWiki-API-Error') == 'maxlag':
            t = response.headers.get('Retry-After', 5)
            print "retrying in %s seconds" % (t,)
            time.sleep(float(t))
        else:
            break # ready to read
    else: # exhausted all attempts
        sys.exit(1)
    
    # download & parse xml 
    tree = etree.parse(response)
    
    # find rev data 
    rev_data = tree.findtext('.//rev')
    if not rev_data:
        print 'MediaWiki-API-Error:', response.headers.get('MediaWiki-API-Error')
        tree.write(sys.stdout)
        print
        sys.exit(1)
    
    print rev_data
    

    Output

    {{Distinguish|analytics|information extraction|data analysis}}
    
    '''Data mining''' (the analysis step of the '''knowledge discovery in databases..
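
    The API returns raw wikitext rather than HTML, so template and quote markup such as {{Distinguish|...}} and '''...''' still has to be stripped. A sketch of one way to do that (mwparserfromhell is an assumption here, not part of the original answer):

    import mwparserfromhell  # pip install mwparserfromhell

    # rev_data is the wikitext string fetched above
    wikicode = mwparserfromhell.parse(rev_data)
    print wikicode.strip_code()  # drops templates, link syntax and quote markup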
    
  • 2021-01-14 10:14

    These work on BeautifulSoup tag nodes: the parent node is modified in place so that the relevant tags are removed, and the extracted tags are returned as a list to the caller.

    from bs4 import element  # bs4's Comment type lives in bs4.element

    @staticmethod
    def seperateCommentTags(parentNode):
        # collect first, then extract, so the tree is not mutated mid-iteration
        commentTags = []
        for descendant in parentNode.descendants:
            if isinstance(descendant, element.Comment):
                commentTags.append(descendant)
        for commentTag in commentTags:
            commentTag.extract()
        return commentTags

    @staticmethod
    def seperateScriptTags(parentNode):
        scripts = []
        for scripttag in parentNode.find_all('script'):
            script = scripttag.extract()  # removes the tag and returns it
            if script is not None:
                scripts.append(script)
        return scripts
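
    A usage sketch, assuming the two methods sit on a hypothetical helper class named Cleaner (the class name and the sample HTML are assumptions, not part of the original answer):

    from bs4 import BeautifulSoup

    html = "<div>text<!-- a comment --><script>var x = 1;</script></div>"
    soup = BeautifulSoup(html, "html.parser")
    comments = Cleaner.seperateCommentTags(soup)  # Cleaner is the assumed enclosing class
    scripts = Cleaner.seperateScriptTags(soup)
    print soup.get_text()  # prints: text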
    
  • 2021-01-14 10:25

    This is how you could do it with lxml (and the lovely requests):

    import requests
    import lxml.html as lh
    from BeautifulSoup import UnicodeDammit
    
    URL = "http://en.wikipedia.org/w/index.php?title=data_mining&printable=yes"
    HEADERS = {'User-agent': 'Mozilla/5.0'}
    
    def lhget(*args, **kwargs):
        r = requests.get(*args, **kwargs)
        html = UnicodeDammit(r.content).unicode
        tree = lh.fromstring(html)
        return tree
    
    def remove(el):
        el.getparent().remove(el)
    
    tree = lhget(URL, headers=HEADERS)
    
    el = tree.xpath("//div[@class='mw-content-ltr']/p")[0]
    
    # ".//" keeps the search relative to el; a bare "//" would scan the whole document
    for ref in el.xpath(".//sup[@class='reference']"):
        remove(ref)
    
    print lh.tostring(el, pretty_print=True)
    
    print el.text_content()
    