Removing HTML tags when crawling Wikipedia with Python's urllib2 and BeautifulSoup

感动是毒 2021-01-14 10:03

I am trying to crawl Wikipedia to get some data for text mining. I am using Python's urllib2 and BeautifulSoup. My question is: is there an easy way of getting rid of

3 Answers

暖寄归人 (original poster)

2021-01-14 10:14

These seem to work on Beautiful Soup tag nodes. Each function modifies parentNode in place, so the matching nodes are removed from the tree, and the removed nodes are returned to the caller as a list.

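from bs4 import element

# Hypothetical container class: the post shows only the static methods.
class TagSeparator:
    @staticmethod
    def seperateCommentTags(parentNode):
        # Collect first, then extract, so the tree is not changed while iterating.
        commentTags = []
        for descendant in parentNode.descendants:
            if isinstance(descendant, element.Comment):
                commentTags.append(descendant)
        for commentTag in commentTags:
            commentTag.extract()
        return commentTags

    @staticmethod
    def seperateScriptTags(parentNode):
        # find_all returns a list, so extracting inside the loop is safe.
        scripttags = parentNode.find_all('script')
        scripts = []
        for scripttag in scripttags:
            script = scripttag.extract()
            if script is not None:
                scripts.append(script)
        return scripts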
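
For what it's worth, here is a minimal, self-contained sketch of how the same extract-based approach could be used end to end to get plain text out of a page. It assumes the bs4 package (Beautiful Soup 4) with the built-in "html.parser" backend. SAMPLE_HTML and strip_comments_and_scripts are made-up names for illustration and are not part of the code above; in a real crawl the HTML string would come from the urllib2 response instead.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Hypothetical sample page for illustration; not real Wikipedia markup.
SAMPLE_HTML = """
<html><body>
<!-- navigation comment -->
<p>Python is a <b>programming</b> language.</p>
<script>var tracked = true;</script>
</body></html>
"""


def strip_comments_and_scripts(html):
    """Return (visible_text, removed_comments, removed_scripts)."""
    soup = BeautifulSoup(html, "html.parser")

    # Collect the comment nodes first, then extract them, so the tree
    # is not modified while it is being searched.
    comments = soup.find_all(string=lambda s: isinstance(s, Comment))
    for comment in comments:
        comment.extract()

    # extract() detaches each <script> tag and returns it.
    scripts = [tag.extract() for tag in soup.find_all("script")]

    # get_text() keeps only the text nodes; split/join collapses whitespace.
    text = " ".join(soup.get_text().split())
    return text, comments, scripts


text, comments, scripts = strip_comments_and_scripts(SAMPLE_HTML)
print(text)
```

If the removed nodes are not needed afterwards, calling decompose() on each tag instead of extract() is another option, since it destroys the node rather than returning it.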
