Removing HTML tags when crawling Wikipedia with Python's urllib2 and Beautiful Soup

感动是毒 2021-01-14 10:03

I am trying to crawl Wikipedia to get some data for text mining. I am using Python's urllib2 and Beautiful Soup. My question is: is there an easy way of getting rid of

3 answers
  •  暖寄归人
    2021-01-14 10:14

    These helpers seem to work on Beautiful Soup nodes. Each one modifies parentNode in place by removing the matching nodes (comments or <script> tags) and returns the removed nodes to the caller as a list.

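    from bs4 import element

    # Both helpers are static methods; place them inside a class of your own.

    @staticmethod
    def seperateCommentTags(parentNode):
        """Remove every comment node under parentNode and return them as a list."""
        # Collect first, then extract, so the tree is not modified
        # while .descendants is still being iterated.
        commentTags = []
        for descendant in parentNode.descendants:
            if isinstance(descendant, element.Comment):
                commentTags.append(descendant)
        for commentTag in commentTags:
            commentTag.extract()
        return commentTags

    @staticmethod
    def seperateScriptTags(parentNode):
        """Remove every <script> tag under parentNode and return them as a list."""
        scripts = []
        # find_all() returns a list, so extracting inside the loop is safe.
        for scripttag in parentNode.find_all('script'):
            script = scripttag.extract()
            if script is not None:
                scripts.append(script)
        return scripts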
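
For the text-mining use case in the question, the same idea can be combined with `get_text()` to end up with plain text. The following is only a minimal sketch, not part of the answer above: `SAMPLE_HTML` and `strip_to_text` are hypothetical names, the markup is a made-up stand-in for a fetched page, and also removing `<style>` tags is an assumption about what counts as noise.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical sample markup standing in for a fetched Wikipedia page.
SAMPLE_HTML = """
<html><body>
<!-- navigation comment -->
<script>var x = 1;</script>
<p>Python is a <b>programming</b> language.</p>
</body></html>
"""


def strip_to_text(html):
    """Remove comments and <script>/<style> tags, then return the visible text."""
    soup = BeautifulSoup(html, "html.parser")

    # find_all() returns a list, so extracting while looping is safe.
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        comment.extract()
    for tag in soup.find_all(["script", "style"]):
        tag.extract()

    # get_text() drops the remaining markup; split/join collapses whitespace.
    return " ".join(soup.get_text(separator=" ").split())


text = strip_to_text(SAMPLE_HTML)
print(text)
```

In a real crawl the HTML string would come from the HTTP response body instead of the sample constant; the cleanup steps stay the same.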
    
