I am trying to crawl Wikipedia to get some data for text mining. I am using Python's urllib2 and BeautifulSoup. My question is: is there an easy way of getting rid of
The following two functions seem to work on Beautiful Soup tag nodes. The parentNode is modified in place, so the relevant tags are removed from it, and the removed tags are returned to the caller as a list.
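from bs4 import element

# Both functions are static methods of a helper class (class statement not shown).

@staticmethod
def seperateCommentTags(parentNode):
    # Collect the comments first, then extract them, so the tree is not
    # modified while its descendants are being iterated.
    commentTags = []
    for descendant in parentNode.descendants:
        if isinstance(descendant, element.Comment):
            commentTags.append(descendant)
    for commentTag in commentTags:
        commentTag.extract()
    return commentTags

@staticmethod
def seperateScriptTags(parentNode):
    # find_all returns a list, so extracting inside the loop is safe.
    scripttags = parentNode.find_all('script')
    scripts = []
    for scripttag in scripttags:
        script = scripttag.extract()
        if script is not None:
            scripts.append(script)
    return scripts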
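
For illustration, here is a minimal, self-contained sketch of the same idea applied to a small HTML string. It assumes the `bs4` package is installed and uses the built-in `html.parser`. The function name `strip_comments_and_scripts` and the sample markup are made up for this example and are not part of the code above.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment


def strip_comments_and_scripts(parent_node):
    """Hypothetical helper: remove comments and <script> tags in place.

    Returns a (comments, scripts) pair holding the removed nodes.
    """
    # Collect comments before extracting so the tree is not mutated mid-iteration.
    comments = [node for node in parent_node.descendants
                if isinstance(node, Comment)]
    for comment in comments:
        comment.extract()
    # find_all returns a list, so extracting while looping over it is safe.
    scripts = [tag.extract() for tag in parent_node.find_all("script")]
    return comments, scripts


# Made-up sample markup, standing in for a fetched page.
html = ("<html><body><!-- nav --><p>Hello <b>world</b></p>"
        "<script>var x = 1;</script></body></html>")
soup = BeautifulSoup(html, "html.parser")
comments, scripts = strip_comments_and_scripts(soup)
text = soup.get_text()
```

After the call, `soup` no longer contains the comment or the `<script>` element, while `comments` and `scripts` hold the removed nodes in case the caller still needs them. If the removed nodes are not needed, Beautiful Soup's `decompose()` can be used in place of `extract()` to discard them.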