Using Beautiful Soup Python module to replace tags with plain text

花落未央 2021-01-07 05:14

I am using Beautiful Soup to extract 'content' from web pages. I know some people have asked this question before and they were all pointed to Beautiful Soup and that's h

2 Answers
  • 2021-01-07 05:21

    I wanted to flatten tags in a document, that is, pull each tag's entire content up into its parent node in place. Specifically, I wanted to keep the content of a p tag with all the sub-paragraphs, lists, div and span elements, etc. inside it, but get rid of the style and font tags and some horrible Word-to-HTML generator remnants. I found this rather complicated to do with BeautifulSoup itself, since extract() also removes the content and replaceWith() unfortunately doesn't accept None as an argument. After some wild recursion experiments, I finally decided to use regular expressions, either before or after processing the document with BeautifulSoup, with the following function:

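    import re

    def flatten_tags(s, tags):
        # tags is a single tag name or a list of tag names
        names = tags if isinstance(tags, str) else "|".join(tags)
        # [^\"]* keeps a quoted value from running on to a later quote on the same line
        pattern = re.compile(r"<(( )*|/?)(%s)(([^<>]*=\"[^\"]*\")*|[^<>]*)/?>" % names)
        return pattern.sub("", s)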
    

    The tags argument is either a single tag or a list of tags to be flattened.
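For illustration, here is a usage sketch under Python 3. It is a stricter variant rather than the exact pattern above: the `\b` word boundary and `re.escape` are my additions (so that flattening `b` leaves `<br>` and `<body>` alone), and the sample string is made up.

```python
import re

def flatten_tags(html, tags):
    """Strip the opening and closing markup of the given tags, keeping their content."""
    if isinstance(tags, str):
        tags = [tags]
    names = "|".join(re.escape(t) for t in tags)
    # \b stops "b" from also matching <br> or <body>;
    # [^<>]* swallows any attributes up to the closing bracket.
    pattern = re.compile(r"<\s*/?\s*(?:%s)\b[^<>]*>" % names, re.IGNORECASE)
    return pattern.sub("", html)

# Hypothetical sample input, not taken from the question.
sample = ('<p>Some <font color="red">red <b>bold</b></font> text<br/>'
          'and a <span style="x">span</span>.</p>')
flattened = flatten_tags(sample, ["font", "span", "b"])
print(flattened)  # <p>Some red bold text<br/>and a span.</p>
```

Like any regex approach to HTML, this will still trip over a `>` inside an attribute value or over markup inside comments, so it is best kept for cleanup passes on input you have looked at.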

  • 2021-01-07 05:35

    An approach that works for your specific example is:

    from bs4 import BeautifulSoup
    
    ht = '''
    <div id="abc">
        some long text goes <a href="/"> here </a> and hopefully it 
        will get picked up by the parser as content
    </div>
    '''
    soup = BeautifulSoup(ht, 'html.parser')
    
    # fold each link's text into the text node just before it
    anchors = soup.find_all('a')
    for a in anchors:
        a.previous_sibling.replace_with(a.previous_sibling + a.string)
    
    # keep only the text nodes longer than 20 characters
    results = soup.find_all(string=lambda x: len(x) > 20)
    
    print(results)
    

    which emits

    $ python bs.py
    ['\n    some long text goes  here ', ' and hopefully it \n    will get picked up by the parser as content\n']
    

    Of course, you'll probably need to take a bit more care: what if there's no a.string, or if the previous sibling is None? You'll need suitable if statements to handle such corner cases. But I hope this general idea can help you. (In fact, you may also want to merge the next sibling if it's a string. I'm not sure how that plays with your len(x) > 20 heuristic, but suppose you have two 9-character strings with an <a> containing a 5-character string in the middle: perhaps you'd want to pick up the lot as a single 23-character string? I can't tell, because I don't understand the motivation for your heuristic.)

    I imagine that besides <a> tags you'll also want to remove others, such as <b> or <strong>, maybe <p> and/or <br>, etc...? I guess this, too, depends on what the actual idea behind your heuristics is!
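The corner cases and the next-sibling merge can be sketched with Beautiful Soup 4's `unwrap()`, which replaces a tag with its own children, and `smooth()`, which joins adjacent strings. This is only a sketch under assumptions: it needs the `bs4` package (4.8 or later, for `smooth()`), and the `INLINE_TAGS` list and the `long_text_chunks` name are made up for illustration, not taken from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical choice of tags to treat as plain text; adjust to your heuristic.
INLINE_TAGS = ["a", "b", "strong"]

def long_text_chunks(html, min_len=20):
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all(INLINE_TAGS):
        # unwrap() leaves the tag's children where the tag was, so an empty
        # tag or a tag with no neighbouring string needs no special casing.
        tag.unwrap()
    # Join the now-adjacent strings into single text nodes.
    soup.smooth()
    return [str(s) for s in soup.find_all(string=lambda s: len(s) > min_len)]

# Two 9-character strings around a 5-character link, plus a short paragraph.
html = '<div>nine char<a href="/">12345</a>nine char</div><p><b></b>short</p>'
chunks = long_text_chunks(html)
print(chunks)  # ['nine char12345nine char']
```

Because the link is unwrapped before the length filter runs, the three pieces are measured as one 23-character string, which is the behaviour speculated about above. Whether that is what you want still depends on the motivation for the heuristic.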
