Rendered HTML to plain text using Python

爱一瞬间的悲伤 2020-12-29 18:49

I'm trying to convert a chunk of HTML to plain text with BeautifulSoup. Here is an example:

Some text more text

2 Answers
  • 2020-12-29 18:58

    I ran into the same problem trying to parse rendered HTML. Basically, it seems that BeautifulSoup is not the ideal package for this. @Del's html2text solution works well.

    On a different SO question, "BeautifulSoup get_text does not strip all tags and JavaScript", @Helge mentioned using nltk. Unfortunately, nltk appears to be discontinuing this method.

    I tried both html2text and nltk.clean_html and was surprised by the timing results, so I thought they warranted an answer for posterity. Of course, the speeds depend heavily on the contents of the data.

    Answer from @Helge (nltk).

    import nltk

    %timeit nltk.clean_html(html)
    # 153 us per loop
    

    It worked really well, returning a string of the rendered HTML. nltk.clean_html was faster than even html2text (153 us versus 3.09 ms per loop in these runs), though perhaps html2text is more robust.

    Answer from @del (html2text).

    import html2text

    # html is a byte string here; decode it, dropping undecodable bytes
    betterHTML = html.decode(errors='ignore')
    %timeit html2text.html2text(betterHTML)
    # 3.09 ms per loop
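
The %timeit magic above only works inside IPython. Outside it, a comparable measurement can probably be made with the standard-library timeit module. This is a minimal sketch: the sample HTML, strip_tags, and time_converter are hypothetical names made up for illustration, and strip_tags is only a crude stand-in for a real converter.

```python
import re
import timeit

# Hypothetical sample input; the original post does not show the HTML that was timed.
html = "<p>Some text <span>more text</span></p>" * 100


def strip_tags(markup):
    """Crude stand-in converter: drop anything that looks like a tag."""
    return re.sub(r"<[^>]+>", "", markup)


def time_converter(func, data, number=1000, repeat=5):
    """Return the best per-call time in seconds over several repeated runs."""
    timer = timeit.Timer(lambda: func(data))
    return min(timer.repeat(repeat=repeat, number=number)) / number


best = time_converter(strip_tags, html)
print(f"{best * 1e6:.1f} us per loop")
```

To compare real converters, pass html2text.html2text (or any other callable that takes a string) in place of strip_tags and run both on the same input.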
    
  • 2020-12-29 19:16

    BeautifulSoup is a scraping library, so it's probably not the best choice for doing HTML rendering. If it's not essential to use BeautifulSoup, you should take a look at html2text. For example:

    import html2text

    with open("foobar.html") as f:
        html = f.read()
    print(html2text.html2text(html))
    

    This outputs:

    Some text more text even more text
    
      * list item
      * yet another list item
    
    Some other text
    
      * list item
      * yet another list item
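
If installing a third-party package is not an option, something similar can be roughed out with the standard library's html.parser. The sketch below is not how html2text works internally. TextRenderer, html_to_text, and the sample markup are hypothetical, and the tag sets are assumptions chosen to cover the example in the question.

```python
from html.parser import HTMLParser


class TextRenderer(HTMLParser):
    """Collect text, breaking lines at block tags and bulleting <li> items."""

    BLOCK_TAGS = {"p", "div", "ul", "ol", "li", "br", "h1", "h2", "h3"}
    SKIP_TAGS = {"script", "style"}  # their contents are never shown to a reader

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self.skip_depth += 1
        elif tag in self.BLOCK_TAGS:
            self.parts.append("\n")
            if tag == "li":
                self.parts.append("* ")

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1
        elif tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)

    def text(self):
        # Collapse runs of whitespace within each line and drop empty lines.
        lines = "".join(self.parts).splitlines()
        cleaned = [" ".join(line.split()) for line in lines]
        return "\n".join(line for line in cleaned if line)


def html_to_text(markup):
    renderer = TextRenderer()
    renderer.feed(markup)
    renderer.close()
    return renderer.text()


# Hypothetical markup, guessed from the rendered output shown above.
sample = (
    "<div><p>Some text <span>more text</span> even more text</p>"
    "<ul><li>list item</li><li>yet another list item</li></ul></div>"
    "<script>var x = 1;</script>"
)
print(html_to_text(sample))
```

This handles only simple markup. For links, tables, or nested lists, html2text is likely the more robust choice.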
    