Processing an HTML file using Python

半阙折子戏 2021-01-26 08:10

I want to remove all the tags from an HTML file. For that I used Python's re module. For example, consider the line

Hello World!

.I want to retain
5 Answers
  • 2021-01-26 08:56

    Parse the HTML using BeautifulSoup, then only retrieve the text.
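As a minimal sketch of this suggestion, assuming the third-party `beautifulsoup4` package is installed: the sample string `html` below is a hypothetical input, not taken from the question. `get_text()` returns only the text nodes, so no tag-matching pattern is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample input (an assumption for illustration).
html = "<h1>Hello World!</h1>"

# Parse with the standard-library backend, then keep only the text nodes.
soup = BeautifulSoup(html, "html.parser")
text = soup.get_text()
print(text)
```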

  • 2021-01-26 09:02

    Make it non-greedy: http://docs.python.org/release/2.6/howto/regex.html#greedy-versus-non-greedy

    Off-topic: the regular-expression approach is error-prone. It cannot handle cases where angle brackets do not represent tags. I recommend http://lxml.de/
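To illustrate the failure mode described here, the sketch below runs a non-greedy regex and a real parser on the same input. It uses the standard library's `html.parser` as a stand-in for lxml so it runs without extra packages; the `TextExtractor` class, the `strip_tags` helper, and the sample string are all hypothetical names chosen for this example.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only the text nodes of a document (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for text between tags, never for tag or attribute content.
        self.chunks.append(data)


def strip_tags(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


# Hypothetical input: the '>' inside the attribute value is not a tag end.
sample = '<a title="a > b" href="#">link</a>'

# The regex stops at the first '>' and leaves attribute debris behind.
regex_result = re.sub(r"<.*?>", "", sample)

# The parser understands quoted attribute values and keeps only the text.
parser_result = strip_tags(sample)

print(repr(regex_result))
print(repr(parser_result))
```

The regex has no notion of quoting, so it treats the `>` inside the attribute value as the end of the tag; a parser tracks that context.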

  • 2021-01-26 09:09

    Beautiful Soup is great for parsing HTML!

    You might not need it now, but it's worth learning. It will help you in the future too.

  • 2021-01-26 09:11

    You can make the match non-greedy: '<.*?>'

    You also need to be careful: HTML is a crafty beast and can thwart your regexes.
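The difference between the greedy and non-greedy patterns can be seen on a line with more than one tag. This is a small sketch; the sample `line` is an assumed input, not the asker's actual data.

```python
import re

# Hypothetical sample line with two elements (an assumption for illustration).
line = "<h1>Hello</h1> <p>World!</p>"

# Greedy: '.*' runs from the first '<' to the LAST '>', swallowing the text.
greedy = re.sub(r"<.*>", "", line)

# Non-greedy: '.*?' stops at the nearest '>', so each tag is removed separately.
non_greedy = re.sub(r"<.*?>", "", line)

print(repr(greedy))
print(repr(non_greedy))
```

Here the greedy pattern deletes the whole line, while the non-greedy one leaves the text between the tags. It still only works on simple markup.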

  • 2021-01-26 09:11

    Use a parser, either lxml or BeautifulSoup:

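    import lxml.html
    # mystring holds the HTML to strip; this sample value is an assumption
    mystring = "<h1>Hello World!</h1>"
    print(lxml.html.fromstring(mystring).text_content())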
    

    Related questions:

    Using regular expressions to parse HTML: why not?

    Why it's not possible to use regex to parse HTML/XML: a formal explanation in layman's terms
