Processing an HTML file using Python

半阙折子戏

I want to remove all the tags in an HTML file. For that I used Python's re module. For example, consider the line

`Hello World!`

.I want to retain

5 Answers

礼貌的吻别

2021-01-26 08:56

Parse the HTML using BeautifulSoup, then only retrieve the text.
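
A minimal sketch of that suggestion, assuming the `beautifulsoup4` package is installed. The `<h1>` tag in the sample string is an assumption, since the tags in the question's example line did not survive.

```python
from bs4 import BeautifulSoup

# Hypothetical input: the <h1> tag is assumed, not taken from the question.
html = "<h1>Hello World!</h1>"

# Parse with the standard library's "html.parser" backend (no extra parser needed).
soup = BeautifulSoup(html, "html.parser")

# get_text() returns only the text nodes, with all tags dropped.
text = soup.get_text()
print(text)
```

Passing the parser name explicitly avoids the warning BeautifulSoup emits when it has to guess which backend to use.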

一生所求

2021-01-26 09:02

Make it non-greedy: http://docs.python.org/release/2.6/howto/regex.html#greedy-versus-non-greedy

Off-topic: the regular-expression approach is error-prone. It cannot handle cases where angle brackets do not represent tags. I recommend http://lxml.de/
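
One way to see the problem, as a small sketch: a `>` inside an attribute value is legal HTML, but a tag-stripping regex treats it as the end of the tag. The sample markup and the names `TAG_RE`, `snippet`, and `stripped` are made up for illustration.

```python
import re

# Non-greedy tag pattern, as suggested above.
TAG_RE = re.compile(r"<.*?>")

# Hypothetical input: the ">" inside the title attribute is not a tag delimiter.
snippet = '<a title="x > y">link</a>'

# The regex stops at the first ">", so part of the attribute leaks into the text.
stripped = TAG_RE.sub("", snippet)
print(repr(stripped))  # ' y">link'
```

A real parser would return just `link` here, which is why the recommendation is to parse rather than pattern-match.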

情话喂你

2021-01-26 09:09

Beautiful Soup is great for parsing HTML!

You might not require it now, but it's worth learning to use. It will help you in the future too.

我寻月下人不归

2021-01-26 09:11

You can make the match non-greedy: '<.*?>'
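
A quick sketch of the difference. The greedy pattern `'<.*>'` and the `<h1>` tag are assumptions for illustration; the asker's actual pattern and markup are not shown.

```python
import re

# Hypothetical input line; the <h1> tag is assumed.
line = "<h1>Hello World!</h1>"

# Greedy: ".*" runs from the first "<" to the LAST ">", swallowing everything.
greedy = re.sub(r"<.*>", "", line)

# Non-greedy: ".*?" stops at the FIRST ">", so each tag is matched separately.
non_greedy = re.sub(r"<.*?>", "", line)

print(repr(greedy))      # ''
print(repr(non_greedy))  # 'Hello World!'
```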

You also need to be careful: HTML is a crafty beast and can thwart your regexes.

天命终不由人

2021-01-26 09:11
Use a parser, either lxml or BeautifulSoup:
```
import lxml.html

# mystring is assumed to hold the HTML source as a string
print(lxml.html.fromstring(mystring).text_content())
```
Related questions:

Using regular expressions to parse HTML: why not?

Why it's not possible to use regex to parse HTML/XML: a formal explanation in layman's terms