Filter out HTML tags and resolve entities in Python

Because regular expressions scare me, I'm trying to find a way to remove all HTML tags and resolve HTML entities from a string in Python.

8 Answers

情深已故

2020-12-03 00:38

How about parsing the HTML and extracting the text with the help of a parser?

I'd try something like the approach the author describes in chapter 8.3 of the Dive Into Python book.
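
For illustration, here is a minimal sketch of the parser idea using the standard library's `html.parser` module in Python 3. It is not the code from the book; the names `TextExtractor` and `strip_tags` are made up for this example. The parser hands every run of text to `handle_data` with entities already decoded, so collecting those runs drops the tags and resolves the entities in one pass.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text between tags; the tags themselves are ignored."""

    def __init__(self):
        # convert_charrefs=True makes the parser decode entities such as
        # &amp; before the text reaches handle_data.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(html_text):
    """Hypothetical helper: return html_text without tags, entities resolved."""
    parser = TextExtractor()
    parser.feed(html_text)
    parser.close()  # flush any text the parser is still buffering
    return "".join(parser.parts)


print(strip_tags("<p>Fish &amp; <b>chips</b> &lt;3</p>"))  # Fish & chips <3
```

Note that this sketch also keeps the contents of `<script>` and `<style>` elements as text, so real pages may need extra filtering.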

有刺的猬

2020-12-03 00:42
You might need something more complicated than a regular expression. Web pages often have angle brackets that aren't part of a tag, like this:
```
 <div>5 < 7</div>
```
Stripping the tags with a regex returns the string "5 ", because the pattern treats
```
 < 7</div>
```
as a single tag and strips it out.
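
A short sketch that reproduces the problem, assuming the common naive pattern `<[^>]*>` (the variable names are only for illustration):

```python
import re

# Naive "tag" pattern: a '<', anything that is not '>', then a '>'.
naive_tag = re.compile(r"<[^>]*>")

result = naive_tag.sub("", "<div>5 < 7</div>")
print(repr(result))  # '5 '
```

The second match starts at the bare `<` and runs to the `>` that closes `</div>`, so `< 7` disappears along with the real tag.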

I suggest looking for existing code that does this for you. A quick search turned up http://zesty.ca/python/scrape.html, which can also resolve HTML entities.