How can I clean HTML tags out of a ColdFusion string?

南方客 2021-02-14 00:16

I am looking for a quick way to parse HTML tags out of a ColdFusion string. We are pulling in an RSS feed that could potentially have anything in it. We are then doing some man

6 answers

面向向阳花 (original poster)

2021-02-14 00:51
Disclaimer: I am a fierce advocate of using a proper parser (instead of regex) to parse HTML. However, this question isn't about parsing HTML, but about destroying it. For all tasks that go beyond that, use a parser.

I think your regex is good. As long as the only goal is to remove all HTML tags from the input, using a regex like yours is safe.

Anything else would probably be more hassle than it's worth, but you could write a small function that loops through the string character by character once and removes everything inside tag brackets, e.g.:

- switch on an "inTag" flag as soon as you encounter a "<" character,
- switch it off as soon as you encounter ">",
- copy characters to the output string as long as the flag is off,
- for performance, use a Java StringBuilder object instead of string concatenation.

For a high-demand part of your app, this may be faster than the regex. But the regex is clean and probably fast enough.
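
The single-pass approach described above could look roughly like the following. This is a minimal sketch in plain Java, not ColdFusion code: the class name `TagScanner` and the method name `stripTags` are made up for illustration, and it assumes that a stray ">" outside a tag should be kept as ordinary text.

```java
final class TagScanner {
    /** Removes everything between '<' and '>' in a single pass over the input. */
    static String stripTags(String input) {
        // StringBuilder avoids creating a new string on every appended character.
        StringBuilder out = new StringBuilder(input.length());
        boolean inTag = false;
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '<') {
                inTag = true;              // switch the flag on at "<"
            } else if (c == '>' && inTag) {
                inTag = false;             // switch it off at the closing ">"
            } else if (!inTag) {
                out.append(c);             // copy only while outside a tag
            }
        }
        // An unclosed tag at the end of the input is dropped, because the flag is still on.
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(stripTags("<p>Hello <b>world</b></p>")); // prints "Hello world"
    }
}
```

Since ColdFusion runs on the JVM, the same loop can presumably be written in CFML around a `java.lang.StringBuilder` instance; whether it actually beats the regex is something to measure rather than assume.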

Maybe this modified regex has some advantages for you:
```
<[^>]*(?:>|$)
```
- catches unclosed tags at the end of the string
- [^>]* is better than (.|\n): a negated character class already matches newlines, so no alternation is needed

REReplaceNoCase() is unnecessary when the pattern contains no letters; plain REReplace() does the same job, and case-insensitive matching is slower than case-sensitive matching.
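
To see what the modified pattern does, here is a small sketch that applies it with Java's `java.util.regex`. This is an assumption made for the sake of a runnable example: ColdFusion's `REReplace()` uses its own regex engine, so behavior there should be checked separately. The class name `TagRegex` and the method name `stripTags` are hypothetical.

```java
import java.util.regex.Pattern;

final class TagRegex {
    // The pattern from the answer: "<", any run of characters other than ">",
    // then either ">" or the end of the input. No case-insensitive flag is set,
    // because the pattern contains no letters.
    private static final Pattern TAG = Pattern.compile("<[^>]*(?:>|$)");

    static String stripTags(String input) {
        return TAG.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // Upper-case tags are removed without any case-insensitive matching.
        System.out.println(stripTags("<P CLASS=\"x\">Hi</P> there"));   // prints "Hi there"
        // A tag cut off at the end of the string is removed as well.
        System.out.println(stripTags("Broken <a href=\"http://exa"));   // prints "Broken "
    }
}
```

In CFML the corresponding call would be along the lines of `REReplace(str, "<[^>]*(?:>|$)", "", "all")`, where `str` stands for whatever variable holds the feed text.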