How can I clean HTML tags out of a ColdFusion string?

南方客 2021-02-14 00:16

I am looking for a quick way to parse HTML tags out of a ColdFusion string. We are pulling in an RSS feed that could potentially have anything in it. We are then doing some man

6 Answers
  •  面向向阳花
    2021-02-14 00:51

    Disclaimer: I am a fierce advocate of using a proper parser (instead of regex) to parse HTML. However, this question isn't about parsing HTML, but about destroying it. For any task that goes beyond that, use a parser.


    I think your regex is good. As long as the goal is nothing more than removing all HTML tags from the input, using a regex like yours is safe.

    Anything else would probably be more hassle than it's worth, but you could write a small function that loops through the string char-by-char once and removes everything that's within tag brackets, e.g.:

    • switch on an "inTag" flag as soon as you encounter a "<" character,
    • switch it off as soon as you encounter ">"
    • copy characters to the output string as long as the flag is off
    • for performance, use a StringBuilder Java object instead of string concatenation

    For a high-demand part of your app, this may be faster than the regex. But the regex is clean and probably fast enough.
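
The char-by-char approach described above could be sketched roughly as follows. This is only a minimal illustration: the function name stripTags and its variable names are made up for this example, and it assumes a ColdFusion version that supports script-style function declarations.

```cfm
<cfscript>
// Hypothetical helper; the name stripTags is not from the original post.
// Copies every character that sits outside <...> into a Java StringBuilder.
function stripTags(required string input) {
    var out = createObject("java", "java.lang.StringBuilder").init();
    var inTag = false;
    var total = len(arguments.input);
    var ch = "";

    for (var i = 1; i <= total; i++) {
        ch = mid(arguments.input, i, 1);
        if (ch == "<") {
            inTag = true;       // entering a tag: stop copying
        } else if (ch == ">" && inTag) {
            inTag = false;      // leaving a tag: resume with the next character
        } else if (!inTag) {
            // javaCast avoids ambiguity between the overloaded append() methods
            out.append(javaCast("string", ch));
        }
    }
    return out.toString();
}

writeOutput(stripTags("<p>Hello <b>world</b></p>")); // Hello world
</cfscript>
```

Note that an unclosed "<" swallows everything after it, and a stray ">" outside a tag is copied through unchanged. Whether that is acceptable depends on what the feed contains.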

    Maybe this modified regex has some advantages for you:

    <[^>]*(?:>|$)
    
    • catches unclosed tags at the end of the string
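    • [^>]* is better than (.|\n): a negated character class already matches newlines, and it does so without an alternation for every character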

    The use of REReplaceNoCase() is unnecessary when there are no actual letters in the pattern. Case-insensitive regex matching is slower than doing it case-sensitively.
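
As a rough usage sketch, the modified pattern can be passed to the case-sensitive REReplace() with the "all" scope. The variable names and the sample string below are invented for illustration only.

```cfm
<cfscript>
// Sample input: two complete tags plus a tag that is cut off at the end.
rawText = "<p>Feed item with <em>markup</em> and a cut-off <a href=";

// "all" replaces every match; without it only the first tag is removed.
cleanText = reReplace(rawText, "<[^>]*(?:>|$)", "", "all");

writeOutput(cleanText);
</cfscript>
```

Here the unclosed <a href= at the end of the string should be removed along with the complete tags, because the (?:>|$) alternative also accepts the end of the input.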
