Strip all HTML tags except links

小蘑菇 2020-11-29 03:29

I am trying to write a regular expression to strip all HTML with the exception of links (the <a> and </a> tags respectively). It does n

6 answers
  • 2020-11-29 03:54

    How about

    <[^a](.|\n)+?>
    

    ?
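
A quick check of this pattern, sketched in Python on a made-up sample string (the sample and the variable names are illustrative, not from the question), suggests a few problems. `[^a]` also accepts the `/` of `</a>`, so closing link tags are removed, and it rejects any tag whose name merely starts with `a`, such as `<abbr>`. Because `(.|\n)+?` must consume at least one character, a one-letter tag like `<p>` also runs on to the following `>`.

```python
import re

# Pattern from the suggestion above, unchanged.
pattern = re.compile(r"<[^a](.|\n)+?>")

# Hypothetical input: one link, one <abbr>, wrapped in a <p>.
sample = '<p>See <a href="x">link</a> and <abbr>HTML</abbr></p>'

# "<p>" swallows everything up to the end of the opening <a ...> tag,
# "</a>" is stripped, and "<abbr>" survives because its name starts with "a".
result = pattern.sub("", sample)
print(result)  # link and <abbr>HTML
```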

  • 2020-11-29 03:56

    In general there are problems with this approach. Regexes are best for 'flat' text matches; nested data pushes regex engines into areas for which they are not designed. General HTML parsing needs a parser, not a regex engine (Google for the difference between regular and context-free languages if you want the full technical details).

    It is easy to strip out all tags by replacing /</ and />/ with the empty string or their entity equivalents, but selectively filtering HTML using regexes will be vulnerable to a wide range of accidental or malicious inputs that break things.
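
As a rough illustration of the parser route, here is a minimal sketch using `html.parser` from the Python standard library. The names `LinkOnlyFilter` and `keep_only_links` are made up for this example, and keeping only the `href` attribute is an assumption about what "links" should mean. It is not a complete sanitizer.

```python
from html import escape
from html.parser import HTMLParser


class LinkOnlyFilter(HTMLParser):
    """Re-emit text and <a> tags; drop every other tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # Assumption: keep only href; other attributes are dropped.
            href = dict(attrs).get("href")
            self.parts.append(f'<a href="{escape(href)}">' if href else "<a>")

    def handle_endtag(self, tag):
        if tag == "a":
            self.parts.append("</a>")

    def handle_data(self, data):
        # Text arrives with entities decoded, so re-escape it for output.
        self.parts.append(escape(data, quote=False))


def keep_only_links(html: str) -> str:
    parser = LinkOnlyFilter()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


print(keep_only_links('<p>See <a href="https://example.com" onclick="x()">this <b>page</b></a>.</p>'))
```

Note that this sketch does not check the `href` value itself (for example `javascript:` URLs), and the text inside `<script>` or `<style>` elements would come through as plain text, so untrusted input would need further handling.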

  • 2020-11-29 04:05

    I keep going on about it, but there's no way I can recommend regexr too often. It's fantastic for testing this type of thing.

  • 2020-11-29 04:10

    Here you go:

    {<(?!i|b|h[1-6]|/i|/b|/h[1-6][\s|>|/])[^>]*>}
    
  • 2020-11-29 04:15
    <(?!\/?a(?=>|\s.*>))\/?.*?>
    

    Try this. I had something similar for p tags and it worked there, so I don't see why it wouldn't work here. The negative lookahead rejects any tag whose name is a (optionally prefixed with /) when, per the nested positive lookahead, that a is followed either by > or by whitespace, further characters, and then >. Every other tag is then matched up to the next > character. Put this in a substitution:

    s/<(?!\/?a(?=>|\s.*>))\/?.*?>//g;
    

    This should leave only the opening and closing a tags.
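
For anyone who wants to try the same pattern outside Perl, here is a small Python sketch. The function name `strip_non_link_tags` and the sample string are invented for the example. The `/` needs no escaping in a Python raw string, and as in Perl without the `/s` modifier, `.` does not match a newline, so tags that span several lines would not be caught.

```python
import re

# Same pattern as in the substitution above, written as a Python raw string.
NON_LINK_TAG = re.compile(r"<(?!/?a(?=>|\s.*>))/?.*?>")


def strip_non_link_tags(html: str) -> str:
    """Remove every tag except opening and closing <a> tags."""
    return NON_LINK_TAG.sub("", html)


sample = '<p>See <a href="https://example.com">this <b>page</b></a>.</p>'
print(strip_non_link_tags(sample))
# See <a href="https://example.com">this page</a>.
```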

  • 2020-11-29 04:15

    PHP's strip_tags() does this.

    Here, I am keeping the <a>, <p>, <font>, <b>, <i>, and <sup> tags and outputting a tidied version:

    cat input.htm | tr -d '\n' | php -r '$input=fgets(STDIN); echo strip_tags($input,"<a><p><font><b><i><sup>");' | tidy -i -wrap 0 -o output.htm
    