Question
I'm working in R and I want to extract all the closed HTML tags from a PlainTextDocument. I'm using gsub with a regex:
gsub("<?!([^<]/*)>"," ",fm,perl=TRUE,ignore.case=TRUE)
But the slash '/' isn't evaluated.
I think I wasn't very clear. Here is what I need to do:
I have a text (an HTML document) and I want to keep only the tags (<> and </>). I thought using gsub would be a good idea, but maybe you have a better solution.
Answer 1:
The wording of your question is unclear, and your regex doesn't make much sense, but if you just want to match anything that looks like an HTML tag, this should do it:
"<[^<>]+>"
That will match both opening and closing tags (e.g., <tag attr="value"> and </tag>). If you want to match only self-closing tags (e.g., <tag />), this should work:
"<[^<>]+/>"
Others have suggested that the slash (/) has special meaning and needs to be escaped, but that's not true. If you were using Perl, you might use this command to do the substitution:
s/<[^<>]+\/>/ /g
But the slash itself has no special meaning; I only had to escape it because I used it as the regex delimiter. I could just as easily use a different delimiter:
s~<[^<>]+/>~ ~g
But R doesn't support regexes at the language level like Perl does; the regex and the replacement are written in the form of string literals, just like they are (for example) in Java and C#. And unlike PHP, it doesn't require you to add delimiters anyway, as in:
preg_replace("/<[^<>]+\/>/", " ", $html)
But even PHP allows you to choose your own delimiter:
preg_replace('~<[^<>]+/>~', ' ', $html)
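Back in R, the point about the slash can be checked directly. The following is a minimal sketch, assuming a made-up input string `html`; it only shows that the unescaped and escaped forms of the pattern behave the same in gsub.

```r
# Hypothetical input, not taken from the question.
html <- "a<br/>b<hr />c"

# The slash is an ordinary character in the pattern: no escaping needed.
plain <- gsub("<[^<>]+/>", " ", html, perl = TRUE)

# Escaping it is harmless but redundant; "\\/" in an R string is the regex \/.
escaped <- gsub("<[^<>]+\\/>", " ", html, perl = TRUE)

# Both calls replace the same self-closing tags.
print(identical(plain, escaped))
```

So if the original call seems to ignore the slash, the cause is probably elsewhere in the pattern rather than a missing escape.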
Before anyone calls me out on this, I know <[^<>]+> is flawed; there is in fact no such thing as a correct regex for HTML tags. This will do in many cases, but the only truly reliable way to parse HTML is with a dedicated HTML parser.
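With that caveat in mind, here is a rough sketch of how the two patterns above could be used in R to keep only the tags, which is what the question asks for. It extracts the matches with gregexpr and regmatches instead of replacing everything else with gsub. The sample string `html` and the variable names are assumptions made for illustration.

```r
# Hypothetical sample input; any single string of HTML-like text would do.
html <- '<p class="x">Hello <br /> world</p>'

# All tag-like substrings: opening, closing, and self-closing.
all_tags <- regmatches(html, gregexpr("<[^<>]+>", html, perl = TRUE))[[1]]

# Only the self-closing ones.
self_closing <- regmatches(html, gregexpr("<[^<>]+/>", html, perl = TRUE))[[1]]

# Keep only the tags, dropping the text between them.
tags_only <- paste(all_tags, collapse = "")

print(all_tags)
print(self_closing)
print(tags_only)
```

Extracting the matches avoids having to write a second regex for "everything that is not a tag". For a PlainTextDocument, the content would first need to be pulled out as a character vector.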
Answer 2:
It likely needs to be escaped: \\/
Source: https://stackoverflow.com/questions/9847333/extract-all-html-tag-closed-with-a-regex-expression