Question
I'm trying to write a regular expression for my HTML parser.
I want to match an HTML tag with a given attribute (e.g. <div> with class="tab news selected") that contains one or more <a href> tags. The regexp should match the entire tag (from <div> to </div>). I always seem to get "memory exhausted" errors; my program probably takes every tag it can find as a matching one.
I'm using the Boost regex libraries.
Answer 1:
You may also find these questions helpful:
Can you provide some examples of why it is hard to parse XML and HTML with a regex?
Can you provide an example of parsing HTML with your favorite parser?
Answer 2:
You should probably look at this question about regexps and HTML. The gist is that regular expressions are far from an ideal way to parse HTML.
Answer 3:
As others have said, don't use regexes if at all possible. If your code is actually XHTML (i.e. it is also well-formed XML), I can recommend both the Xerces and Expat XML parsers, which will do a much better job for you than regexes.
Answer 4:
Maybe regexps aren't the best solution, but I'm already using about five different libraries, and Boost does fine when it comes to locating <a href> tags and keywords.
I'm using these regexps:
/<a[^\n]*/searched attribute/[^\n]*>[^\n]*</a>/
for locating <a href> tags, and:
/<a[^\n]*href[^\n]*>/searched keyword/</a>/
for locating links.
(BTW, can it be done better? I suck at regex ;))
What I need now is to locate tags containing <a href>s, and I think regexps will do all right. Maybe I'll need to write my own parsing function, as piotr said.
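On the "can it be done better?" point, one possible tightening is to replace the greedy [^\n]* with [^>]*, so a match can never run past the end of the tag it started in. The sketch below is only an illustration: it uses std::regex rather than Boost.Regex to stay self-contained, the function name find_hrefs is made up for the example, and it assumes double-quoted href values with no ">" inside attribute values.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns the href value of every <a ... href="..."> tag.
// Assumes double-quoted href values; single-quoted or unquoted ones are skipped.
std::vector<std::string> find_hrefs(const std::string& html) {
    // [^>]* stops at the end of the opening tag, unlike the greedy [^\n]*.
    static const std::regex link_re(
        R"re(<a\b[^>]*\bhref\s*=\s*"([^"]*)"[^>]*>)re", std::regex::icase);

    std::vector<std::string> hrefs;
    for (std::sregex_iterator it(html.begin(), html.end(), link_re), end;
         it != end; ++it) {
        hrefs.push_back((*it)[1].str());  // capture group 1 is the href value
    }
    return hrefs;
}
```

Bounding each piece of the pattern by the tag's own ">" also limits backtracking, which may help with the "memory exhausted" errors, though that is a guess without seeing the actual input.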
Answer 5:
Do as flex does: match <div> with a case-insensitive match, put your parser in a "div matched" state, keep processing input until </div>, then reset the state.
This takes two regexps and a state variable.
Valid characters in SGML tag names are [A-Za-z_:]
So: /<[A-Za-z_:]+>/ matches a tag.
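The two-regexps-plus-state idea could be sketched roughly as follows. This is a minimal illustration, not a full parser: it uses std::regex in place of Boost.Regex to stay self-contained, the function name find_div_blocks is made up for the example, and nested <div> elements are not tracked (the first </div> closes the block).

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns every <div ...>...</div> span found in `html`.
// Limitation: nesting is ignored, so the first </div> ends the current block.
std::vector<std::string> find_div_blocks(const std::string& html) {
    static const std::regex open_re(R"(<div\b[^>]*>)", std::regex::icase);
    static const std::regex close_re(R"(</div\s*>)", std::regex::icase);

    std::vector<std::string> blocks;
    bool in_div = false;                  // the state variable
    auto pos = html.cbegin();             // where the next search starts
    auto block_start = html.cbegin();     // start of the current <div>
    std::smatch m;

    while (pos != html.cend()) {
        // Outside a div we look for an opening tag, inside for a closing one.
        const std::regex& re = in_div ? close_re : open_re;
        if (!std::regex_search(pos, html.cend(), m, re)) break;

        if (!in_div) {
            block_start = m[0].first;     // remember where the <div> began
            in_div = true;
        } else {
            blocks.emplace_back(block_start, m[0].second);  // include </div>
            in_div = false;               // reset state
        }
        pos = m[0].second;                // continue after the matched tag
    }
    return blocks;
}
```

Each returned block can then be checked separately, for example for the class attribute or for <a href> tags, which keeps every individual pattern short. Handling nested divs would need a depth counter instead of a boolean.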
Source: https://stackoverflow.com/questions/792679/how-to-write-a-regular-expression-for-html-parsing