How to write a regular expression for html parsing?

Submitted by ↘锁芯ラ on 2019-12-29 08:12:09

Question


I'm trying to write a regular expression for my HTML parser.

I want to match an HTML tag with a given attribute (e.g. <div> with class="tab news selected") that contains one or more <a href> tags. The regexp should match the entire element (from <div> to </div>). I keep getting "memory exhausted" errors; my program probably treats every tag it finds as a match.

I'm using the Boost.Regex library.


Answer 1:


You may also find these questions helpful:

Can you provide some examples of why it is hard to parse XML and HTML with a regex?

Can you provide an example of parsing HTML with your favorite parser?




Answer 2:


You should probably look at this question about regexps and HTML. The gist is that regular expressions are far from an ideal tool for parsing HTML.




Answer 3:


As others have said, don't use regexes if at all possible. If your code is actually XHTML (i.e. it is also well-formed XML), I can recommend both the Xerces and Expat XML parsers, which will do a much better job for you than regexes.




Answer 4:


Maybe regexps aren't the best solution, but I'm already using something like five different libraries, and Boost does fine when it comes to locating <a href> tags and keywords.

I'm using these regexps:

/<a[^\n]*/searched attribute/[^\n]*>[^\n]*</a>/ for locating <a href> tags, and:

/<a[^\n]*href[^\n]*>/searched keyword/</a>/ for locating links

(BTW can it be done better? - I suck at regex ;))
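
On the "can it be done better" question, here is a minimal sketch of one tighter pattern. It uses std::regex from the C++ standard library as a stand-in for Boost.Regex (an assumption made so the example has no external dependencies; the pattern syntax should carry over, but that is not verified against Boost here). The function name find_links is hypothetical, and the pattern assumes double-quoted href values and no nested <a> elements. The main changes are [^>]* instead of [^\n]*, so a match cannot run past the end of the tag, and a non-greedy [\s\S]*? for the link text, so it stops at the first </a>.

```cpp
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: returns (href, link text) for each
// <a ... href="...">...</a> found in `html`.
// Assumes double-quoted href values and no nested <a> elements.
std::vector<std::pair<std::string, std::string>> find_links(const std::string& html) {
    // Group 1: the href value. Group 2: everything up to the first </a>.
    static const std::regex link_re(
        R"re(<a\b[^>]*\bhref\s*=\s*"([^"]*)"[^>]*>([\s\S]*?)</a\s*>)re",
        std::regex::icase);

    std::vector<std::pair<std::string, std::string>> links;
    for (std::sregex_iterator it(html.begin(), html.end(), link_re), end; it != end; ++it) {
        links.emplace_back((*it)[1].str(), (*it)[2].str());
    }
    return links;
}
```

Filtering for a keyword is then an ordinary string search on the second element of each pair, for example link.second.find("keyword") != std::string::npos.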

What I need now is to locate tags containing <a href>s, and I think regexps will do all right; maybe I'll need to write my own parsing function, as piotr said.




Answer 5:


Do as flex does: match <div> case-insensitively, put your parser in a "div matched" state, keep processing input until </div>, then reset the state.

This takes two regexps and a state variable.

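The valid characters for SGML tag names are [A-Za-z_:] (digits, hyphens, and periods are also allowed after the first character, as in <h1>).

So: /<[A-Za-z_:]+>/ matches an opening tag that has no attributes and no digits in its name.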
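
The two-regex, one-state-variable approach can be sketched as follows. This is an illustration under stated assumptions, not a full parser: it uses std::regex in place of Boost.Regex or flex, the name extract_div_blocks is hypothetical, and nested <div> elements are not tracked, so the first </div> after an opening tag closes the block.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collects each <div ...>...</div> block using
// two regexes and one state flag.
// Limitation: nested <div>s are not tracked; the first </div> ends the block.
std::vector<std::string> extract_div_blocks(const std::string& html) {
    static const std::regex open_re(R"(<div\b[^>]*>)", std::regex::icase);
    static const std::regex close_re(R"(</div\s*>)", std::regex::icase);

    std::vector<std::string> blocks;
    bool in_div = false;                  // the state variable
    auto pos = html.cbegin();             // where the next search starts
    auto block_start = html.cbegin();     // start of the current <div ...>
    std::smatch m;

    // Outside a div we look for an opening tag; inside one, for a closing tag.
    while (std::regex_search(pos, html.cend(), m, in_div ? close_re : open_re)) {
        if (!in_div) {
            block_start = m[0].first;     // remember where <div ...> begins
            in_div = true;
        } else {
            blocks.emplace_back(block_start, m[0].second);  // through </div>
            in_div = false;               // reset the state
        }
        pos = m[0].second;                // continue after the matched tag
    }
    return blocks;
}
```

Each returned block can then be checked separately for the wanted class attribute or for <a href> tags, which keeps every individual pattern small.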



Source: https://stackoverflow.com/questions/792679/how-to-write-a-regular-expression-for-html-parsing
