When is it wise to use regular expressions with HTML? [closed]


再見小時候


10 answers

感动是毒

2020-12-10 16:09

Jeff Atwood discusses this at length in his blog posts "Programming Is Hard, Let's Go Shopping" and "Parsing HTML The Cthulhu Way".

"So, yes, generally speaking, it is a bad idea to use regular expressions when parsing HTML. We should be teaching neophyte developers that, absolutely. Even though it's an apparently neverending job. But we should also be teaching them the very real difference between parsing HTML and the simple expedience of processing a few strings. And how to tell which is the right approach for the task at hand."

Find more details in the posts mentioned above.

上瘾入骨i

2020-12-10 16:15
Obviously, in the simplest cases like
```
<a>Test</a>
```
you might get by with a regex. But even then, a perfectly valid HTML tag could come in so many different varieties:
```
< A > Test</a>                // match
< a href="test">   Test</a>   // match
< A TEST="test"/>             // no match
< a href="test<">Test</A>     // invalid input - catch that with a regex!
```
that a regex to catch them all reliably becomes huge. A DOM-based parser will parse the markup, give you a proper error message if it fails, and provide stable results.
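
A minimal sketch of that contrast, in Python (my choice of language, not something the answer specifies). The pattern `SIMPLE_A`, the class `LinkText`, and the helper `link_texts` are hypothetical names made up for this example; only the standard-library `re` and `html.parser` modules are assumed.

```python
import re
from html.parser import HTMLParser

# Naive pattern: only matches a bare, lowercase <a>...</a> pair.
SIMPLE_A = re.compile(r"<a>(.*?)</a>")


class LinkText(HTMLParser):
    """Collects the text found inside <a> elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":  # html.parser reports tag names in lowercase
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.texts.append(data.strip())


def link_texts(html):
    """Return the text of every <a> element found in html."""
    parser = LinkText()
    parser.feed(html)
    parser.close()
    return parser.texts


samples = [
    "<a>Test</a>",
    '<A HREF="test">   Test</A>',
    "<a\n  href='x'>Test</a >",
]
for s in samples:
    # First column: did the naive regex match? Second: what the parser found.
    print(bool(SIMPLE_A.search(s)), link_texts(s))
```

The regex only recognizes the first sample, while the parser reports the link text for all three. Widening the pattern to cover case, attributes, and whitespace is possible, but each variety adds another clause, which is the growth the answer warns about.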

情歌与酒

2020-12-10 16:17

If you can guarantee that the pattern you need to match is within a single HTML tag, then maybe you could create a regular expression to match it.

In other words, not when you need an expression to find matching start and end tags, and not when the content you need to match might contain nested tags, comments, CDATA sections, etc.
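
To illustrate the distinction, here is a small Python sketch (the patterns and sample strings are invented for this example). The first pattern stays inside one tag and pulls out an attribute value. The second tries to pair a start tag with its end tag and picks the wrong end tag as soon as the element is nested.

```python
import re

# Stays inside a single tag: grab the src attribute of one <img> tag.
IMG_SRC = re.compile(r'<img\b[^>]*\bsrc="([^"]*)"', re.IGNORECASE)

# Tries to span a start tag, its content, and its end tag.
DIV_BODY = re.compile(r"<div>(.*?)</div>", re.DOTALL)

single = '<img alt="logo" src="logo.png">'
nested = "<div>outer <div>inner</div> tail</div>"

src = IMG_SRC.search(single).group(1)
body = DIV_BODY.search(nested).group(1)

print(src)   # logo.png
print(body)  # outer <div>inner  (stops at the inner </div>)
```

Even the single-tag pattern rests on an assumption: `[^>]*` stops at the first `>`, so it would misfire on an attribute value that itself contains `>`.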

爱一瞬间的悲伤

2020-12-10 16:17
One thing worth keeping in mind is that there are two main sources of objection to processing HTML with regular expressions. The first concerns the likelihood of junk HTML that is malformed in unpredictable ways. This is a legitimate reason to be skeptical of regex-based HTML processing, and it rules out many use cases from the start. The problem is that this objection is often used to "throw out the baby with the bathwater", and it is often conflated with the second main objection, usually without either being stated explicitly, even though the two are completely unrelated.

The other main source of objection is that the complexity of HTML exceeds some idealized, theoretical conception of a "regular expression". That conception is too general to apply to many use cases, yet it is usually applied across the board. The objection goes something like this:
1. Truism: Regular expressions process regular grammars.
2. Truism: HTML is not a regular grammar.
3. Conclusion: HTML cannot be processed with regular expressions.
I think a lot of people take these truisms at face value without considering what they mean. Bill Karwin, in another answer here, mentioned some cases where HTML is not a regular grammar, but this argument falls apart when the tool in question is a "regex" engine that has non-regular features (like backreferences, or even recursion). These features answer many of the "not a regular grammar" objections, but patterns that rely on them may still fail on malformed documents.

This distinction is rarely drawn, and it is rarely pointed out that most modern "regular" expression libraries have capabilities far beyond regular language processing. I think these are important things to consider whenever evaluating whether "regular" expressions are the appropriate tool for processing some HTML.
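
As one concrete instance of a non-regular feature, Python's standard `re` module supports backreferences (it does not support recursion). The sketch below uses `\1` to require that the closing tag name equals the opening one. The pattern name `PAIRED` and the sample strings are made up for illustration.

```python
import re

# \1 refers back to whatever tag name group 1 captured, which goes
# beyond what a formal regular expression can express.
PAIRED = re.compile(r"<(\w+)>(.*?)</\1>", re.DOTALL)

good = "<b>bold</b> and <i>italic</i>"
bad = "<b>unclosed <i>text</b>"

pairs = [(m.group(1), m.group(2)) for m in PAIRED.finditer(good)]
print(pairs)  # [('b', 'bold'), ('i', 'italic')]

# On malformed input the pattern still matches, silently swallowing
# the stray <i> instead of reporting a problem.
m = PAIRED.search(bad)
print(m.group(1), repr(m.group(2)))  # b 'unclosed <i>text'
```

The second result shows both halves of the argument: the backreference handles tag pairing, yet the malformed input goes undetected.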

猫巷女王i

2020-12-10 16:18

When you know what you're doing!

; )

遇见更好的自我

2020-12-10 16:19
If the data you are working with has a regular grammar, then regexes are great. HTML doesn't have a regular grammar, so things are more complex.

Regexes are suitable if you know with absolute certainty what sort of thing you are looking for. Replacing:
```
<tag>Info</tag>
```
with
```
<tag>Dave</tag>
```
in a document that you have complete control over would make sense, but real-life HTML isn't like this.
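
A minimal Python sketch of that controlled case, assuming a document we generated ourselves (the `doc` and `wild` strings and the `swap` helper are invented for the example):

```python
import re

# Assumes a document we wrote ourselves: <tag> has no attributes,
# no nesting, and consistent lowercase spelling.
doc = "<root><tag>Info</tag><other>Info</other></root>"


def swap(html):
    """Replace the exact element <tag>Info</tag> with <tag>Dave</tag>."""
    return re.sub(r"<tag>Info</tag>", "<tag>Dave</tag>", html)


updated = swap(doc)
print(updated)  # <root><tag>Dave</tag><other>Info</other></root>

# The same element as it might appear in HTML we do not control:
wild = '<TAG class="x">Info</TAG>'
print(swap(wild) == wild)  # True: nothing was replaced
```

The pattern is a literal string here, so `str.replace` would do the same job. The second call shows how quickly the assumption breaks once the markup comes from somewhere else.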