One thing worth keeping in mind is that there are two main sources of objection to processing HTML with regular expressions. The first concerns the likelihood of junk HTML that is unpredictably malformed. That alone is a legitimate reason to be skeptical of regex-based HTML processing, and it rules out a lot of use cases from the start. The problem is that this objection is often used to "throw out the baby with the bathwater", and it is also often conflated with the second main objection, usually without either being stated explicitly, even though the two are completely unrelated.
The other main source of objection is that the complexity of the HTML language exceeds some idealized, theoretical conception of "regular expression". That conception is too general to apply to many use cases, yet it is usually applied across the board. The objection goes something like this:
- Truism: Regular expressions process regular grammars.
- Truism: HTML is not a regular grammar.
- Conclusion: HTML cannot be processed with regular expressions.
I think a lot of people take these truisms at face value without considering what is meant by them. Bill Karwin, in another answer here, mentioned some cases where HTML is not a regular grammar, but that argument falls apart when the "regex" engine in question has non-regular features (like back references, or even recursion). These features answer many of the "not a regular grammar" objections, though they may still fail on malformed documents.
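To make the back reference point concrete, here is a minimal sketch using Python's standard `re` module. The pattern, the `paired_elements` helper, and the sample string are my own illustrations, not anything from the other answers. The `\1` back reference requires the closing tag name to repeat whatever name the opening tag captured, which a strictly regular expression could not do for arbitrary tag names.

```python
import re

# \1 refers back to the tag name captured by the first group, so the
# closing tag must carry the same name as the opening tag.
# Illustrative pattern only: it assumes reasonably well-formed input.
PAIRED_TAG = re.compile(
    r"<([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>(.*?)</\1>",
    re.DOTALL,
)

def paired_elements(html):
    """Return (tag_name, inner_text) for each element whose tags pair up."""
    return [(m.group(1), m.group(2)) for m in PAIRED_TAG.finditer(html)]

# The middle element opens with <i> but closes with </b>, so it is skipped.
sample = '<b>bold</b> <i>mismatch</b> <em class="x">ok</em>'
result = paired_elements(sample)
print(result)  # [('b', 'bold'), ('em', 'ok')]
```

This still breaks on same-name nesting such as `<div><div>a</div></div>`, where the lazy match stops at the first `</div>`. Handling that needs recursion, which Python's `re` does not offer but engines such as PCRE do. And as noted above, none of this helps with unpredictably malformed input.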
This distinction is rarely drawn, and it is rarely pointed out that most modern "regular" expression libraries have capabilities far beyond regular language processing. I think these are important things to consider whenever evaluating "regular" expressions as the appropriate tool to process some HTML.