Do not try to solve everything with regular expressions.
While you can handle some parts that way, such as replacing entities, the much better approach is to use a (robust) HTML parser.
See this question: RegEx match open tags except XHTML self-contained tags
for why doing this with the regexp Swiss Army chainsaw is a bad idea. Seriously, read that question and its top answer; it is a Stack Overflow highlight!
Chuck Norris can parse HTML with regex.
The bad news is: there is more than one way to encode characters.
https://en.wikipedia.org/wiki/Character_encodings_in_HTML
For example, the character 'λ' can be represented as `&lambda;`, `&#955;` or `&#x3bb;`.
And if you are really unlucky, some web site relies on the ability of some browsers to guess what a character was meant to be. `&#153;`, for example, is not valid, yet many browsers will interpret it as `&trade;` (™).
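To see how many spellings collapse to the same character, here is a small sketch. It assumes Python and its standard `html` module purely for illustration; the variable names are made up, and other languages have comparable library functions.

```python
import html

# Three different spellings of the same character (named, decimal, hex).
forms = ["&lambda;", "&#955;", "&#x3bb;"]
decoded = [html.unescape(form) for form in forms]

# The technically invalid reference that browsers treat as the trademark sign.
# html.unescape follows the HTML5 rules and applies the same mapping.
lenient = html.unescape("&#153;")

print(decoded, lenient)
```

A hand-written regular expression would need to cover all of these cases, plus the full named-entity table, to behave the same way.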
Clearly it is a good idea to leave this to a dedicated library instead of trying to hack a custom regular expression yourself.
So I strongly recommend:
- Feed string into a robust HTML parser
- Get parsed (and fully decoded) string back
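The two steps above could look roughly like the following. This is only a minimal sketch, assuming Python and its built-in `html.parser`; `TextExtractor` and `html_to_text` are hypothetical names chosen for this example, and a dedicated third-party parser may cope better with badly broken markup.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text content of a document and drops the tags."""

    def __init__(self):
        # convert_charrefs=True makes the parser decode entities
        # before the text reaches handle_data.
        super().__init__(convert_charrefs=True)
        self._chunks = []

    def handle_data(self, data):
        self._chunks.append(data)

    def text(self):
        return "".join(self._chunks)


def html_to_text(markup):
    """Feed the markup into the parser and return the decoded text."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()


print(html_to_text("<p>&lambda; is <b>not</b> &#153;</p>"))
```

Note that this sketch also returns the contents of `<script>` and `<style>` elements as text, so a real implementation would probably want to skip those.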