I am aware that regex is not ideal for use with HTML strings, and I have looked at the PHP Simple HTML DOM Parser, but I still believe this is the way to go. All the HTML tags w
Unfortunately, I think the logic you need is still more complex than text pattern matching can handle :-/
I know it's not the answer you want to hear, but you'll probably get better results with a DOM model.
Here's a discussion of this topic elsewhere: http://coderzone.org/forum/index.php?topic=84.0
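To make the parser-based suggestion concrete, here is a rough sketch. It assumes the filter's job is to turn a keyword into a link, which is a guess from the thread and not something the question spells out. It uses Python's standard `html.parser` for brevity, not PHP; in PHP, walking the text nodes of a `DOMDocument` would play the same role. The names `KeywordLinker` and `link_keyword` and the example URL are made up for illustration.

```python
import re
from html import escape
from html.parser import HTMLParser


class KeywordLinker(HTMLParser):
    """Re-emit HTML, linking a keyword only in text outside <a>, <script>, <style>."""

    SKIP_TAGS = {"a", "script", "style"}  # text inside these is left untouched

    def __init__(self, keyword, url):
        # convert_charrefs=False so entities are passed through unchanged
        super().__init__(convert_charrefs=False)
        self.pattern = re.compile(r"\b%s\b" % re.escape(keyword))
        self.link = '<a href="%s">' % escape(url)
        self.skip_depth = 0  # > 0 while inside any tag listed in SKIP_TAGS
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self.skip_depth += 1
        self.out.append(self.get_starttag_text())  # original tag text, verbatim

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # e.g. <br/>

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if self.skip_depth == 0:
            data = self.pattern.sub(lambda m: self.link + m.group(0) + "</a>", data)
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)


def link_keyword(html, keyword, url):
    """Hypothetical helper: return html with keyword linked outside existing links."""
    parser = KeywordLinker(keyword, url)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


html = '<p>PHP is fun. <a href="/x">Learn PHP</a></p>'
once = link_keyword(html, "PHP", "http://example.com/php")
twice = link_keyword(once, "PHP", "http://example.com/php")
```

Because text inside an existing `<a>` is skipped, running the filter a second time should leave the output unchanged, and links already present in the original text are not wrapped again. A plain regex over the raw string has no reliable way to know it is inside a link.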
Is it possible to just run the filter once, so you don't end up with dupes? Or could the original corpus also include links?