So i\'m writing a quick perl script that cleans up some HTML code and runs it through a html -> pdf program. I want to lose as little information as possible, so I\'d like t
Must this be done with regex? Parsing any markup language (or even CSV) with regex is fraught with error. If you can, try to utilize a standard library:
http://search.cpan.org/dist/HTML-Parser/Parser.pm
Otherwise you risk the revenge of Cthulu:
http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html
(Yes, the article leaves room for some simple string-manipulation, so I think your soul is safe, though. :-)