There's a lot of scholarly work on HTML content extraction, e.g., Gupta & Kaiser (2005), "Extracting Content from Accessible Web Pages", and some signs of interest here, e.g.,
I have come across http://www.keyvan.net/2010/08/php-readability/, where the author writes:
Last year I ported Arc90's Readability to use in the Five Filters project. It's been over a year now and Readability has improved a lot, thanks to Chris Dary and the rest of the team at Arc90.
As part of an update to the Full-Text RSS service I started porting a more recent version (1.6.2) to PHP and the code is now online.
For anyone not familiar, Readability was created for use as a browser addon (a bookmarklet). With one click it transforms web pages for easy reading and strips away clutter. Apple recently incorporated it into Safari Reader.
It’s also very handy for content extraction, which is why I wanted to port it to PHP in the first place.