I'm interested in finding out how to scrub an HTML page and present it nicely -- remove all the clutter and reformat the main text into a very readable format -- like http://lab.a
Readability is not a simple parser. It uses a complex algorithm to retrieve only the required components, so if you are not a programming guru, I would suggest using their free service, described below.
You can request a developer API key from Readability (http://www.readability.com/publishers/api).
If you request access to the parser, it will do exactly what you want to achieve: extract the content from sites. Just remember to give them a good enough reason to allow you to use their API.
A query to their parsing service looks like the following:
https://www.readability.com/api/content/v1/parser?url={url to be parsed here}&token={your api key here}
The request will return a response like:
HTTP/1.0 200 OK
{
    "domain": "blog.readability.com",
    "author": "Richard Ziade",
    "url": "http://blog.readability.com/2011/02/step-up-be-heard-readability-ideas/",
    "short_url": "http://rdd.me/kbgr5a1k",
    "title": "Step Up & Be Heard: Readability Ideas",
    "total_pages": 1,
    "word_count": 175,
    "content": "\n\t\nWhen we launched Readability [snip] ...\n \n",
    "date_published": "2011-02-22 00:00:00",
    "next_page_id": null,
    "rendered_pages": 1
}
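As a rough illustration of the request and response above, here is a minimal Python sketch. The endpoint and the `url` and `token` parameters come from the query shown earlier; the function names (`build_parser_url`, `summarize`, `fetch_article`) and the choice of fields to keep are my own assumptions for the example, not part of Readability's API.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the query shown above.
PARSER_ENDPOINT = "https://www.readability.com/api/content/v1/parser"


def build_parser_url(page_url, token):
    """Return the parser query URL with both parameters percent-encoded."""
    return PARSER_ENDPOINT + "?" + urlencode({"url": page_url, "token": token})


def summarize(response_body):
    """Pick a few fields out of the JSON body returned by the parser.

    The field names match the sample response above; which ones to keep
    is an arbitrary choice for this sketch.
    """
    data = json.loads(response_body)
    return {
        "title": data.get("title"),
        "author": data.get("author"),
        "word_count": data.get("word_count"),
        "content": data.get("content"),
    }


def fetch_article(page_url, token):
    """Call the service and summarize the reply (needs network access and a valid key)."""
    with urlopen(build_parser_url(page_url, token)) as resp:
        return summarize(resp.read().decode("utf-8"))
```

Something like `fetch_article("http://blog.readability.com/2011/02/step-up-be-heard-readability-ideas/", "your api key here")` would then give you the cleaned-up article body in the `content` field, assuming your key has been approved for the parser.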
For the hardcore guys out there, check out the Node.js, Ruby, and Python ports of Readability here: http://arrix.blogspot.com/2010/11/server-side-readability-with-nodejs.html
Happy coding