I\'m interested to find out how to scrub a html page and present it nicely -- remove all the clutters and reformat the main text into a very readable format -- like http://lab.a
Readability is not a simple parser, it use complex algorithm to retrieve only the required components, if you are a not a guru at programming i would suggest you use their free service highlighted below.
you can request for a developer api from readability (http://www.readability.com/publishers/api)
If you request for the parser it will do exactly what you want to achieve, and that is to extract content from sites. Just remember to give them a good enough reason to allow you to use their API.
A query to their parsing service will look like the following
https://www.readability.com/api/content/v1/parser?url={url to be parsed here}&token={your api key here}
The request will return a response like:
HTTP/1.0 200 OK { "domain": "blog.readability.com", "author": "Richard Ziade", "url": "http://blog.readability.com/2011/02/step-up-be-heard-readability-ideas/",
"short_url": "http://rdd.me/kbgr5a1k", "title": "Step Up & Be Heard: Readability Ideas", "total_pages": 1, "word_count": 175, "content": "<div>\n \n<div class=\"entry\">\n\t<p>When we launched Readability [snip] ...</div>\n</div>", "date_published": "2011-02-22 00:00:00", "next_page_id": null, "rendered_pages": 1 }
For the hard core guys out there, checkout readability nodeJS,ruby and python port from here http://arrix.blogspot.com/2010/11/server-side-readability-with-nodejs.html
Happy coding
If the web page or site in question has good use of semantic elements and structure, you could just use a different CSS stylesheet, which can drastically change the layout and display completely.
https://github.com/jiminoc/goose/wiki does something like you're asking, source code is openly available along with unit tests