Does anyone know of an algorithm that extracts the content from a web page, like Instapaper does?
If you just want all the content and none of the formatting, then in Python:
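>>> from bs4 import BeautifulSoup  # BeautifulSoup 4; the old "BeautifulSoup" module is Python 2 only
>>> from urllib.request import urlopen
>>> soup = BeautifulSoup(urlopen("http://www.python.org/").read(), "html.parser")
>>> contents = ''.join(soup.find_all(string=True))  # every text node, concatenated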
does the trick.
boilerpipe is open-source Java. The algorithm is published in a scientific paper, so you can read how well it does compared with other algorithms. From what I've read, it seems to be one of the best.
There is an open-source application that parses the text of an article out of any web page:
https://github.com/jiminoc/goose/wiki
It should do the trick.
There are two steps to what Instapaper does:
To find the content block (typically some HTML block element, such as a div, that contains the key text content of the page), Instapaper uses an algorithm much like the one used by Readability. You can look at the source of readability.js to see what's going on, but at its core it tries to find the area of the page with the highest text-to-link ratio. Some other simple scoring metrics also go into the heuristics (off the top of my head: things like the ratio of text to commas, the number of paragraph elements, etc.).
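To make the text-to-link idea concrete, here is a minimal sketch using only the standard library. It is not Readability's or Instapaper's actual code. The candidate tag set, the rule of crediting text only to the innermost candidate block, and the score formula (non-link characters plus one point per comma) are all assumptions chosen for illustration, and the parser assumes reasonably well-nested HTML.

```python
from html.parser import HTMLParser

# Assumed set of container tags that may hold the article body.
CANDIDATE_TAGS = {"div", "article", "section", "td", "body"}
IGNORED_TAGS = {"script", "style"}


class BlockScorer(HTMLParser):
    """Collects text statistics per candidate block (assumes well-nested HTML)."""

    def __init__(self):
        super().__init__()
        self.open_blocks = []   # stack of candidate blocks currently open
        self.blocks = []        # candidate blocks that have been closed
        self.link_depth = 0     # > 0 while inside an <a> element
        self.ignored_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in CANDIDATE_TAGS:
            self.open_blocks.append({"tag": tag, "attrs": dict(attrs),
                                     "text_chars": 0, "link_chars": 0, "commas": 0})
        elif tag == "a":
            self.link_depth += 1
        elif tag in IGNORED_TAGS:
            self.ignored_depth += 1

    def handle_endtag(self, tag):
        if tag in CANDIDATE_TAGS and self.open_blocks:
            self.blocks.append(self.open_blocks.pop())
        elif tag == "a" and self.link_depth:
            self.link_depth -= 1
        elif tag in IGNORED_TAGS and self.ignored_depth:
            self.ignored_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text or self.ignored_depth or not self.open_blocks:
            return
        block = self.open_blocks[-1]  # credit only the innermost candidate block
        block["text_chars"] += len(text)
        block["commas"] += text.count(",")
        if self.link_depth:
            block["link_chars"] += len(text)


def score(block):
    """Illustrative score: characters outside links, plus one point per comma."""
    if block["text_chars"] == 0:
        return 0.0
    link_density = block["link_chars"] / block["text_chars"]
    return block["text_chars"] * (1.0 - link_density) + block["commas"]


def find_content_block(html):
    """Return the stats dict of the best-scoring block, or None if there is none."""
    scorer = BlockScorer()
    scorer.feed(html)
    scorer.close()
    candidates = scorer.blocks + scorer.open_blocks  # include unclosed blocks
    return max(candidates, key=score, default=None)
```

Calling find_content_block(page_html) returns a dict whose "attrs" entry (for example its id or class) tells you which element won. A navigation bar made entirely of links has a link density of 1 and so scores 0, which is the effect the heuristic is after.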
Once you have identified the root node element with the relevant content, you'll need to format it. If you want, you can just pull the node element containing the text out of the source document and insert it into yours, but in reality you'll probably want to remove the existing styles and apply your own for a standard look and feel. If you want nice text-only output, you can use Jericho's Renderer.
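As a rough illustration of the "remove existing styles" step, the sketch below re-serializes an HTML fragment while dropping presentational attributes and style/script elements, again using only the standard library. The lists of attributes and tags to drop are assumptions for the example, not what Instapaper or Readability actually removes, and StyleStripper and strip_styles are hypothetical names.

```python
from html import escape
from html.parser import HTMLParser

# Assumed lists; adjust to taste.
DROP_ATTRS = {"style", "class", "id", "align", "bgcolor", "width", "height"}
DROP_ELEMENTS = {"style", "script"}  # removed together with their contents
UNWRAP_TAGS = {"font", "center"}     # tag removed, inner text kept
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}  # never get an end tag


class StyleStripper(HTMLParser):
    """Re-emits the markup it is fed, minus styling (assumes well-nested HTML)."""

    def __init__(self):
        super().__init__()
        self.out = []
        self.skip_depth = 0  # > 0 while inside a dropped element

    def handle_starttag(self, tag, attrs):
        if tag in DROP_ELEMENTS:
            self.skip_depth += 1
            return
        if self.skip_depth or tag in UNWRAP_TAGS:
            return
        kept = "".join(
            f' {name}="{escape(value or "", quote=True)}"'
            for name, value in attrs
            # drop presentational attributes and inline event handlers
            if name not in DROP_ATTRS and not name.startswith("on")
        )
        self.out.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag):
        if tag in DROP_ELEMENTS:
            if self.skip_depth:
                self.skip_depth -= 1
            return
        if self.skip_depth or tag in UNWRAP_TAGS or tag in VOID_TAGS:
            return
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            # the parser hands us unescaped text, so escape it again on output
            self.out.append(escape(data, quote=False))


def strip_styles(fragment):
    """Return the fragment with styling attributes and elements removed."""
    stripper = StyleStripper()
    stripper.feed(fragment)
    stripper.close()
    return "".join(stripper.out)
```

You would pass the serialized content node to strip_styles and then wrap the result in your own template and stylesheet. For real-world pages a proper HTML sanitizer is a safer choice than this sketch.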
update1: I should also mention something else Instapaper does: it follows the 'pagination' links (the "next" or "1", "2", "3" links) of the article to their conclusion, so that a piece that spans many pages in the original is rendered as a single document.
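A minimal sketch of following pagination links might look like the following. How Instapaper actually detects them is not described here, so the detection rule (an anchor with rel="next", or whose visible label is one of a few "next"-style strings) is an assumption, as are the names NextLinkFinder, collect_pages and fetch_url and the max_pages cap. Numbered "1", "2", "3" links are not handled.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

# Assumed labels that mark a "next page" link.
NEXT_LABELS = {"next", "next page", "next >", "next »"}


class NextLinkFinder(HTMLParser):
    """Remembers the href of the first anchor that looks like a 'next' link."""

    def __init__(self):
        super().__init__()
        self.next_href = None
        self._href = None       # href of the anchor currently open, if any
        self._rel_next = False
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            attrs = dict(attrs)
            self._href = attrs.get("href")
            self._rel_next = "next" in (attrs.get("rel") or "").lower().split()
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            label = "".join(self._text).strip().lower()
            if self.next_href is None and (self._rel_next or label in NEXT_LABELS):
                self.next_href = self._href
            self._href = None


def fetch_url(url):
    """Default fetcher: download a page and decode it as UTF-8."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def collect_pages(start_url, fetch=fetch_url, max_pages=20):
    """Follow 'next' links from start_url; return the HTML of each page in order."""
    pages, seen, url = [], set(), start_url
    while url and url not in seen and len(pages) < max_pages:
        seen.add(url)  # guards against pagination loops
        html = fetch(url)
        pages.append(html)
        finder = NextLinkFinder()
        finder.feed(html)
        finder.close()
        url = urljoin(url, finder.next_href) if finder.next_href else None
    return pages
```

Each returned page can then be run through content extraction and the results concatenated into a single document. The fetch parameter is there so the crawl can be tried against canned pages without touching the network.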
update2: I recently came across this comparison of text extraction algorithms.