Instapaper-like algorithm


北恋

Does anyone know of an algorithm that extracts the content from a web page, like Instapaper does?

4 answers

盖世英雄少女心

2021-01-29 18:15

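If you just want all the content and none of the formatting, in Python (updated here for Python 3 and the bs4 package):

>>> from urllib.request import urlopen
>>> from bs4 import BeautifulSoup
>>> # Parse the page with the standard-library HTML parser
>>> soup = BeautifulSoup(urlopen("http://www.python.org/").read(), "html.parser")
>>> # Join every text node in the document into one string
>>> contents = ''.join(soup.find_all(string=True))

does the trick.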


刺人心

2021-01-29 18:19

boilerpipe is an open-source Java library. Its algorithm is published in a scientific paper, so you can read how well it does compared to other algorithms. From what I have read, it seems to be one of the best.

轻奢々

2021-01-29 18:31

There is an open-source application that parses the text of an article out of any web page:

https://github.com/jiminoc/goose/wiki

It should do the trick.

执念已碎

2021-01-29 18:36
There are two steps to what Instapaper does:
1. Find the main content block on the page (excluding headers, footers, menus, etc.)
2. Extract and format the text from this content block
To find the content block (typically an HTML block element, such as a div, containing the key text content of the page), Instapaper uses an algorithm much like the one used by Readability. You can look at the source of readability.js to see what's going on, but at its core it tries to find the area of the page with the highest text-to-link ratio. It also has some other simple scoring metrics (off the top of my head, things like the ratio of text to commas, paragraph elements, etc.) that go into the heuristics.
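A rough sketch of that kind of scoring, using only the Python standard library. This is not Instapaper's code or readability.js; it is loosely modelled on the idea described above, and the thresholds, weights, and names (`Node`, `TreeBuilder`, `find_content_block`, `SAMPLE`) are assumptions made for illustration. Each paragraph earns points for its commas and length, the points are credited to its parent and (at half weight) its grandparent, and each candidate is then penalised by its link density.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag.
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}


class Node:
    """A minimal DOM node: children are Nodes or plain strings."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []

    def text(self):
        if self.tag in ("script", "style"):
            return ""  # code and CSS are not readable content
        return "".join(c if isinstance(c, str) else c.text()
                       for c in self.children)

    def walk(self):
        yield self
        for child in self.children:
            if isinstance(child, Node):
                yield from child.walk()


class TreeBuilder(HTMLParser):
    """Builds a forgiving tree out of possibly sloppy HTML."""

    def __init__(self):
        super().__init__()
        self.root = Node("[document]")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.children.append(data)


def link_density(node):
    """Fraction of a node's text that sits inside <a> elements."""
    total = len(node.text())
    if total == 0:
        return 0.0
    linked = sum(len(a.text()) for a in node.walk() if a.tag == "a")
    return linked / total


def find_content_block(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    scores = {}
    for p in builder.root.walk():
        if p.tag != "p":
            continue
        text = p.text().strip()
        if len(text) < 25:  # assumed cut-off for "too short to matter"
            continue
        score = 1 + text.count(",") + min(len(text) // 100, 3)
        parent = p.parent
        grandparent = parent.parent if parent is not None else None
        for ancestor, weight in ((parent, 1.0), (grandparent, 0.5)):
            if ancestor is not None and ancestor is not builder.root:
                scores[ancestor] = scores.get(ancestor, 0.0) + score * weight
    if not scores:
        return None
    # Link-heavy blocks (menus, footers) are scaled down towards zero.
    return max(scores, key=lambda n: scores[n] * (1 - link_density(n)))


SAMPLE = """<html><body>
<div id="nav"><p><a href="/">Home, news, sport, weather and more links</a></p></div>
<div id="article">
<p>First paragraph, long enough, with commas, to score well in the heuristic.</p>
<p>Second paragraph, also with text, continues the story for the reader.</p>
</div>
<div id="footer"><p><a href="/about">About us, contact, terms</a></p></div>
</body></html>"""

block = find_content_block(SAMPLE)
print(block.tag, block.attrs.get("id"))
```

On this sample the navigation div scores zero because all of its text is link text, so the article div is selected. Real pages need many more heuristics (class and id hints, minimum content size, and so on), which is why looking at an existing implementation is worthwhile.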

Once you have identified the root node element containing the relevant content, you'll need to format it. If you want, you can just pull that node out of the source document and insert it into your own, but in practice you'll probably want to remove the existing styles and apply your own, for a standard look and feel. If you want clean text-only output, you can use Jericho's Renderer.
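If you are not on Java, the text-only step can be approximated in a few lines. The following is a minimal sketch with Python's standard library, not a port of Jericho's Renderer; the tag sets and the names `TextRenderer` and `render_text` are assumptions chosen for illustration. It drops all markup and attributes (so inline styles disappear with them), skips script and style content, and puts a blank line between block-level elements.

```python
import re
from html.parser import HTMLParser

# Assumed set of tags that should start a new paragraph in the output.
BLOCK_TAGS = {"p", "div", "br", "li", "tr", "blockquote", "pre",
              "h1", "h2", "h3", "h4", "h5", "h6"}
SKIP_TAGS = {"script", "style"}


class TextRenderer(HTMLParser):
    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            if self.skip_depth:
                self.skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        if not self.skip_depth:
            # Collapse source whitespace so only block tags create breaks.
            self.parts.append(re.sub(r"\s+", " ", data))


def render_text(html_fragment):
    """Return the fragment as plain text, one blank line between blocks."""
    renderer = TextRenderer()
    renderer.feed(html_fragment)
    renderer.close()
    lines = (line.strip() for line in "".join(renderer.parts).split("\n"))
    return "\n\n".join(line for line in lines if line)


fragment = ('<div style="color:red"><p>Hello   <b>world</b>,\n again.</p>'
            '<script>var x = 1;</script><p>Second.</p></div>')
print(render_text(fragment))
```

If you want styled HTML rather than plain text, the same traversal can re-emit the tags without their style and class attributes and leave the presentation to your own stylesheet.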

Update 1: I should also mention something else Instapaper does, which is to follow the pagination links (the "next" or "1", "2", "3" links) of the article to their conclusion, so that a piece that spans many pages in the original is rendered as a single document.
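A minimal sketch of the pagination step, again in standard-library Python and not taken from Instapaper. It only follows links marked rel="next" or labelled with one of a few assumed strings (`NEXT_LABELS`); numbered "1", "2", "3" links are not handled. The names `NextLinkFinder`, `fetch`, and `collect_pages`, the page limit, and the example URLs are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

# Assumed link labels that mean "go to the next page".
NEXT_LABELS = {"next", "next page", "next »", "»"}


class NextLinkFinder(HTMLParser):
    """Remembers the href of the first link that looks like 'next'."""

    def __init__(self):
        super().__init__()
        self.next_href = None
        self._href = None      # href of the <a> currently open, if any
        self._rel_next = False
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            attrs = dict(attrs)
            self._href = attrs.get("href")
            self._rel_next = "next" in (attrs.get("rel") or "").lower().split()
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            label = " ".join("".join(self._text).split()).lower()
            if self.next_href is None and (self._rel_next or label in NEXT_LABELS):
                self.next_href = self._href
            self._href = None


def fetch(url):
    """Download a page; assumes UTF-8 and replaces undecodable bytes."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def collect_pages(start_url, fetch_page=fetch, max_pages=20):
    """Follow 'next' links from start_url and return the HTML of each page."""
    pages, seen, url = [], set(), start_url
    while url and url not in seen and len(pages) < max_pages:
        seen.add(url)  # guards against pagination loops
        html = fetch_page(url)
        pages.append(html)
        finder = NextLinkFinder()
        finder.feed(html)
        finder.close()
        url = urljoin(url, finder.next_href) if finder.next_href else None
    return pages


# Offline demonstration with a fake three-page article.
site = {
    "http://example.com/a?page=1": '<p>one</p><a href="?page=2">Next</a>',
    "http://example.com/a?page=2": '<p>two</p><a href="/a?page=1">1</a>'
                                   '<a rel="next" href="/a?page=3">3</a>',
    "http://example.com/a?page=3": '<p>three</p><a href="/a?page=2">Previous</a>',
}
pages = collect_pages("http://example.com/a?page=1", fetch_page=site.__getitem__)
print(len(pages))
```

Each collected page would then go through the content-block and formatting steps, and the results would be concatenated into one document.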

Update 2: I recently came across this comparison of text extraction algorithms.