How to implement an HTML page scrubber similar to Arc90's Readability or Instapaper?

I'm interested in finding out how to scrub an HTML page and present it nicely -- remove all the clutter and reformat the main text into a very readable format -- like http://lab.a

3 Answers

醉酒成梦

2021-02-01 00:25
Readability is not a simple parser; it uses a complex algorithm to retrieve only the required components. If you are not a programming guru, I would suggest using their free service, described below.

You can request developer API access from Readability (http://www.readability.com/publishers/api).

If you request access to the parser, it will do exactly what you want to achieve: extract the content from sites. Just remember to give them a good enough reason to let you use their API.

A query to their parsing service looks like this:

https://www.readability.com/api/content/v1/parser?url={url to be parsed here}&token={your api key here}
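
As a rough sketch, the query URL above could be assembled in Python like this. The endpoint and the `url` and `token` parameter names are taken from the URL above; the function name `build_parser_url` is just an illustrative choice, and whether the service still answers is not something this snippet checks.

```python
from urllib.parse import urlencode

# Endpoint copied from the query shown above (assumed, not verified here).
PARSER_ENDPOINT = "https://www.readability.com/api/content/v1/parser"


def build_parser_url(page_url: str, token: str) -> str:
    """Return the parser query URL with both parameters percent-encoded."""
    # urlencode escapes characters such as '&' and '?' inside page_url,
    # so the target page's own query string cannot leak into ours.
    query = urlencode({"url": page_url, "token": token})
    return f"{PARSER_ENDPOINT}?{query}"


if __name__ == "__main__":
    # "YOUR_API_KEY" is a placeholder, not a real token.
    print(build_parser_url("http://example.com/post?id=1&lang=en", "YOUR_API_KEY"))
```

Encoding the page URL matters because the page you want parsed will often carry its own query string.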

The request will return a response like:

```
HTTP/1.0 200 OK

{
    "domain": "blog.readability.com",
    "author": "Richard Ziade",
    "url": "http://blog.readability.com/2011/02/step-up-be-heard-readability-ideas/",
    "short_url": "http://rdd.me/kbgr5a1k",
    "title": "Step Up & Be Heard: Readability Ideas",
    "total_pages": 1,
    "word_count": 175,
    "content": "<div>\n  \n<div class=\"entry\">\n\t<p>When we launched Readability [snip] ...</div>\n</div>",
    "date_published": "2011-02-22 00:00:00",
    "next_page_id": null,
    "rendered_pages": 1
}
```
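
A minimal sketch of how the JSON body of such a response might be consumed in Python, assuming it carries the `title`, `word_count`, and `content` fields shown above. The `summarize` helper and the sample body are made up for illustration, and the tag stripping is deliberately crude (good enough for a preview, not for rendering).

```python
import json
import re
from html import unescape


def summarize(response_body: str, preview_chars: int = 80) -> dict:
    """Pick a few fields out of a parser response and build a text preview."""
    data = json.loads(response_body)
    # Crude tag removal: replace anything that looks like a tag with a space.
    text = re.sub(r"<[^>]+>", " ", data.get("content", ""))
    # Decode entities such as &amp; and collapse runs of whitespace.
    text = " ".join(unescape(text).split())
    return {
        "title": data.get("title"),
        "word_count": data.get("word_count"),
        "preview": text[:preview_chars],
    }


if __name__ == "__main__":
    # Hypothetical body shaped like the response above, not real API output.
    sample = json.dumps({
        "title": "Example",
        "word_count": 3,
        "content": "<div><p>Hello &amp; welcome</p></div>",
    })
    print(summarize(sample))
```

The `content` field is already cleaned-up HTML, so in a real reader view you would more likely drop it into your own template than strip the tags.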
For the hard-core guys out there, check out the Node.js, Ruby, and Python ports of Readability here: http://arrix.blogspot.com/2010/11/server-side-readability-with-nodejs.html

Happy coding
上瘾入骨i

2021-02-01 00:33

If the web page or site in question makes good use of semantic elements and structure, you could simply apply a different CSS stylesheet, which can drastically change the layout and display.
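
A rough sketch of that idea in Python: drop the page's own stylesheets and inject a reading-friendly one. The `restyle` function and the `READING_CSS` rules are illustrative assumptions, and the regular expressions are a shortcut that only copes with reasonably well-formed markup; a real HTML parser would be more robust.

```python
import re

# Hypothetical reading stylesheet; tune to taste.
READING_CSS = "body { max-width: 40em; margin: 2em auto; font: 1.1em/1.6 Georgia, serif; }"

# <link ... rel="stylesheet" ...> tags (quoted or unquoted attribute value).
_LINK_RE = re.compile(r'''<link\b[^>]*\brel\s*=\s*["']?stylesheet["']?[^>]*>''', re.IGNORECASE)
# Inline <style>...</style> blocks, including multi-line ones.
_STYLE_RE = re.compile(r"<style\b[^>]*>.*?</style\s*>", re.IGNORECASE | re.DOTALL)
_HEAD_CLOSE_RE = re.compile(r"</head\s*>", re.IGNORECASE)


def restyle(html: str, css: str = READING_CSS) -> str:
    """Remove the page's stylesheets and insert `css` just before </head>."""
    html = _LINK_RE.sub("", html)
    html = _STYLE_RE.sub("", html)
    style_tag = f"<style>{css}</style>"
    if _HEAD_CLOSE_RE.search(html):
        # A function replacement avoids backslash handling in the CSS text.
        return _HEAD_CLOSE_RE.sub(lambda m: style_tag + m.group(0), html, count=1)
    # No <head> found: fall back to prepending the style block.
    return style_tag + html


if __name__ == "__main__":
    page = ('<html><head><link rel="stylesheet" href="site.css"></head>'
            "<body><article><p>Hi</p></article></body></html>")
    print(restyle(page))
```

This only changes presentation; sidebars, ads, and navigation stay in the document unless the new stylesheet hides them.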

猫巷女王i

2021-02-01 00:46

Goose (https://github.com/jiminoc/goose/wiki) does something like what you're asking. Its source code is openly available, along with unit tests.
