How to implement an HTML page scrubber similar to Arc90's Readability or Instapaper?

Front-end  Unresolved  3  1248
故里飘歌 2021-01-31 23:45

I'm interested in finding out how to scrub an HTML page and present it nicely (remove all the clutter and reformat the main text into a very readable format), like http://lab.a

3 answers
  •  醉酒成梦
    2021-02-01 00:25

    Readability is not a simple parser; it uses a complex algorithm to retrieve only the required components. If you are not a programming guru, I would suggest using their free service, described below.

    You can request a developer API key from Readability (http://www.readability.com/publishers/api).

    If you request access to the parser, it will do exactly what you want to achieve: extract content from sites. Just remember to give them a good enough reason to allow you to use their API.

    A query to their parsing service looks like the following:

    https://www.readability.com/api/content/v1/parser?url={url to be parsed here}&token={your api key here}
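    As a minimal sketch of how that query string could be assembled in Python: the endpoint and the `url` and `token` parameter names come from the template above, while `build_parser_url` and the placeholder key are hypothetical names used only for illustration. The sketch builds the URL but does not send the request.

```python
from urllib.parse import urlencode

# Endpoint taken from the query template quoted in this answer.
PARSER_ENDPOINT = "https://www.readability.com/api/content/v1/parser"


def build_parser_url(page_url: str, token: str) -> str:
    """Return the parser query URL with both parameters percent-encoded.

    build_parser_url is a hypothetical helper, not part of any official client.
    """
    query = urlencode({"url": page_url, "token": token})
    return f"{PARSER_ENDPOINT}?{query}"


if __name__ == "__main__":
    # "YOUR_API_KEY" is a placeholder; a real key has to be requested first.
    print(build_parser_url(
        "http://blog.readability.com/2011/02/step-up-be-heard-readability-ideas/",
        "YOUR_API_KEY",
    ))
```

    Encoding the page URL matters because it usually contains characters such as `:`, `/`, `?` and `&` that would otherwise be read as part of the outer query.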

    The request will return a response like:

    HTTP/1.0 200 OK
    {
        "domain": "blog.readability.com",
        "author": "Richard Ziade",
        "url": "http://blog.readability.com/2011/02/step-up-be-heard-readability-ideas/",
        "short_url": "http://rdd.me/kbgr5a1k",
        "title": "Step Up & Be Heard: Readability Ideas",
        "total_pages": 1,
        "word_count": 175,
        "content": "\n \n\n\tWhen we launched Readability [snip] ...\n",
        "date_published": "2011-02-22 00:00:00",
        "next_page_id": null,
        "rendered_pages": 1
    }
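    A response body shaped like the one above could be read with Python's standard `json` module. This is only a sketch: `SAMPLE_BODY` is a trimmed copy of the example response (with the snipped content left as-is), and `extract_article` is a hypothetical helper name, not something the API provides.

```python
import json

# Trimmed copy of the example response; raw string so \n and \t stay JSON escapes.
SAMPLE_BODY = r'''{
    "domain": "blog.readability.com",
    "title": "Step Up & Be Heard: Readability Ideas",
    "total_pages": 1,
    "word_count": 175,
    "content": "\n \n\n\tWhen we launched Readability [snip] ...\n",
    "next_page_id": null
}'''


def extract_article(body: str) -> dict:
    """Pick out the fields a reader view would need (hypothetical helper)."""
    data = json.loads(body)
    return {
        "title": data["title"],
        "word_count": data["word_count"],
        # Drop the leading and trailing whitespace seen in the example content.
        "text": data["content"].strip(),
        # JSON null becomes None, so this is False on the last page.
        "has_next_page": data.get("next_page_id") is not None,
    }


if __name__ == "__main__":
    article = extract_article(SAMPLE_BODY)
    print(article["title"], article["word_count"], article["has_next_page"])
```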

    For the hardcore folks out there, check out the Node.js, Ruby, and Python ports of Readability here: http://arrix.blogspot.com/2010/11/server-side-readability-with-nodejs.html

    Happy coding
