How can I implement an HTML page scrubber like Arc90's Readability or Instapaper?

故里飘歌 2021-01-31 23:45

I'm interested in finding out how to scrub an HTML page and present it nicely (remove all the clutter and reformat the main text into a very readable format), like http://lab.a

3 answers
  • 2021-02-01 00:25

    Readability is not a simple parser; it uses a complex algorithm to retrieve only the required components. If you are not a programming guru, I would suggest using their free service, described below.

    You can request developer API access from Readability (http://www.readability.com/publishers/api).

    If you request access to the parser, it will do exactly what you want to achieve: extract the content from sites. Just remember to give them a good enough reason to let you use their API.

    A query to their parsing service looks like this:

    https://www.readability.com/api/content/v1/parser?url={url to be parsed here}&token={your api key here}

    The request will return a response like:

    HTTP/1.0 200 OK

    {
        "domain": "blog.readability.com",
        "author": "Richard Ziade",
        "url": "http://blog.readability.com/2011/02/step-up-be-heard-readability-ideas/",
        "short_url": "http://rdd.me/kbgr5a1k",
        "title": "Step Up & Be Heard: Readability Ideas",
        "total_pages": 1,
        "word_count": 175,
        "content": "<div>\n  \n<div class=\"entry\">\n\t<p>When we launched Readability [snip] ...</div>\n</div>",
        "date_published": "2011-02-22 00:00:00",
        "next_page_id": null,
        "rendered_pages": 1
    }
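
    As a rough illustration of the request and response above, here is a minimal Python sketch. The helper names `build_parser_url` and `summarize_response` are made up for this example, and the endpoint is simply the one quoted above; I have not checked that the service still answers at that address.

```python
import json
from urllib.parse import urlencode

# Endpoint copied from the query shown above (assumption: still valid).
PARSER_ENDPOINT = "https://www.readability.com/api/content/v1/parser"


def build_parser_url(page_url: str, token: str) -> str:
    """Return the parser request URL with both parameters percent-encoded."""
    return PARSER_ENDPOINT + "?" + urlencode({"url": page_url, "token": token})


def summarize_response(body: str) -> dict:
    """Pick the fields a reader view would need out of the JSON body."""
    data = json.loads(body)
    return {
        "title": data.get("title"),
        "author": data.get("author"),
        "word_count": data.get("word_count"),
        "content_html": data.get("content"),  # the cleaned-up article HTML
    }
```

    To actually fetch a page you would pass the result of `build_parser_url(...)` to something like `urllib.request.urlopen` and hand the decoded body to `summarize_response`.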
    

    For the hard-core folks out there, check out the Node.js, Ruby and Python ports of Readability here: http://arrix.blogspot.com/2010/11/server-side-readability-with-nodejs.html
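
    If you want a feel for what a do-it-yourself scrubber involves, here is a deliberately simplified sketch using only the Python standard library. To be clear, this is not Readability's actual algorithm, just a toy heuristic in the same general spirit: score each container by how much non-link paragraph text it directly holds and keep the winner. The tag lists and the names `MainTextFinder` and `extract_main_text` are my own choices for illustration.

```python
from html.parser import HTMLParser

CONTAINERS = {"div", "article", "section", "td", "body"}
SKIPPED = {"script", "style", "nav", "header", "footer", "aside", "form"}


class MainTextFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.open_containers = [0]  # ids of open containers; 0 is the document root
        self.next_id = 1
        self.skip_depth = 0         # > 0 while inside a SKIPPED element
        self.link_depth = 0         # > 0 while inside an <a>
        self.in_paragraph = False
        self.current = []           # text pieces of the paragraph being read
        self.link_chars = 0         # characters of link text in that paragraph
        self.paragraphs = {}        # container id -> list of (text, score)

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED:
            self.skip_depth += 1
        elif tag in CONTAINERS:
            self.open_containers.append(self.next_id)
            self.next_id += 1
        elif tag == "p":
            self._close_paragraph()
            self.in_paragraph = True
        elif tag == "a":
            self.link_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in CONTAINERS:
            self._close_paragraph()
            if len(self.open_containers) > 1:
                self.open_containers.pop()
        elif tag == "p":
            self._close_paragraph()
        elif tag == "a":
            self.link_depth = max(0, self.link_depth - 1)

    def handle_data(self, data):
        if self.in_paragraph and not self.skip_depth:
            self.current.append(data)
            if self.link_depth:
                self.link_chars += len(data)

    def _close_paragraph(self):
        text = " ".join("".join(self.current).split())
        if text:
            # Link-heavy paragraphs (menus, footers) get a low score.
            score = len(text) - self.link_chars
            owner = self.open_containers[-1]
            self.paragraphs.setdefault(owner, []).append((text, score))
        self.current, self.link_chars, self.in_paragraph = [], 0, False


def extract_main_text(html: str) -> str:
    """Return the paragraphs of the highest-scoring container, or ''."""
    finder = MainTextFinder()
    finder.feed(html)
    finder.close()
    finder._close_paragraph()  # flush a trailing unclosed <p>
    if not finder.paragraphs:
        return ""
    best = max(finder.paragraphs.values(), key=lambda ps: sum(s for _, s in ps))
    return "\n\n".join(text for text, _ in best)
```

    Real-world pages need a lot more than this (class/id hints, nested containers, multi-page articles), which is why the existing ports are worth reading before rolling your own.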

    Happy coding

  • 2021-02-01 00:33

    If the web page or site in question makes good use of semantic elements and structure, you could just apply a different CSS stylesheet, which can drastically change the layout and display.
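
    One way to try this idea out, sketched in Python with only the standard library: drop the page's own stylesheets and inject a single reader-friendly one. The names `StyleSwapper`, `restyle` and `READER_CSS` are invented for this example, inline `style="..."` attributes are left untouched, and nothing is injected if the page has no `<head>`.

```python
from html.parser import HTMLParser

# Hypothetical reader stylesheet; tune to taste.
READER_CSS = "body{max-width:40em;margin:2em auto;font:1.1em/1.6 Georgia,serif}"


def _is_stylesheet_link(tag, attrs):
    return tag == "link" and "stylesheet" in (dict(attrs).get("rel") or "").lower()


class StyleSwapper(HTMLParser):
    """Re-emit the document, minus its stylesheets, plus our own <style>."""

    def __init__(self, css):
        super().__init__(convert_charrefs=False)  # keep entities as written
        self.css = css
        self.out = []
        self.in_style = False

    def handle_starttag(self, tag, attrs):
        if tag == "style":
            self.in_style = True
            return
        if _is_stylesheet_link(tag, attrs):
            return
        self.out.append(self.get_starttag_text())
        if tag == "head":
            self.out.append("<style>" + self.css + "</style>")

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> or <link ... />.
        if not _is_stylesheet_link(tag, attrs):
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "style":
            self.in_style = False
        else:
            self.out.append("</" + tag + ">")

    def handle_data(self, data):
        if not self.in_style:
            self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&" + name + ";")

    def handle_charref(self, name):
        self.out.append("&#" + name + ";")

    def handle_comment(self, data):
        self.out.append("<!--" + data + "-->")

    def handle_decl(self, decl):
        self.out.append("<!" + decl + ">")


def restyle(html: str, css: str = READER_CSS) -> str:
    swapper = StyleSwapper(css)
    swapper.feed(html)
    swapper.close()
    return "".join(swapper.out)
```

    This only pays off when the markup is already semantic; on a table-and-div soup the new stylesheet has little to hook into.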

  • 2021-02-01 00:46

    https://github.com/jiminoc/goose/wiki does something like what you're asking for. The source code is openly available, along with unit tests.
