Strategy for how to crawl/index frequently updated webpages?

别跟我提以往 2021-01-30 09:57

I'm trying to build a very small, niche search engine, using Nutch to crawl specific sites. Some of the sites are news/blog sites. If I crawl, say, techcrunch.com, and store and index its frontpage, that content will be out of date within hours. What is a good strategy for deciding when to re-crawl pages that update this frequently?

4 Answers
  •  花落未央 2021-01-30 10:08

    Try to keep some per-frontpage stats on update frequency. Detecting an update is easy: store the ETag/Last-Modified response headers and send back If-None-Match/If-Modified-Since headers with your next request. Keeping a running average of the update interval (say, over the last 24 crawls) lets you estimate each frontpage's update frequency fairly accurately.
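
    A minimal sketch of that idea in Java, assuming plain java.net.http for the conditional requests; FrontpageStats and its fields are illustrative names, not Nutch APIs:

        import java.net.URI;
        import java.net.http.HttpClient;
        import java.net.http.HttpRequest;
        import java.net.http.HttpResponse;
        import java.time.Duration;
        import java.time.Instant;
        import java.util.ArrayDeque;
        import java.util.Deque;

        class FrontpageStats {
            String etag;          // last ETag seen for this frontpage
            String lastModified;  // last Last-Modified header seen
            // Timestamps of recent crawls that returned changed content.
            final Deque<Instant> changeTimes = new ArrayDeque<>();

            /** Fetch the page with conditional headers, so an unchanged
             *  page costs only a cheap 304 response. Returns true if the
             *  page actually changed. */
            boolean crawl(HttpClient client, URI url) throws Exception {
                HttpRequest.Builder b = HttpRequest.newBuilder(url);
                if (etag != null)         b.header("If-None-Match", etag);
                if (lastModified != null) b.header("If-Modified-Since", lastModified);

                HttpResponse<String> resp =
                        client.send(b.build(), HttpResponse.BodyHandlers.ofString());
                if (resp.statusCode() == 304) return false;  // not modified

                etag = resp.headers().firstValue("ETag").orElse(null);
                lastModified = resp.headers().firstValue("Last-Modified").orElse(null);
                changeTimes.addLast(Instant.now());
                if (changeTimes.size() > 24) changeTimes.removeFirst(); // keep last 24
                return true;
            }

            /** Running average of the interval between observed updates. */
            Duration averageUpdateInterval() {
                if (changeTimes.size() < 2) return Duration.ofHours(1); // default guess
                Duration span = Duration.between(changeTimes.peekFirst(),
                                                 changeTimes.peekLast());
                return span.dividedBy(changeTimes.size() - 1);
            }
        }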

    After crawling a frontpage, you would determine when the next update is expected and put a new crawl job in a bucket around that time (buckets of one hour are typically a good balance between fast and polite). Every hour you would simply take the corresponding bucket and add its jobs to your job queue, as in the sketch below. This way you can have any number of crawlers and still have a lot of control over the scheduling of the individual crawls.
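    A sketch of that hourly bucketing scheme, again with illustrative names (BucketScheduler, drainCurrentBucket) rather than any real Nutch API; a production setup would persist the buckets instead of keeping them in memory:

        import java.time.Duration;
        import java.time.Instant;
        import java.time.temporal.ChronoUnit;
        import java.util.List;
        import java.util.Map;
        import java.util.Queue;
        import java.util.concurrent.ConcurrentHashMap;
        import java.util.concurrent.ConcurrentLinkedQueue;
        import java.util.concurrent.CopyOnWriteArrayList;

        class BucketScheduler {
            // One-hour buckets, keyed by the hour in which the crawl is due.
            private final Map<Instant, List<String>> buckets = new ConcurrentHashMap<>();
            private final Queue<String> jobQueue = new ConcurrentLinkedQueue<>();

            /** After a crawl, schedule the next one around the expected update. */
            void scheduleNext(String url, Duration avgUpdateInterval) {
                Instant due = Instant.now().plus(avgUpdateInterval)
                                           .truncatedTo(ChronoUnit.HOURS);
                buckets.computeIfAbsent(due, k -> new CopyOnWriteArrayList<>()).add(url);
            }

            /** Run once an hour: move the current bucket onto the shared
             *  job queue that any number of crawler threads can poll. */
            void drainCurrentBucket() {
                Instant hour = Instant.now().truncatedTo(ChronoUnit.HOURS);
                List<String> due = buckets.remove(hour);
                if (due != null) jobQueue.addAll(due);
            }
        }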
