Strategy for how to crawl/index frequently updated webpages?

别跟我提以往 2021-01-30 09:57

I'm trying to build a very small, niche search engine, using Nutch to crawl specific sites. Some of the sites are news/blog sites. If I crawl, say, techcrunch.com, and store and index its frontpage, that content will be out of date within hours. What is a good strategy for deciding when to re-crawl pages that update this frequently?

4 Answers
  •  花落未央 2021-01-30 10:08

    Try to keep some per-frontpage stats on update frequency. Detecting an update is easy: store the ETag/Last-Modified response headers and send back If-None-Match/If-Modified-Since headers with your next request. Keeping a running average of the update interval (say, over the last 24 crawls) lets you estimate each frontpage's update frequency fairly accurately.
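
    A minimal sketch of that idea in Java, assuming plain java.net.http for the conditional requests; FrontpageStats and its fields are illustrative names, not Nutch APIs:

        import java.net.URI;
        import java.net.http.HttpClient;
        import java.net.http.HttpRequest;
        import java.net.http.HttpResponse;
        import java.time.Duration;
        import java.time.Instant;
        import java.util.ArrayDeque;
        import java.util.Deque;

        class FrontpageStats {
            String etag;          // last ETag seen for this frontpage
            String lastModified;  // last Last-Modified header seen
            // Timestamps of recent crawls that returned changed content.
            final Deque<Instant> changeTimes = new ArrayDeque<>();

            /** Fetch the page with conditional headers, so an unchanged
             *  page costs only a cheap 304 response. Returns true if the
             *  page actually changed. */
            boolean crawl(HttpClient client, URI url) throws Exception {
                HttpRequest.Builder b = HttpRequest.newBuilder(url);
                if (etag != null)         b.header("If-None-Match", etag);
                if (lastModified != null) b.header("If-Modified-Since", lastModified);

                HttpResponse<String> resp =
                        client.send(b.build(), HttpResponse.BodyHandlers.ofString());
                if (resp.statusCode() == 304) return false;  // not modified

                etag = resp.headers().firstValue("ETag").orElse(null);
                lastModified = resp.headers().firstValue("Last-Modified").orElse(null);
                changeTimes.addLast(Instant.now());
                if (changeTimes.size() > 24) changeTimes.removeFirst(); // keep last 24
                return true;
            }

            /** Running average of the interval between observed updates. */
            Duration averageUpdateInterval() {
                if (changeTimes.size() < 2) return Duration.ofHours(1); // default guess
                Duration span = Duration.between(changeTimes.peekFirst(),
                                                 changeTimes.peekLast());
                return span.dividedBy(changeTimes.size() - 1);
            }
        }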

    After crawling a frontpage, you would determine when the next update is expected and put a new crawl job in a bucket around that time (buckets of one hour are typically a good balance between fast and polite). Every hour you would simply take the corresponding bucket and add its jobs to your job queue, as in the sketch below. This way you can have any number of crawlers and still have a lot of control over the scheduling of the individual crawls.
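    A sketch of that hourly bucketing scheme, again with illustrative names (BucketScheduler, drainCurrentBucket) rather than any real Nutch API; a production setup would persist the buckets instead of keeping them in memory:

        import java.time.Duration;
        import java.time.Instant;
        import java.time.temporal.ChronoUnit;
        import java.util.List;
        import java.util.Map;
        import java.util.Queue;
        import java.util.concurrent.ConcurrentHashMap;
        import java.util.concurrent.ConcurrentLinkedQueue;
        import java.util.concurrent.CopyOnWriteArrayList;

        class BucketScheduler {
            // One-hour buckets, keyed by the hour in which the crawl is due.
            private final Map<Instant, List<String>> buckets = new ConcurrentHashMap<>();
            private final Queue<String> jobQueue = new ConcurrentLinkedQueue<>();

            /** After a crawl, schedule the next one around the expected update. */
            void scheduleNext(String url, Duration avgUpdateInterval) {
                Instant due = Instant.now().plus(avgUpdateInterval)
                                           .truncatedTo(ChronoUnit.HOURS);
                buckets.computeIfAbsent(due, k -> new CopyOnWriteArrayList<>()).add(url);
            }

            /** Run once an hour: move the current bucket onto the shared
             *  job queue that any number of crawler threads can poll. */
            void drainCurrentBucket() {
                Instant hour = Instant.now().truncatedTo(ChronoUnit.HOURS);
                List<String> due = buckets.remove(hour);
                if (due != null) jobQueue.addAll(due);
            }
        }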
