Recrawl URL with Nutch just for updated sites

生来不讨喜

I crawled one URL with Nutch 2.1, and now I want to re-crawl its pages after they get updated. How can I do this? How can I know that a page has been updated?

3 Answers

無奈伤痛

2020-12-31 15:09

You have to schedule a job to trigger the re-crawl.
However, Nutch's AdaptiveFetchSchedule should let you crawl and index pages and detect whether a page is new or updated, so you don't have to do it manually.

Article describes the same in detail.
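
To make the idea behind an adaptive schedule concrete, here is a minimal sketch in plain Java. It is not the Nutch class itself, only a simplified model of the behaviour: shrink the re-fetch interval when a page was found changed, grow it when the page was unchanged, and clamp the result. The class name `AdaptiveIntervalSketch`, the method `nextInterval`, and the rate and limit constants are assumptions chosen for illustration; in Nutch the real values come from the `db.fetch.schedule.adaptive.*` properties in the configuration.

```java
// Simplified model of an adaptive re-fetch interval; NOT the actual Nutch class.
class AdaptiveIntervalSketch {
    // Assumed values for illustration only; Nutch reads its own from configuration.
    static final double INC_RATE = 0.4;                       // grow interval when page is unchanged
    static final double DEC_RATE = 0.2;                       // shrink interval when page changed
    static final double MIN_INTERVAL_SEC = 60.0;              // never re-fetch more often than this
    static final double MAX_INTERVAL_SEC = 365 * 24 * 3600.0; // never wait longer than this

    /** Returns the next fetch interval in seconds, clamped to [MIN, MAX]. */
    static double nextInterval(double currentIntervalSec, boolean pageChanged) {
        double next = pageChanged
                ? currentIntervalSec * (1.0 - DEC_RATE)
                : currentIntervalSec * (1.0 + INC_RATE);
        return Math.max(MIN_INTERVAL_SEC, Math.min(MAX_INTERVAL_SEC, next));
    }

    public static void main(String[] args) {
        double interval = 24 * 3600.0; // start at one day
        boolean[] observedChanges = {true, true, false, true};
        for (boolean changed : observedChanges) {
            interval = nextInterval(interval, changed);
            System.out.printf("changed=%b -> next interval %.0f s%n", changed, interval);
        }
    }
}
```

Pages that change often drift toward short intervals and static pages drift toward long ones, which is why you do not have to tune each URL by hand.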

半阙折子戏

2020-12-31 15:28

What about this post: http://pascaldimassimo.com/2010/06/11/how-to-re-crawl-with-nutch/

This is also discussed in: How to recrawle nutch

I am wondering whether the solution mentioned above will actually work. I am trying it as we speak. I crawl news sites, and they update their front pages quite frequently, so I need to re-crawl the index/front page often and fetch the newly discovered links.

忘掉有多难

2020-12-31 15:29

Simply put, you can't. You need to re-crawl a page to check whether it has been updated. So, depending on your needs, prioritize the pages/domains and re-crawl them at set intervals. For that you need a job scheduler such as Quartz.
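
A rough sketch of that prioritize-and-schedule idea, using the JDK's `ScheduledExecutorService` as a stand-in for Quartz so the example has no dependencies. Everything here is hypothetical: the `RecrawlPlanner` class, its methods, and the example.com URLs are made up for illustration, and the "re-crawl" is just a print statement where you would launch your actual Nutch job.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: tracks a re-crawl interval per URL and reports which URLs are due.
class RecrawlPlanner {
    private final Map<String, Duration> intervals = new LinkedHashMap<>();
    private final Map<String, Instant> lastCrawled = new LinkedHashMap<>();

    void register(String url, Duration interval, Instant lastCrawl) {
        intervals.put(url, interval);
        lastCrawled.put(url, lastCrawl);
    }

    /** URLs whose last crawl time plus interval is at or before 'now', in registration order. */
    List<String> dueAt(Instant now) {
        List<String> due = new ArrayList<>();
        for (Map.Entry<String, Duration> e : intervals.entrySet()) {
            Instant nextCrawl = lastCrawled.get(e.getKey()).plus(e.getValue());
            if (!nextCrawl.isAfter(now)) {
                due.add(e.getKey());
            }
        }
        return due;
    }

    void markCrawled(String url, Instant when) {
        lastCrawled.put(url, when);
    }

    public static void main(String[] args) throws InterruptedException {
        RecrawlPlanner planner = new RecrawlPlanner();
        Instant start = Instant.now();
        // High-priority page: short interval. Low-priority page: long interval.
        planner.register("http://news.example.com/", Duration.ofSeconds(1), start);
        planner.register("http://docs.example.com/", Duration.ofDays(7), start);

        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            Instant now = Instant.now();
            for (String url : planner.dueAt(now)) {
                System.out.println("would re-crawl " + url); // placeholder for the real crawl job
                planner.markCrawled(url, now);
            }
        }, 0, 500, TimeUnit.MILLISECONDS);

        Thread.sleep(2500);      // let the demo tick a few times
        scheduler.shutdownNow();
    }
}
```

With Quartz the periodic tick would become a job with a trigger, but the per-URL interval bookkeeping stays the same.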

You need to write a function that compares the pages. However, Nutch saves pages as index files by default; in other words, it generates new binary files to store the HTML. I don't think it's possible to compare those binary files, because Nutch combines all crawl results in a single file. If you want to save pages in raw HTML format for comparison, see my answer to this question.
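
If you do have the raw page content at hand, a comparison function does not need to keep the old HTML around: storing a hash of the previous fetch is enough. Below is a minimal sketch under that assumption. `PageChangeDetector` and its methods are hypothetical names, the in-memory map stands in for whatever store you use, and SHA-256 is just one reasonable choice of digest.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: remembers a digest per URL and reports whether new content differs.
class PageChangeDetector {
    private final Map<String, String> lastDigest = new HashMap<>();

    /** Hex-encoded SHA-256 of the page content. */
    static String digest(String content) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] hash = md.digest(content.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    /**
     * Stores the digest of the new content and returns true if it differs
     * from the previous fetch. A URL seen for the first time counts as changed.
     */
    boolean recordAndCheckChanged(String url, String content) {
        String current = digest(content);
        String previous = lastDigest.put(url, current);
        return !current.equals(previous);
    }

    public static void main(String[] args) {
        PageChangeDetector detector = new PageChangeDetector();
        String url = "http://news.example.com/"; // made-up URL for the demo
        System.out.println(detector.recordAndCheckChanged(url, "<html>v1</html>"));
        System.out.println(detector.recordAndCheckChanged(url, "<html>v1</html>"));
        System.out.println(detector.recordAndCheckChanged(url, "<html>v2</html>"));
    }
}
```

Keep in mind that an exact hash flags any byte-level difference, so rotating ads or embedded timestamps will register as a change; you may want to strip such parts before hashing.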
