Using Nutch to crawl a specified URL list

星月不相逢 2021-01-16 06:32

I have a list of one million URLs to fetch. I use this list as Nutch seeds and run the basic crawl command of Nutch to fetch them. However, I find that Nutch automatically adds and crawls newly discovered outlinks that are not in my seed list. How can I restrict the crawl to only the URLs I specified?

2 Answers
  • 2021-01-16 06:52

    Set this property in nutch-site.xml (by default it is true, so updatedb adds newly discovered outlinks to the crawldb):

    <property>
      <name>db.update.additions.allowed</name>
      <value>false</value>
      <description>If true, updatedb will add newly discovered URLs, if false
      only already existing URLs in the CrawlDb will be updated and no new
      URLs will be added.
      </description>
    </property>
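
    For reference, a minimal sketch of how this override sits in conf/nutch-site.xml; the surrounding <configuration> element is the standard Hadoop-style wrapper, and any other overrides you already have would sit alongside it:

    <?xml version="1.0"?>
    <configuration>
      <!-- Stop updatedb from adding newly discovered outlinks,
           so only the injected seed URLs are ever fetched -->
      <property>
        <name>db.update.additions.allowed</name>
        <value>false</value>
      </property>
    </configuration>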
    
  • 2021-01-16 06:56
    • Delete the crawl and urls directories (if created before)
    • Create and update the seed file (one URL per line)
    • Restart the crawling process (a combined shell sketch of these steps follows the command explanation below)

    Command

    nutch crawl urllist -dir crawl -depth 3 -topN 1000000
    
    • urllist - directory containing the seed file (URL list)
    • crawl - directory where the crawl data (crawldb, segments) will be stored
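
    A rough shell sketch of the steps above; the seed-file name seed.txt and the example URLs are placeholders, and the bin/nutch crawl command shown is the legacy Nutch 1.x one from the answer:

    # remove previous crawl output and seed directory, if any
    rm -rf crawl urllist
    # recreate the seed directory with one URL per line in the seed file
    mkdir urllist
    cat > urllist/seed.txt <<EOF
    http://example.com/
    http://example.org/
    EOF
    # restart the crawl from the fresh seed list
    bin/nutch crawl urllist -dir crawl -depth 3 -topN 1000000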

    If the problem still persists, delete your Nutch crawl folder and restart the whole process.
