I have a list of one million URLs to fetch. I use this list as Nutch seeds and use the basic crawl command of Nutch to fetch them. However, I find that Nutch auto
Set this property in nutch-site.xml (by default it is true, so Nutch adds outlinks to the crawldb):
<property>
<name>db.update.additions.allowed</name>
<value>false</value>
<description>If true, updatedb will add newly discovered URLs, if false
only already existing URLs in the CrawlDb will be updated and no new
URLs will be added.
</description>
</property>
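For context, Nutch reads overrides from conf/nutch-site.xml, where each property sits inside the root `<configuration>` element. A minimal sketch of the whole file, assuming you have no other overrides of your own (keep any you already have alongside this one), might look like this:

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml: site-specific overrides of nutch-default.xml -->
<configuration>
  <!-- Assumption: this is the only override in the file. -->
  <property>
    <name>db.update.additions.allowed</name>
    <!-- false: updatedb only updates URLs already in the CrawlDb -->
    <value>false</value>
  </property>
</configuration>
```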
Then run the crawl command:
nutch crawl urllist -dir crawl -depth 3 -topN 1000000
If the problem still persists, try deleting your Nutch folder and restarting the whole process.