Using Nutch to crawl a specified URL list

星月不相逢 2021-01-16 06:32

I have a list of one million URLs to fetch. I use this list as the Nutch seeds and run the basic Nutch crawl command to fetch them. However, I find that Nutch auto

2 answers
  •  礼貌的吻别
    2021-01-16 06:52

    Set this property in nutch-site.xml. By default it is true, so updatedb adds outlinks to the CrawlDb.

    
    <property>
      <name>db.update.additions.allowed</name>
      <value>false</value>
      <description>If true, updatedb will add newly discovered URLs, if false
      only already existing URLs in the CrawlDb will be updated and no new
      URLs will be added.
      </description>
    </property>
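
If you would rather script the change than edit the file by hand, something like the following might work. It is only a rough sketch using Python's standard `xml.etree.ElementTree`; the helper name `set_nutch_property` and the empty `<configuration>` example are made up for illustration and are not part of Nutch. Note that re-serializing this way drops the XML declaration, stylesheet line, and comments from a real nutch-site.xml, so check the result before using it.

```python
import xml.etree.ElementTree as ET


def set_nutch_property(site_xml: str, name: str, value: str) -> str:
    """Return site_xml with the named <property> set to value.

    Hypothetical helper: updates the property if it already exists,
    otherwise appends a new <property> under the root element.
    """
    root = ET.fromstring(site_xml)
    for prop in root.findall("property"):
        if prop.findtext("name") == name:
            node = prop.find("value")
            if node is None:
                # Property exists but has no <value> child yet.
                node = ET.SubElement(prop, "value")
            node.text = value
            break
    else:
        # No matching property found: add a new one.
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# Assumed minimal nutch-site.xml content, for illustration only.
example = "<configuration></configuration>"
updated = set_nutch_property(example, "db.update.additions.allowed", "false")
print(updated)
```

With the property set to false, updatedb should only update the URLs that were injected from the seed list.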
      
    
    
