Using Nutch to crawl a specified URL list

星月不相逢 2021-01-16 06:32

I have a list of one million URLs to fetch. I use this list as Nutch seeds and run the basic crawl command of Nutch to fetch them. However, I find that Nutch automatically adds and crawls newly discovered outlinks that are not in my seed list. How can I restrict the crawl to only the URLs I specified?

2 Answers
  • 2021-01-16 06:52

    Set this property in nutch-site.xml (by default it is true, so updatedb adds newly discovered outlinks to the crawldb):

    <property>
      <name>db.update.additions.allowed</name>
      <value>false</value>
      <description>If true, updatedb will add newly discovered URLs, if false
      only already existing URLs in the CrawlDb will be updated and no new
      URLs will be added.
      </description>
    </property>
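
    For reference, a minimal sketch of how this override sits in conf/nutch-site.xml; the surrounding <configuration> element is the standard Hadoop-style wrapper, and any other overrides you already have would sit alongside it:

    <?xml version="1.0"?>
    <configuration>
      <!-- Stop updatedb from adding newly discovered outlinks,
           so only the injected seed URLs are ever fetched -->
      <property>
        <name>db.update.additions.allowed</name>
        <value>false</value>
      </property>
    </configuration>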
    
  • 2021-01-16 06:56
    • Delete the crawl and urls directories (if created before)
    • Create and update the seed file (one URL per line)
    • Restart the crawling process (a combined shell sketch of these steps follows the command explanation below)

    Command

    nutch crawl urllist -dir crawl -depth 3 -topN 1000000
    
    • urllist - directory containing the seed file (URL list)
    • crawl - directory where the crawl data (crawldb, segments) will be stored
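
    A rough shell sketch of the steps above; the seed-file name seed.txt and the example URLs are placeholders, and the bin/nutch crawl command shown is the legacy Nutch 1.x one from the answer:

    # remove previous crawl output and seed directory, if any
    rm -rf crawl urllist
    # recreate the seed directory with one URL per line in the seed file
    mkdir urllist
    cat > urllist/seed.txt <<EOF
    http://example.com/
    http://example.org/
    EOF
    # restart the crawl from the fresh seed list
    bin/nutch crawl urllist -dir crawl -depth 3 -topN 1000000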

    If the problem still persists, delete your Nutch crawl folder and restart the whole process.
