Nutch not crawling URLs except the one specified in seed.txt

悲&欢浪女 2021-01-15 19:53

I am using Apache Nutch 1.12, and the URL I am trying to crawl is something like https://www.mywebsite.com/abc-def/, which is the only entry in my seed.txt file. Since I don't

2 Answers
  • 2021-01-15 20:09

    Got this working after trying multiple things over the last two days. Here is the solution:

    Since the website I was crawling was very heavy, the http.content.limit property in nutch-default.xml was truncating each downloaded page to 65536 bytes (the default). The links I wanted to crawl unfortunately didn't fall within the retained part, so Nutch wasn't crawling them. When I removed the limit by putting the following value in nutch-site.xml, it started crawling my pages:

    <property>
      <name>http.content.limit</name>
      <value>-1</value>
      <description>The length limit for downloaded content using the http://
      protocol, in bytes. If this value is nonnegative (>=0), content longer
      than it will be truncated; otherwise, no truncation at all. Do not
      confuse this setting with the file.content.limit setting.
      </description>
    </property>
    
  • 2021-01-15 20:14

    You may try tweaking the properties available in conf/nutch-default.xml, for example to control the number of outlinks you want to follow or to adjust the fetch properties. If you decide to override any property, copy it into conf/nutch-site.xml and set the new value there, as in the sketch below.
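
    As a minimal sketch of such an override: db.max.outlinks.per.page is the property in nutch-default.xml that caps how many outlinks are processed per page, and copying it into conf/nutch-site.xml replaces its value. The value of -1 (no limit) below is only an illustrative choice, not a recommendation.

    <property>
      <name>db.max.outlinks.per.page</name>
      <value>-1</value>
      <description>Maximum number of outlinks processed for a page; a
      negative value means all outlinks are kept. Overridden here from
      the default in nutch-default.xml.</description>
    </property>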
