Nutch not crawling URLs except the one specified in seed.txt

悲&欢浪女 2021-01-15 19:53

I am using Apache Nutch 1.12, and the URL I am trying to crawl is something like https://www.mywebsite.com/abc-def/, which is the only entry in my seed.txt file. Since I don't

2 Answers
  • 2021-01-15 20:09

    Got this working after trying multiple things over the last two days. Here is the solution:

    Since the website I was crawling was very heavy, the http.content.limit property in nutch-default.xml was truncating each downloaded page to 65536 bytes (the default). The links I wanted to crawl unfortunately didn't fall within the retained part, so Nutch wasn't crawling them. When I removed the limit by putting the following value in nutch-site.xml, it started crawling my pages:

    <property>
      <name>http.content.limit</name>
      <value>-1</value>
      <description>The length limit for downloaded content using the http://
      protocol, in bytes. If this value is nonnegative (>=0), content longer
      than it will be truncated; otherwise, no truncation at all. Do not
      confuse this setting with the file.content.limit setting.
      </description>
    </property>
    
  • 2021-01-15 20:14

    You may try tweaking the properties available in conf/nutch-default.xml, for example to control the number of outlinks you want to follow or to adjust the fetch properties. If you decide to override any property, copy it into conf/nutch-site.xml and set the new value there, as in the sketch below.
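
    As a minimal sketch of such an override: db.max.outlinks.per.page is the property in nutch-default.xml that caps how many outlinks are processed per page, and copying it into conf/nutch-site.xml replaces its value. The value of -1 (no limit) below is only an illustrative choice, not a recommendation.

    <property>
      <name>db.max.outlinks.per.page</name>
      <value>-1</value>
      <description>Maximum number of outlinks processed for a page; a
      negative value means all outlinks are kept. Overridden here from
      the default in nutch-default.xml.</description>
    </property>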
