Nutch 2.2.1 + HBase + Solr: quickly building a crawler and search engine (fast, done in roughly 2 hours)

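Note: this setup is meant for a quick trial or for small data volumes; it is not suitable for production environments with large amounts of data.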

Environment:

CentOS 7
Nutch 2.2.1
Java 1.8
Ant 1.9.14
HBase 0.90.4 (standalone)
Solr 7.7

Download links:

Link: https://pan.baidu.com/s/1Tut2CcKoJ9-G-HBq8zexMQ Extraction code: v75v

Installation

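This assumes the JDK and Ant are already installed (that only means unpacking them and setting the environment variables; search online if you are unsure how).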
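Before continuing, it can help to confirm that both tools are actually on the PATH. The following is a minimal sketch; the function name `check_tools` is made up for illustration and is not part of any of the packages used here.

```shell
#!/bin/sh
# Hypothetical helper: report whether each named command is on PATH.
# Returns non-zero if at least one command is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found: $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}
```

For this guide the call would be `check_tools java ant`; `java -version` and `ant -version` then show which versions were picked up.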

Installing standalone HBase

Download and unpack

	wget http://archive.apache.org/dist/hbase/hbase-0.90.4/hbase-0.90.4.tar.gz
	tar zxf hbase-0.90.4.tar.gz
	# or just use the package I provided

Configuration (conf/hbase-site.xml)

	<configuration>
	  <property>
		<name>hbase.rootdir</name>
		<value>/data/hbase</value>
	  </property>
	  <property>
		<name>hbase.zookeeper.property.dataDir</name>
		<value>/data/zookeeper</value>
	  </property>
	</configuration>
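The configuration above hard-codes /data. As a sketch, the same file could be generated for any base directory; `render_hbase_site` is a hypothetical helper name, and writing its output to conf/hbase-site.xml assumes HBase's standard configuration file location.

```shell
#!/bin/sh
# Hypothetical helper: print an hbase-site.xml that keeps HBase and
# ZooKeeper data under the given base directory instead of /tmp.
render_hbase_site() {
    base=${1:-/data}
    cat <<EOF
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>${base}/hbase</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>${base}/zookeeper</value>
  </property>
</configuration>
EOF
}
```

A possible use, run from the directory that contains the unpacked hbase-0.90.4, is `render_hbase_site /data > hbase-0.90.4/conf/hbase-site.xml`.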

Note: hbase.rootdir is the directory where HBase stores its data; the default is /tmp/hbase-${user.name}/hbase. hbase.zookeeper.property.dataDir is the directory where ZooKeeper (HBase bundles its own ZooKeeper) stores its data; the default is /tmp/hbase-${user.name}/zookeeper.

Start

	./bin/start-hbase.sh
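To confirm that the standalone instance came up, one option is to look for the HMaster process in the output of the JDK's `jps` tool. A minimal sketch; `hmaster_listed` is a hypothetical helper that reads `jps` output on standard input.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the jps listing on stdin contains HMaster.
# Each jps line looks like "<pid> <main class name>".
hmaster_listed() {
    awk '$2 == "HMaster" { found = 1 } END { exit !found }'
}
```

It could be used as `jps | hmaster_listed && echo "HBase is running"`. If the process is missing, the files under HBase's logs directory are the first place to look.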

Installing and configuring Solr
1. Download and install
```
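	# binary release (the -src tarball would have to be built before bin/solr can run)
	wget https://mirrors.cnnic.cn/apache/lucene/solr/7.7.2/solr-7.7.2.tgz
	tar -zxvf solr-7.7.2.tgz
	cd solr-7.7.2
	./bin/solr start -force   # start Solr; -force is needed when running as root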
```
2. Add a core; see https://blog.csdn.net/weixin_39082031/article/details/78924909 for the steps.
After adding it, remember to restart Solr (replace start with restart in the command above).
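As an alternative to the linked article, Solr 7 ships a `create` subcommand in `bin/solr` that can create a core from the command line. The sketch below only assembles the command and does not run it; the core name jkj_core and the install path /data/solr-7.7.2 are taken from later in this article, `solr_create_cmd` is a hypothetical helper name, and `-force` is included on the assumption that Solr runs as root, as in the start command above.

```shell
#!/bin/sh
# Hypothetical helper: print the command that creates a core.
# Nothing is executed; review the output, then run it by hand.
solr_create_cmd() {
    solr_dir=$1
    core=$2
    echo "$solr_dir/bin/solr create -c $core -force"
}

solr_create_cmd /data/solr-7.7.2 jkj_core
```

Printing the command first keeps the step reviewable, which is handy when the path or core name differs on your machine.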

Compiling and installing Nutch (don't forget that Ant has to be set up first)

Download

	wget http://archive.apache.org/dist/nutch/2.2.1/apache-nutch-2.2.1-src.tar.gz
	tar zxf apache-nutch-2.2.1-src.tar.gz

Configuration changes

conf/nutch-site.xml

	<property>
	  <name>storage.data.store.class</name>
	  <value>org.apache.gora.hbase.store.HBaseStore</value>
	  <description>Default class for storing data</description>
	</property>

ivy/ivy.xml

	<!-- Uncomment this to use HBase as Gora backend. -->
	<dependency org="org.apache.gora" name="gora-hbase" rev="0.3" conf="*->default" />

conf/gora.properties

	gora.datastore.default=org.apache.gora.hbase.store.HBaseStore

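Compile with `ant runtime`. This step is very slow; you can look up ways to speed up Ivy, or simply let it download as is. If a download fails, fetch the package yourself and put it at the path shown in the error message.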
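Once `ant runtime` finishes, a quick way to confirm the build produced what the following steps rely on is to check for the two launcher scripts under runtime/local/bin. This is a sketch; `check_nutch_build` is a hypothetical helper, and the usage line assumes the source tree was unpacked to apache-nutch-2.2.1.

```shell
#!/bin/sh
# Hypothetical helper: verify that `ant runtime` produced the local runtime.
# $1 is the Nutch source directory in which the build was run.
check_nutch_build() {
    nutch_dir=$1
    for script in nutch crawl; do
        if [ ! -f "$nutch_dir/runtime/local/bin/$script" ]; then
            echo "missing: runtime/local/bin/$script"
            return 1
        fi
    done
    echo "ok: $nutch_dir/runtime/local"
}
```

A possible sequence is `cd apache-nutch-2.2.1 && ant runtime && check_nutch_build .`.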

On success, two directories are generated: runtime and build. All configuration changes below are made to the files under runtime/local.

Add seed URLs

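	# in the directory where you want to keep the seed list (here ~/urls, the directory passed to bin/crawl later)
	mkdir -p ~/urls
	vim ~/urls/seed.txt
	# add the URLs to crawl, one per line, for example:
	http://www.dxy.cn/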
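The same seed file can also be created without an editor. A minimal sketch; `write_seeds` is a hypothetical helper that writes one URL per line to seed.txt in the given directory.

```shell
#!/bin/sh
# Hypothetical helper: create DIR if needed and write each URL argument
# on its own line to DIR/seed.txt, replacing any previous contents.
write_seeds() {
    seed_dir=$1
    shift
    # refuse to write an empty seed list
    [ "$#" -gt 0 ] || return 1
    mkdir -p "$seed_dir" || return 1
    printf '%s\n' "$@" > "$seed_dir/seed.txt"
}
```

For example, `write_seeds ~/urls http://www.dxy.cn/`, where the directory is whichever one is later passed to bin/crawl.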

Set the URL filter rules in conf/regex-urlfilter.txt (optional)

	# comment out the next rule
	# skip URLs containing certain characters as probable queries, etc.
	#-[?*!@=]
	# accept anything else
	# comment out the next rule
	#+.
	+^http:\/\/heart\.dxy\.cn\/article\/[0-9]+$
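Before crawling, it can be worth checking which URLs the accept rule actually lets through. The sketch below approximates the rule with `grep -E`; Nutch itself evaluates the pattern with Java regular expressions, so this is only a rough check, and `url_accepted` is a hypothetical helper. The `\/` escapes are written as plain `/`, which matches the same text.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the URL matches the accept rule
# +^http:\/\/heart\.dxy\.cn\/article\/[0-9]+$ from the filter file.
url_accepted() {
    printf '%s\n' "$1" | grep -Eq '^http://heart\.dxy\.cn/article/[0-9]+$'
}

for url in http://heart.dxy.cn/article/12345 http://www.dxy.cn/; do
    if url_accepted "$url"; then
        echo "accept $url"
    else
        echo "reject $url"
    fi
done
```

With only this rule active, the seed http://www.dxy.cn/ is rejected. To my understanding Nutch also applies the URL filters when injecting seeds, so a rule that accepts the seed (and any pages that link to the articles) may be needed as well.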

Configure the agent name in conf/nutch-site.xml (required, otherwise the crawl reports an error)

	<property>
	  <name>http.agent.name</name>
	  <value>My Nutch Spider</value>
	</property>
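Since a missing agent name makes the crawl fail, a quick check of the file can save a failed run. A rough sketch, assuming the property lives in conf/nutch-site.xml with `<value>` directly after `<name>`; `has_agent_name` is a hypothetical helper, and a text match like this is not a real XML parse.

```shell
#!/bin/sh
# Hypothetical helper: succeed if FILE sets a non-empty http.agent.name.
# Whitespace is stripped first so the name and value may sit on
# separate lines, as in the snippet above.
has_agent_name() {
    tr -d ' \t\n' < "$1" |
        grep -q '<name>http\.agent\.name</name><value>[^<][^<]*</value>'
}
```

It could be called as `has_agent_name runtime/local/conf/nutch-site.xml || echo "set http.agent.name first"`.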

Final configuration step: make Solr support the data structure (schema) that Nutch stores. Edit /data/solr-7.7.2/server/solr/jkj_core/conf/managed-schema, then restart Solr.

Configuration to add (anywhere inside <schema> is fine)

		<!-- fields added for Nutch: start -->

		<fieldType name="url" class="solr.TextField" positionIncrementGap="100">
			<analyzer>
				<tokenizer class="solr.StandardTokenizerFactory"/>
				<filter class="solr.LowerCaseFilterFactory"/>
				<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"/>
			</analyzer>
		</fieldType>

		<fieldType name="date" class="solr.TrieDateField" omitNorms="true" precisionStep="0" positionIncrementGap="0"/>

		<!-- A Trie based date field for faster date range queries and date faceting. -->
		<fieldType name="tdate" class="solr.TrieDateField" omitNorms="true" precisionStep="6" positionIncrementGap="0"/>

		<!-- core fields -->
		<field name="batchId" type="string" stored="true" indexed="false"/>
		<field name="digest" type="string" stored="true" indexed="false"/>
		<field name="boost" type="pfloat" stored="true" indexed="false"/>

		<!-- fields for index-basic plugin -->
		<field name="host" type="url" stored="false" indexed="true"/>
		<field name="url" type="url" stored="true" indexed="true" required="true"/>
		<field name="orig" type="url" stored="true" indexed="true" />
		<!-- stored=true for highlighting, use term vectors  and positions for fast highlighting -->
		<field name="content" type="text_general" stored="true" indexed="true"/>
		<field name="title" type="text_general" stored="true" indexed="true"/>
		<field name="cache" type="string" stored="true" indexed="false"/>
		<field name="tstamp" type="date" stored="true" indexed="false"/>

		<!-- catch-all field -->
		<field name="text" type="text_general" stored="false" indexed="true" multiValued="true"/>

		<!-- fields for index-anchor plugin -->
		<field name="anchor" type="text_general" stored="true" indexed="true"
			multiValued="true"/>

		<!-- fields for index-more plugin -->
		<field name="type" type="string" stored="true" indexed="true" multiValued="true"/>
		<field name="contentLength" type="string" stored="true" indexed="false"/>
		<field name="lastModified" type="date" stored="true" indexed="false"/>
		<field name="date" type="tdate" stored="true" indexed="true"/>

		<!-- fields for languageidentifier plugin -->
		<field name="lang" type="string" stored="true" indexed="true"/>

		<!-- fields for subcollection plugin -->
		<field name="subcollection" type="string" stored="true" indexed="true" multiValued="true"/>

		<!-- fields for feed plugin (tag is also used by microformats-reltag)-->
		<field name="author" type="string" stored="true" indexed="true"/>
		<field name="tag" type="string" stored="true" indexed="true" multiValued="true"/>
		<field name="feed" type="string" stored="true" indexed="true"/>
		<field name="publishedDate" type="date" stored="true" indexed="true"/>
		<field name="updatedDate" type="date" stored="true" indexed="true"/>

		<!-- fields for creativecommons plugin -->
		<field name="cc" type="string" stored="true" indexed="true" multiValued="true"/>

		<!-- fields for tld plugin -->    
		<field name="tld" type="string" stored="false" indexed="false"/>

		<!-- fields added for Nutch: end -->
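After editing, a simple text check can confirm that the fields were actually added to the file. A rough sketch; `missing_fields` is a hypothetical helper that prints every requested field name for which the file has no `<field name="...">` entry.

```shell
#!/bin/sh
# Hypothetical helper: print each field name (arguments 2..n) that has no
# <field name="..."> declaration in the schema file given as argument 1.
missing_fields() {
    schema=$1
    shift
    for field in "$@"; do
        grep -q "<field name=\"$field\"" "$schema" || echo "$field"
    done
}
```

For example, `missing_fields /data/solr-7.7.2/server/solr/jkj_core/conf/managed-schema url content title tstamp digest boost` prints nothing when all six fields are declared.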

Start the Nutch crawl

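	# run from runtime/local under the Nutch directory, so that bin is runtime/local/bin
	./bin/crawl ~/urls/ jkj http://192.168.1.61:8983/solr/jkj_core 2
	# ~/urls/ is the directory that holds my seed URL file
	# jkj is the crawl ID I chose, roughly the ID under which the data is stored in HBase; the table is created automatically
	# http://192.168.1.61:8983/solr/jkj_core is the address of the core created in Solr
	# 2 is the crawl depth (number of rounds)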

Check the results through Solr or HBase
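For example, Solr's `select` handler returns indexed documents over HTTP. The sketch below only builds the query URL from the core address used in the crawl command; `solr_select_url` is a hypothetical helper, and `q=*:*` (match all documents) with a small `rows` value is just one reasonable first query.

```shell
#!/bin/sh
# Hypothetical helper: build a Solr select URL that returns the first
# ROWS documents of a core as JSON.
solr_select_url() {
    core_url=$1
    rows=${2:-5}
    echo "${core_url}/select?q=*:*&rows=${rows}&wt=json"
}

solr_select_url http://192.168.1.61:8983/solr/jkj_core 5
```

The printed URL can be opened in a browser or fetched with `curl "$(solr_select_url http://192.168.1.61:8983/solr/jkj_core)"`. On the HBase side, running `list` inside `./bin/hbase shell` should show the table Nutch created; I would expect a name derived from the crawl ID, such as jkj_webpage, but that name is an assumption.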

Source: https://my.oschina.net/haitaohu/blog/3111221
