Notes on Integrating Nutch with Qidian R3 (Part 2)

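Part 1 of these notes described how to add the index fields that Nutch needs to Qidian R3. Once those fields exist, Nutch can crawl one or more websites and send the content to the Qidian R3 index with bin/nutch solrindex.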

III. Installing and Configuring Nutch

1. Installing Nutch

First download Nutch 1.3 from http://www.apache.org/dist//nutch/apache-nutch-1.3-bin.zip and unpack it. Nutch runs on Linux and on Windows, and it can also be imported into Eclipse and run from there.

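Installation on Linux is the simplest. Upload the contents of the unpacked runtime/local directory to a directory on the Linux machine, for example /opt/nutch1.3. Then copy nutch-1.3.jar from /opt/nutch1.3/lib to /opt/nutch1.3, rename the copy to nutch-1.3.job, and make the scripts under bin executable (chmod +x /opt/nutch1.3/bin/*). A JDK is also required: set JAVA_HOME in the profile and put the JDK's bin directory on PATH. Then run bin/nutch in /opt/nutch1.3; it should print the following usage message: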

[root@test nutch-1.3]# bin/nutch
Usage: nutch [-core] COMMAND
where COMMAND is one of:
  crawl             one-step crawler for intranets
  readdb            read / dump crawl db
  convdb            convert crawl db from pre-0.9 format
  mergedb           merge crawldb-s, with optional filtering
  readlinkdb        read / dump link db
  inject            inject new urls into the database
  generate          generate new segments to fetch from crawl db
  freegen           generate new segments to fetch from text files
  fetch             fetch a segment's pages
  parse             parse a segment's pages
  readseg           read / dump segment data
  mergesegs         merge several segments, with optional filtering and slicing
  updatedb          update crawl db from segments after fetching
  invertlinks       create a linkdb from parsed segments
  mergelinkdb       merge linkdb-s, with optional filtering
  index             run the indexer on parsed segments and linkdb
  solrindex         run the solr indexer on parsed segments and linkdb
  merge             merge several segment indexes
  dedup             remove duplicates from a set of segment indexes
  solrdedup         remove duplicates from solr
  plugin            load a plugin and run one of its classes main()
  server            run a search server
 or
  CLASSNAME         run the class named CLASSNAME
Most commands print help when invoked w/o parameters.

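This output means the installation succeeded. To run in Hadoop mode, you also need to obtain some of Hadoop's run scripts and copy them into the bin directory.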
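The Linux installation steps described above can be sketched as a small shell function. This is only a sketch under assumptions: the function name install_nutch and the example paths are illustrative, and it presumes the archive has already been unpacked so that a runtime/local directory with lib/nutch-1.3.jar and a bin directory exists.

```shell
#!/bin/sh
# install_nutch SRC DEST
#   SRC  : the unpacked runtime/local directory
#   DEST : target directory, e.g. /opt/nutch1.3
# Hypothetical helper, not part of Nutch itself.
install_nutch() {
    src="$1"
    dest="$2"

    # Copy the contents of runtime/local into the target directory.
    mkdir -p "$dest" || return 1
    cp -R "$src"/. "$dest"/ || return 1

    # Copy the jar next to bin/ under the name nutch-1.3.job.
    cp "$dest/lib/nutch-1.3.jar" "$dest/nutch-1.3.job" || return 1

    # Make the launcher scripts executable.
    chmod +x "$dest"/bin/* || return 1
}

# Example call (paths are placeholders):
#   install_nutch ./runtime/local /opt/nutch1.3
```

After the function has run, changing into the target directory and typing bin/nutch should print the usage message shown above, provided JAVA_HOME is set.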

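On Windows you must install Cygwin, which provides a Linux-like runtime environment; download it from http://www.cygwin.org/cygwin/. Running Nutch under Cygwin needs the same configuration as on Linux, including JAVA_HOME and PATH.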
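Since both the Linux and the Cygwin setup depend on JAVA_HOME and PATH, a short check can save some guessing when bin/nutch fails to start. The sketch below is illustrative only: the function name check_java_env is made up for this example, and the JDK path in the comments is a placeholder rather than a value from this article.

```shell
#!/bin/sh
# Lines one might add to /etc/profile or ~/.bash_profile.
# The JDK location is a placeholder; use the real path on your machine.
#   export JAVA_HOME=/path/to/jdk
#   export PATH="$JAVA_HOME/bin:$PATH"

# check_java_env: succeed only if JAVA_HOME points at a directory
# that contains an executable bin/java. Hypothetical helper.
check_java_env() {
    if [ -z "${JAVA_HOME:-}" ]; then
        echo "JAVA_HOME is not set" >&2
        return 1
    fi
    if [ ! -x "$JAVA_HOME/bin/java" ]; then
        echo "no executable java under $JAVA_HOME/bin" >&2
        return 1
    fi
    return 0
}
```

Calling check_java_env before bin/nutch gives a clear message when the environment is incomplete.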

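There are many guides online for importing Nutch 1.3 into Eclipse, but many of them are wrong. One important step is to put conf first in the build path order:

[Screenshot: Eclipse build path order with conf at the top]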

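Then create a Java Application run configuration whose main class is org.apache.nutch.crawl.Crawl:

[Screenshot: Eclipse run configuration with main class org.apache.nutch.crawl.Crawl]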

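The corresponding program arguments are set as follows:

[Screenshot: program arguments of the run configuration]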

2. Configuring nutch-site.xml

Whether you run Nutch on Linux, under Cygwin, or in Eclipse, first edit nutch-site.xml in the conf directory and add:

<property>
  <name>http.agent.name</name>
  <value>nutch-1.3</value>
</property>

<property>
  <name>http.robots.agents</name>
  <value>nutch-1.3,*</value>
</property>


<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika|js|zip|swf|rss)|index-(basic|anchor|more)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>

In Eclipse you additionally need to add the following to nutch-site.xml:

<property>
  <name>plugin.folders</name>
  <value>./src/plugin</value>
  <description>Directories where nutch plugins are located.  Each
  element may be a relative or absolute path.  If absolute, it is used
  as is.  If relative, it is searched for on the classpath.</description>
</property>
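The property elements above belong inside the root configuration element of nutch-site.xml. As a quick sanity check after editing, something like the following sketch can confirm that the expected property names are present. The function names has_property and check_site_config are made up for this example, and the check is a plain text search: it does not validate the XML and would not notice a property that has been commented out.

```shell
#!/bin/sh
# has_property FILE NAME: succeed if FILE contains <name>NAME</name>.
has_property() {
    grep -Fq "<name>$2</name>" "$1"
}

# check_site_config FILE NAME...: report every NAME that is missing
# from FILE; return 0 only if all of them were found.
check_site_config() {
    file="$1"
    shift
    missing=0
    for name in "$@"; do
        if ! has_property "$file" "$name"; then
            echo "missing property: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example call:
#   check_site_config conf/nutch-site.xml \
#       http.agent.name http.robots.agents plugin.includes
```

When running from Eclipse, plugin.folders can be added to the list of names to check.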

3. Configuring solrindex-mapping.xml

Next, edit solrindex-mapping.xml in the Nutch 1.3 conf directory to map Nutch's index fields to the index fields defined in Qidian R3. Its contents are:

<?xml version="1.0" encoding="UTF-8"?>
<!--
 Licensed to the Apache Software Foundation (ASF) under one or more
 contributor license agreements.  See the NOTICE file distributed with
 this work for additional information regarding copyright ownership.
 The ASF licenses this file to You under the Apache License, Version 2.0
 (the "License"); you may not use this file except in compliance with
 the License.  You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
-->

<mapping>
	<!-- Simple mapping of fields created by Nutch IndexingFilters
	     to fields defined (and expected) in Solr schema.xml.

             Any fields in NutchDocument that match a name defined
             in field/@source will be renamed to the corresponding
             field/@dest.
             Additionally, if a field name (before mapping) matches
             a copyField/@source then its values will be copied to 
             the corresponding copyField/@dest.

             uniqueKey has the same meaning as in Solr schema.xml
             and defaults to "id" if not defined.
         -->
	<fields>
		<field dest="title" source="title"/>
		<field dest="text" source="content"/>
		<field dest="lastModified" source="lastModified"/>	
		<field dest="type" source="type"/>	
		<field dest="site" source="site"/>
		<field dest="anchor" source="anchor"/>
		<field dest="host" source="host"/>		
		<field dest="segment" source="segment"/>
		<field dest="boost" source="boost"/>
		<field dest="tstamp" source="tstamp"/>
		<field dest="url" source="url"/>
		<field dest="id" source="digest"/>
		<copyField source="digest" dest="digest"/>
	</fields>
	<uniqueKey>id</uniqueKey>
</mapping>
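To compare this mapping with the fields created in Qidian R3, it can help to print it as a flat list. The sketch below does this with sed. The function name list_mappings is made up for this example, and the pattern assumes each field element sits on one line with dest written before source, as in the file above; copyField lines are not listed.

```shell
#!/bin/sh
# list_mappings FILE: print one "source -> dest" line for every
# <field dest="..." source="..."/> element in FILE.
list_mappings() {
    sed -n 's/.*<field dest="\([^"]*\)" source="\([^"]*\)".*/\2 -> \1/p' "$1"
}

# Example call:
#   list_mappings conf/solrindex-mapping.xml
```

For the file above, the list includes the lines "content -> text" and "digest -> id", which are the two mappings where the Nutch field name differs from the target field name.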

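4. Configuring regex-urlfilter.txt

Edit the URL filter so that the sites you want to crawl are not filtered out. For example, to crawl the Sina website, add this rule to regex-urlfilter.txt in Nutch's conf directory: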

+^http://www.sina

The complete regex-urlfilter.txt is as follows:

# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


# The default url filter.
# Better for whole-internet crawling.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'.  The first matching pattern in the file
# determines whether a URL is included or ignored.  If no pattern
# matches, the URL is ignored.

# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
#-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|mpg|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
#-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept only Sina URLs; everything else is ignored
+^http://www.sina.
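The comments in the filter file state the rule semantics: the first matching pattern decides, and a URL that matches no pattern is ignored. The sketch below imitates that logic in plain shell so that a rule file can be tried against sample URLs before a crawl. It is an approximation under assumptions: the function name url_allowed is made up for this example, and grep -E is used in place of Java's regular expression engine, so unusual patterns may behave differently than they do inside Nutch.

```shell
#!/bin/sh
# url_allowed RULES_FILE URL
#   Return 0 if URL would be accepted, 1 if it would be ignored.
#   Comment lines and empty lines are skipped; the first rule whose
#   pattern matches anywhere in the URL decides the result.
url_allowed() {
    rules="$1"
    url="$2"
    while IFS= read -r line || [ -n "$line" ]; do
        case "$line" in
            ''|'#'*) continue ;;
        esac
        sign=${line%"${line#?}"}   # first character: '+' or '-'
        regex=${line#?}            # remainder of the line: the pattern
        if printf '%s\n' "$url" | grep -Eq -- "$regex"; then
            if [ "$sign" = "+" ]; then
                return 0
            fi
            return 1
        fi
    done < "$rules"
    # No pattern matched: the URL is ignored.
    return 1
}

# Example call:
#   url_allowed conf/regex-urlfilter.txt http://www.sina.com.cn/ && echo accepted
```

With the rules listed above, a URL ending in .gif on the Sina site is rejected by the suffix rule before the +^http://www.sina. line is reached, because the suffix rule appears earlier in the file.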