How to parse HTML with Nutch and index a specific tag to Solr?


I have installed Nutch and Solr to crawl a website and search it. As you know, we can index the meta tags of web pages into Solr with the parse meta tags plugin of Nutch. (http:


4 Answers

盖世英雄少女心

2021-01-13 08:01
I built my own plugin for something similar to what you want. The config file that maps a NutchDocument to a SolrDocument is $NUTCH_HOME/conf/solrindex-mapping.xml, and you can add your own fields there. You still have to populate those fields somewhere, though, and that is the plugin's job.

Here are some tips for writing the plugin:
- Read http://wiki.apache.org/nutch/WritingPluginExample, which shows how to build a simple plugin.
- In your plugin, implement ParseFilter and IndexingFilter.
- In YourParseFilter, you can use NodeWalker to find your specific div.
- Put the parsed information into the page metadata like this:

  page.putToMetadata(new Utf8("yourKEY"), ByteBuffer.wrap(YourByteArrayParsedFromMetaData));
- In YourIndexingFilter, add the metadata from the page (page.getMetadata) to the NutchDocument:

  doc.add("your_specific_tag", value);
- Most important: add your_specific_tag to the fields of:
  - the Solr config file schema.xml (and restart Solr):
    <field name="your_specific_tag" type="string" stored="true" indexed="true"/>
  - the Nutch config file schema.xml (I don't know whether this is really necessary)
  - the Nutch config file solrindex-mapping.xml:
    <field dest="your_specific_tag" source="your_specific_tag"/>
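
To make the parse-filter step a bit more concrete, here is a minimal sketch in plain Java. It is not a Nutch plugin: a simple recursive walk over a JDK org.w3c.dom tree stands in for NodeWalker, and the class name DivExtractSketch, the "price" class and the sample markup are all made up for illustration. It only shows the two ideas from the tips above: find one specific div, then wrap its text as a ByteBuffer the way a metadata value would be stored.

```java
import java.io.ByteArrayInputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical stand-alone sketch; none of these names come from Nutch.
public class DivExtractSketch {

    // Depth-first walk: returns the trimmed text of the first <div> whose
    // class attribute equals cssClass, or null if there is no such div.
    static String findDivText(Node node, String cssClass) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            Element el = (Element) node;
            if ("div".equalsIgnoreCase(el.getTagName())
                    && cssClass.equals(el.getAttribute("class"))) {
                return el.getTextContent().trim();
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            String hit = findDivText(children.item(i), cssClass);
            if (hit != null) {
                return hit;
            }
        }
        return null;
    }

    // Encode the extracted text as UTF-8 bytes wrapped in a ByteBuffer.
    static ByteBuffer toMetadataValue(String text) {
        return ByteBuffer.wrap(text.getBytes(StandardCharsets.UTF_8));
    }

    // Decode a metadata value back to a String without consuming the buffer.
    static String fromMetadataValue(ByteBuffer value) {
        return StandardCharsets.UTF_8.decode(value.duplicate()).toString();
    }

    // Parse well-formed XHTML with the JDK parser (assumption: real pages
    // would already arrive as a DOM tree from the crawler's HTML parser).
    static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse sample markup", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><div class=\"price\"> 42 EUR </div></body></html>";
        String text = findDivText(parse(page), "price");
        ByteBuffer value = toMetadataValue(text);
        System.out.println(fromMetadataValue(value));
    }
}
```

In a real plugin the walk would live in YourParseFilter and the decoded string would be what YourIndexingFilter passes to doc.add.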
长发绾君心

2021-01-13 08:01
You can use one of these custom plugins to parse XML files based on XPath (or CSS selectors):
- https://github.com/BayanGroup/nutch-custom-search
- http://www.atlantbh.com/precise-data-extraction-with-apache-nutch/
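
As a rough illustration of what XPath-based extraction looks like, here is a small sketch that uses only the JDK's javax.xml.xpath package. It does not use the API or configuration format of either plugin above; the class name XPathExtractSketch, the sample markup and the expression are assumptions made up for this example.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical stand-alone sketch; not part of either plugin.
public class XPathExtractSketch {

    // Evaluates the XPath expression against well-formed XML and returns the
    // string value of the first match (empty string when nothing matches).
    static String extract(String xml, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expression, new InputSource(new StringReader(xml))).trim();
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException(
                    "Bad XPath or unparsable input: " + expression, e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body>"
                + "<div id=\"title\">Hello Nutch</div>"
                + "<div id=\"other\">skip me</div>"
                + "</body></html>";
        System.out.println(extract(page, "//div[@id='title']/text()"));
    }
}
```

The plugins linked above let you declare such expressions in configuration rather than in code, so check their own documentation for the exact format.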
野趣味

2021-01-13 08:03

You may want to check out Nutch Plugin, which should let you extract an element from a web page.

北海茫月

2021-01-13 08:18

Just try http://lifelongprogrammer.blogspot.in/2013/08/nutch2-crawl-and-index-extra-tag.html. The tutorial explains how to extract the img tag and lists all the steps involved...
