Question
I have installed Nutch and Solr to crawl a website and search it. As you know, the parse-metatags plugin of Nutch can index the meta tags of web pages into Solr (http://wiki.apache.org/nutch/IndexMetatags). Is there any way (a plugin or anything else) to index another HTML tag, one that isn't a meta tag, into Solr? For example:
<div id=something>
me specific tag
</div>
In other words, I want to add a field (something) to Solr whose value for this page is "me specific tag".
Any ideas?
Answer 1:
I wrote my own plugin for something similar to what you want. The config file that maps a NutchDocument to a SolrDocument is $NUTCH_HOME/conf/solrindex-mapping.xml, and you can add your own fields there. You still have to fill those fields somewhere, though, and that is the plugin's job.
Here are some tips for the plugin:
- Read http://wiki.apache.org/nutch/WritingPluginExample, which shows how to build a simple plugin.
- In your plugin, implement the ParseFilter and IndexingFilter extension points.
- In YourParseFilter, you can use NodeWalker to find your specific div.
Put the parsed information into the page metadata like this:
page.putToMetadata(new Utf8("yourKEY"), ByteBuffer.wrap(YourByteArrayParsedFromMetaData));
In YourIndexingFilter, add the metadata from the page (page.getMetadata) to the NutchDocument:
doc.add("your_specific_tag", value);
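The two steps above (find the div, then carry its text through the page metadata as bytes) can be sketched without any Nutch classes. This is only an illustration under assumptions: the class and method names are made up, a plain recursive walk stands in for NodeWalker, and the JDK XML parser stands in for Nutch's HTML parser, so the sample page must be well-formed (the id attribute is quoted here, unlike in the question). In a real plugin the parse filter would receive the already parsed DOM instead of calling parse.

```java
import java.io.ByteArrayInputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical helper, not part of Nutch: it only mirrors what the two filters do.
final class DivTextSketch {

    /** Depth-first search for the first div with the given id; returns its trimmed text or null. */
    static String findDivText(Node node, String id) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            Element element = (Element) node;
            // HTML parsers may report tag names in upper case, so compare case-insensitively.
            if ("div".equalsIgnoreCase(element.getTagName())
                    && id.equals(element.getAttribute("id"))) {
                return element.getTextContent().trim();
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            String found = findDivText(child, id);
            if (found != null) {
                return found;
            }
        }
        return null;
    }

    /** Parse-filter side: encode the text as UTF-8 bytes, the form the page metadata stores. */
    static ByteBuffer toMetadataValue(String text) {
        return ByteBuffer.wrap(text.getBytes(StandardCharsets.UTF_8));
    }

    /** Indexing-filter side: decode the bytes again without consuming the original buffer. */
    static String fromMetadataValue(ByteBuffer value) {
        return StandardCharsets.UTF_8.decode(value.duplicate()).toString();
    }

    /** Stand-in for Nutch's HTML parser; accepts well-formed markup only. */
    static Node parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse sample page", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><div id=\"something\">\nme specific tag\n</div></body></html>";
        String text = findDivText(parse(page), "something");
        ByteBuffer stored = toMetadataValue(text);      // what the parse filter would store
        System.out.println(fromMetadataValue(stored));  // what the indexing filter would add to the doc
    }
}
```

The decoded string is what you would pass as value to doc.add("your_specific_tag", value).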
Most important:
Add your_specific_tag to the fields in:
- Solr config file schema.xml (and restart Solr)
<field name="your_specific_tag" type="string" stored="true" indexed="true"/>
- Nutch config file schema.xml (I don't know if this is really necessary)
- Nutch config file solrindex-mapping.xml
<field dest="your_specific_tag" source="your_specific_tag"/>
Answer 2:
Try http://lifelongprogrammer.blogspot.in/2013/08/nutch2-crawl-and-index-extra-tag.html. The tutorial shows how to extract the img tag and walks through all the steps involved.
Answer 3:
You can use one of these custom plugins to parse xml files based on xpath (or css selectors):
- https://github.com/BayanGroup/nutch-custom-search
- http://www.atlantbh.com/precise-data-extraction-with-apache-nutch/
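To give a feel for the XPath approach, here is a minimal sketch that uses only the JDK's javax.xml.xpath API. It is not how either plugin above is implemented or configured; the class name is made up, and the input must be well-formed XML (quoted attributes, lower-case tag names), which real-world HTML often is not.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; the linked plugins have their own configuration.
final class XPathFieldSketch {

    /** Evaluates the expression against the markup and returns the trimmed string value. */
    static String extract(String xhtml, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            // evaluate(String, InputSource) parses the input and returns the result as a string;
            // the result is empty when nothing matches.
            return xpath.evaluate(expression, new InputSource(new StringReader(xhtml))).trim();
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><div id=\"something\">\nme specific tag\n</div></body></html>";
        System.out.println(extract(page, "//div[@id='something']"));
    }
}
```

The expression //div[@id='something'] selects the div from the question; the returned string is the value you would index into the something field.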
Answer 4:
You may want to check Nutch Plugin, which should allow you to extract an element from a web page.
Source: https://stackoverflow.com/questions/12338967/how-to-parse-html-with-nutch-and-index-specific-tag-to-solr