I have installed Nutch and Solr to crawl a website and search it. As you know, we can index the meta tags of web pages into Solr with Nutch's parse-metatags plugin. (http:
I made my own plugin for something similar to what you want. The config file that maps a NutchDocument to a SolrDocument is $NUTCH_HOME/conf/solrindex-mapping.xml; you can add your own tags there. But you still have to fill those tags somewhere.
Here are some tips for the plugin:
In your parse filter, put the parsed information into the page metadata like this:
page.putToMetadata(new Utf8("yourKEY"), ByteBuffer.wrap(YourByteArrayParsedFromMetaData));
In your IndexingFilter, add the metadata from the page (page.getMetadata()) to the NutchDocument:
doc.add("your_specific_tag", value);
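The two steps above form a round trip: the parse filter serializes the value into a ByteBuffer, and the indexing filter has to decode it back into a String before calling doc.add. That encode/decode step can be sketched in plain Java, outside the Nutch APIs (the class and method names here are illustrative, not part of Nutch):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class MetadataBytes {

    // What page.putToMetadata stores: the parsed value as UTF-8 bytes in a ByteBuffer.
    static ByteBuffer toMetadata(String parsedValue) {
        return ByteBuffer.wrap(parsedValue.getBytes(StandardCharsets.UTF_8));
    }

    // What your IndexingFilter should do before doc.add: decode the bytes back to a String.
    static String fromMetadata(ByteBuffer stored) {
        byte[] raw = new byte[stored.remaining()];
        stored.duplicate().get(raw); // duplicate() so the original buffer's position is untouched
        return new String(raw, StandardCharsets.UTF_8);
    }
}
```

If you skip the decode and call toString() on the ByteBuffer itself, you index the buffer's debug representation instead of your tag's content, which is a classic source of garbage values in the index.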
Most important: declare your_specific_tag in the fields of conf/schema.xml and conf/solrindex-mapping.xml:
<field name="your_specific_tag" type="string" stored="true" indexed="true"/>
<field dest="your_specific_tag" source="your_specific_tag"/>
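Assuming the stock layout of both files (field declarations nested inside a <fields> element), the two additions would sit roughly like this; your_specific_tag is just a placeholder name:

```xml
<!-- $NUTCH_HOME/conf/schema.xml (copied into your Solr core): declare the field -->
<fields>
  ...
  <field name="your_specific_tag" type="string" stored="true" indexed="true"/>
</fields>

<!-- $NUTCH_HOME/conf/solrindex-mapping.xml: map the NutchDocument field to the Solr field -->
<mapping>
  <fields>
    ...
    <field dest="your_specific_tag" source="your_specific_tag"/>
  </fields>
</mapping>
```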
You can use one of these custom plugins to parse XML files based on XPath (or CSS selectors):
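If you only need XPath and your pages are well-formed XML/XHTML, the JDK's built-in javax.xml.xpath can do the extraction such plugins wrap. A minimal sketch (the class name and example markup are mine, not from any plugin):

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class XPathExtract {

    // Evaluate an XPath expression against an XML string and return the matched text.
    static String extract(String xml, String expr) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        return XPathFactory.newInstance().newXPath().evaluate(expr, doc);
    }
}
```

For example, extract("<html><head><meta name=\"author\" content=\"Jane\"/></head></html>", "//meta[@name='author']/@content") yields "Jane". Note that real-world HTML is rarely well-formed, which is why such plugins typically run an HTML cleaner (e.g. TagSoup) before applying XPath.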
You may want to check the Nutch plugin system, which should allow you to extract an element from a web page.
Just try the tutorial at http://lifelongprogrammer.blogspot.in/2013/08/nutch2-crawl-and-index-extra-tag.html — it shows how to extract the img tag and walks through all the steps.