How to parse HTML with Nutch and index a specific tag to Solr?

别那么骄傲 2021-01-13 07:25

I have installed Nutch and Solr to crawl a website and search it. As you know, we can index the meta tags of web pages into Solr with the parse meta tags plugin of Nutch (http:

4 answers
  • 2021-01-13 08:01

    I wrote my own plugin for something similar to what you want. The config file that maps a NutchDocument to a SolrDocument is $NUTCH_HOME/conf/solrindex-mapping.xml, and you can add your own tags there. You still have to populate those tags somewhere, though.

    Here are some tips for writing the plugin:

    • Read http://wiki.apache.org/nutch/WritingPluginExample, which shows how to build a simple plugin.
    • In your plugin, implement ParseFilter and IndexingFilter.
    • In YourParseFilter, you can use NodeWalker to find your specific div.
    • Put the parsed information into the page metadata like this:

      page.putToMetadata(new Utf8("yourKEY"), ByteBuffer.wrap(YourByteArrayParsedFromMetaData));

    • In YourIndexingFilter, add the metadata from the page (page.getMetadata) to the NutchDocument:

      doc.add("your_specific_tag", value);

    • Most important: add your_specific_tag to the fields of:

      • The Solr config file schema.xml (and restart Solr):

      <field name="your_specific_tag" type="string" stored="true" indexed="true"/>

      • The Nutch config file schema.xml (I don't know whether this is really necessary)
      • The Nutch config file solrindex-mapping.xml:

      <field dest="your_specific_tag" source="your_specific_tag"/>
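
The parse and index steps above can be sketched without Nutch on the classpath. This is only an illustration using the JDK's own DOM classes: `findDivText` stands in for the NodeWalker traversal, the two maps stand in for the page metadata and the NutchDocument, and the class name, the `price` CSS class, and the sample markup are all made up for the example. In a real plugin the same logic would live inside your ParseFilter and IndexingFilter.

```java
import java.io.StringReader;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical stand-alone sketch; none of these names come from Nutch.
class DivExtractionSketch {

    // Depth-first walk that returns the trimmed text of the first <div>
    // whose class attribute equals cssClass, or null if there is none.
    static String findDivText(Node node, String cssClass) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            Element el = (Element) node;
            // Tag-name case can vary by HTML parser, so compare case-insensitively.
            if ("div".equalsIgnoreCase(el.getTagName())
                    && cssClass.equals(el.getAttribute("class"))) {
                return el.getTextContent().trim();
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            String found = findDivText(child, cssClass);
            if (found != null) {
                return found;
            }
        }
        return null;
    }

    // Parse side: the answer stores metadata values as a ByteBuffer.
    static ByteBuffer toMetadataValue(String text) {
        return ByteBuffer.wrap(text.getBytes(StandardCharsets.UTF_8));
    }

    // Index side: decode the bytes back into a string for the index field.
    // duplicate() leaves the stored buffer's position untouched.
    static String fromMetadataValue(ByteBuffer value) {
        return StandardCharsets.UTF_8.decode(value.duplicate()).toString();
    }

    // Assumption: the input is well-formed XHTML so the JDK XML parser accepts it.
    static Document parse(String xhtml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body><div class=\"price\"> 42 EUR </div>"
                + "<div class=\"other\">x</div></body></html>";

        // "Parse filter" step: find the div and store it under yourKEY.
        Map<String, ByteBuffer> metadata = new HashMap<>();
        String text = findDivText(parse(page), "price");
        if (text != null) {
            metadata.put("yourKEY", toMetadataValue(text));
        }

        // "Indexing filter" step: copy the metadata value into the index field.
        Map<String, String> doc = new HashMap<>();
        doc.put("your_specific_tag", fromMetadataValue(metadata.get("yourKEY")));
        System.out.println(doc); // prints {your_specific_tag=42 EUR}
    }
}
```

Storing the value as UTF-8 bytes and decoding it again on the index side mirrors the ByteBuffer.wrap call in the tips above; the field name on the index side must match the one declared in schema.xml and solrindex-mapping.xml.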

  • 2021-01-13 08:01

    You can use one of these custom plugins to parse XML files based on XPath (or CSS selectors):

    • https://github.com/BayanGroup/nutch-custom-search
    • http://www.atlantbh.com/precise-data-extraction-with-apache-nutch/
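
To give an idea of what XPath-based extraction looks like, here is a minimal sketch that uses only the JDK's javax.xml.xpath package. It does not reproduce the configuration format or API of either plugin above; the class name, the sample markup, and the expression are invented for illustration, and it assumes the input is well-formed XML.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical stand-alone sketch; not the API of the linked plugins.
class XPathExtractionSketch {

    // Evaluates the XPath expression against the XML text and returns the
    // trimmed string value of the first matching node ("" if nothing matches).
    static String extract(String xml, String expression) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(expression, new InputSource(new StringReader(xml))).trim();
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body><div class=\"title\"> Hello Nutch </div></body></html>";
        System.out.println(extract(page, "//div[@class='title']")); // prints Hello Nutch
    }
}
```

The extracted string would then be indexed into a Solr field in the same way as any other parsed value.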
  • 2021-01-13 08:03

    You may want to check the Nutch Plugin, which should allow you to extract an element from a web page.

  • 2021-01-13 08:18

    Just try http://lifelongprogrammer.blogspot.in/2013/08/nutch2-crawl-and-index-extra-tag.html. The tutorial shows how to get the img tag and mentions all the steps involved...
