Question
I am using storm-crawler-elastic. I can see the fetched URLs and their statuses. Changing the configuration in the ES_IndexInit.sh file gives only url, title, host, and text. Can I store the entire HTML content, including the HTML tags?
Answer 1:
The ES IndexerBolt gets the content of pages from the ParseFilter but does not do anything with it. One option would be to modify the code so that it pulls the content field from the incoming tuples and indexes it.
Alternatively, you could implement a custom ParseFilter which would copy the content of the page into a metadata key value and configure that field to be indexed via indexer.md.mapping in the configuration file.
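As a rough illustration of the second option, here is a minimal sketch of the copy step in plain Java, with no StormCrawler dependency so that it runs on its own. In a real custom ParseFilter this logic would sit inside the filter method, which receives the raw page bytes, and the value would go into the document's metadata rather than a plain map. The class name RawHtmlCopier, the key parse.rawhtml and the UTF-8 decoding are all assumptions made for the example, not part of StormCrawler.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Stand-in for the body of a custom ParseFilter (hypothetical class name).
class RawHtmlCopier {

    // Hypothetical metadata key; it would have to match the key used in indexer.md.mapping.
    static final String RAW_HTML_KEY = "parse.rawhtml";

    // Decodes the raw page bytes and stores the markup, tags included, under RAW_HTML_KEY.
    // Assumption: the page is UTF-8; a real filter should use the detected charset.
    static void copyContent(byte[] content, Map<String, List<String>> metadata) {
        String html = new String(content, StandardCharsets.UTF_8);
        metadata.computeIfAbsent(RAW_HTML_KEY, k -> new ArrayList<>()).add(html);
    }

    public static void main(String[] args) {
        Map<String, List<String>> metadata = new HashMap<>();
        byte[] page = "<html><body><p>hi</p></body></html>".getBytes(StandardCharsets.UTF_8);
        copyContent(page, metadata);
        // Prints the stored markup with its tags intact.
        System.out.println(metadata.get(RAW_HTML_KEY).get(0));
    }
}
```

With a key like this, an indexer.md.mapping entry along the lines of parse.rawhtml=rawhtml (again, assumed names) would send the value to a rawhtml field in ES.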
Either way, you'd need to modify ES_IndexInit.sh so that the field in ES gets indexed and/or stored the way you want it.
Source: https://stackoverflow.com/questions/47626395/can-i-store-html-content-of-webpage-in-storm-crawler