Can I store the HTML content of a webpage in StormCrawler?

Submitted by 眉间皱痕 on 2020-01-13 06:57:09

Question


I am using storm-crawler-elastic. I can see the fetched URLs and their status. Changing the configuration in the ES_IndexInit.sh file gives me only url, title, host, and text. Can I also store the entire HTML content, with the HTML tags?


Answer 1:


The ES IndexerBolt receives the content of pages from the parsing step but does not do anything with it. One option would be to modify the code so that it pulls the content field from the incoming tuples and indexes it.

Alternatively, you could implement a custom ParseFilter which copies the content of the page into a metadata key/value pair, and then configure that key to be indexed via indexer.md.mapping in the configuration file.
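The second approach could be sketched roughly as follows. This is an illustrative filter, not part of StormCrawler itself: the class name `HtmlContentFilter` and the metadata key `html.content` are made up for this example, and it assumes the `ParseFilter` base class from the StormCrawler API.

```java
import java.nio.charset.StandardCharsets;

import org.w3c.dom.DocumentFragment;

import com.digitalpebble.stormcrawler.Metadata;
import com.digitalpebble.stormcrawler.parse.ParseFilter;
import com.digitalpebble.stormcrawler.parse.ParseResult;

// Hypothetical example: copies the raw fetched page into a metadata
// key so that the indexer can later map it to an ES field.
public class HtmlContentFilter extends ParseFilter {

    @Override
    public void filter(String url, byte[] content, DocumentFragment doc,
            ParseResult parse) {
        // Store the raw HTML (assuming UTF-8 here; real pages may
        // declare a different charset) under a custom metadata key.
        Metadata md = parse.get(url).getMetadata();
        md.setValue("html.content",
                new String(content, StandardCharsets.UTF_8));
    }
}
```

The filter would then be declared in parsefilters.json, and the metadata key mapped to an index field in the crawler configuration, along these lines (field name `content` is again just an example):

```yaml
indexer.md.mapping:
  - parse.title=title
  - html.content=content
```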

Either way, you'd need to modify ES_IndexInit.sh so that the field in ES gets indexed and/or stored the way you want it.
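For instance, the index mapping created by ES_IndexInit.sh could declare the extra field. A minimal sketch, assuming a field named `content` holding the raw HTML, stored but not analysed for search:

```json
"content": {
  "type": "text",
  "index": false,
  "store": true
}
```

Setting `"index": false` keeps the raw HTML out of the inverted index while still letting you retrieve it; drop that setting if you also want to search the markup.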




Source: https://stackoverflow.com/questions/47626395/can-i-store-html-content-of-webpage-in-storm-crawler
