Explicit special characters from crawling

Submitted by 。_饼干妹妹 on 2020-01-06 12:25:33

Question


I am working with StormCrawler 1.13 and Elasticsearch 6.5.2. How can I stop the crawler from indexing special characters such as � (the Unicode replacement character) and • (the bullet)?


Answer 1:


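An easy way to do this is to write a ParseFilter whose filter method rewrites the extracted text, along these lines:

        ParseData pd = parse.get(URL);
        String text = pd.getText();
        if (text != null) {
            // remove unwanted characters, e.g. U+FFFD (replacement character) and U+2022 (bullet)
            text = text.replace("\uFFFD", "").replace("\u2022", "");
            pd.setText(text);
        }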

The filter is called on documents parsed by either JSoup or Tika. Have a look at the parse filters in the StormCrawler repository for examples.
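
The character removal itself can be kept in a small helper that is testable without a running topology. The following is a minimal sketch, not part of StormCrawler: the class name SpecialCharCleaner, the method clean, and the choice of characters to strip (U+FFFD and U+2022) are assumptions for illustration, so adjust the pattern to whatever characters show up in your index.

```java
import java.util.regex.Pattern;

// Hypothetical helper (not a StormCrawler class): strips selected
// unwanted characters from text extracted by the parser.
final class SpecialCharCleaner {

    // Assumed set of characters to drop: U+FFFD (replacement character,
    // usually a sign of a decoding problem) and U+2022 (bullet).
    private static final Pattern UNWANTED = Pattern.compile("[\\uFFFD\\u2022]");

    // Runs of two or more spaces, which tend to be left behind after removal.
    private static final Pattern MULTI_SPACE = Pattern.compile(" {2,}");

    private SpecialCharCleaner() {
    }

    // Returns the text without the unwanted characters, with repeated
    // spaces collapsed and leading/trailing whitespace trimmed.
    // A null input is returned unchanged.
    static String clean(String text) {
        if (text == null) {
            return null;
        }
        String stripped = UNWANTED.matcher(text).replaceAll("");
        return MULTI_SPACE.matcher(stripped).replaceAll(" ").trim();
    }
}
```

Inside the ParseFilter, the call would then be pd.setText(SpecialCharCleaner.clean(pd.getText())). Note that a large number of U+FFFD characters often points to a wrong charset being used when decoding the page, so stripping them hides the symptom rather than fixing the cause.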



Source: https://stackoverflow.com/questions/54096045/explicit-special-characters-from-crawling
