Question
I am working with StormCrawler 1.13 and Elasticsearch 6.5.2. How can I stop the crawler from crawling/indexing special characters such as � � � � � ��� �� � •
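Answer 1:
An easy way to do this is to write a ParseFilter along these lines:
// inside the filter's filter(...) method, where URL and parse are its arguments
ParseData pd = parse.get(URL);
String text = pd.getText();
// remove the unwanted characters, e.g. U+FFFD (replacement character) and U+2022 (bullet)
text = text.replace("\uFFFD", "").replace("\u2022", "");
pd.setText(text);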
This will get called on documents parsed by JSoup or Tika. Have a look at the parse filters in the repository for examples.
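The "remove chars" step can be sketched as a small standalone helper. This is only an illustration, not part of StormCrawler: the class name TextCleaner and the method clean are hypothetical, and the set of characters to strip (U+FFFD and U+2022) is an assumption based on the characters shown in the question. Adjust the pattern to whatever you actually want to drop.

```java
import java.util.regex.Pattern;

// Hypothetical helper (not a StormCrawler class): strips unwanted characters from extracted text.
public final class TextCleaner {

    // Assumed set of characters to remove: U+FFFD (replacement character) and U+2022 (bullet).
    private static final Pattern UNWANTED = Pattern.compile("[\\uFFFD\\u2022]");

    // Runs of spaces/tabs left behind once the unwanted characters are gone.
    private static final Pattern EXTRA_SPACES = Pattern.compile("[ \\t]{2,}");

    private TextCleaner() {
    }

    public static String clean(String text) {
        if (text == null) {
            return null; // nothing to clean
        }
        String stripped = UNWANTED.matcher(text).replaceAll("");
        // Collapse the gaps created by the removal and trim the ends.
        return EXTRA_SPACES.matcher(stripped).replaceAll(" ").trim();
    }
}
```

Inside the ParseFilter, the call would then be something like pd.setText(TextCleaner.clean(pd.getText())). Legitimate non-ASCII text such as accented letters is left untouched, since only the listed code points are removed.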
Source: https://stackoverflow.com/questions/54096045/explicit-special-characters-from-crawling