Apache Lucene doesn't filter stop words despite the usage of StopAnalyzer and StopFilter

断了今生、忘了曾经 posted on 2019-12-21 21:34:10

Question


I have a keyword-extraction module based on Apache Lucene 5.5 / 6.0. Everything is working fine except one thing: Lucene doesn't filter stop words.

I tried to enable stop word filtering with two different approaches.

Approach #1:

tokenStream = new StopFilter(new ASCIIFoldingFilter(new ClassicFilter(new LowerCaseFilter(stdToken))), EnglishAnalyzer.getDefaultStopSet());
tokenStream.reset();

Approach #2:

tokenStream = new StopFilter(new ClassicFilter(new LowerCaseFilter(stdToken)), StopAnalyzer.ENGLISH_STOP_WORDS_SET);
tokenStream.reset();

The full code is available here:
https://stackoverflow.com/a/36237769/462347

My questions:

  1. Why doesn't Lucene filter the stop words?
  2. How can I enable stop word filtering in Lucene 5.5 / 6.0?

Answer 1:


I just tested both approach 1 and approach 2, and both filter out stop words just fine. Here is how I tested:

public static void main(String[] args) throws IOException, ParseException, org.apache.lucene.queryparser.surround.parser.ParseException 
{
     StandardTokenizer stdToken = new StandardTokenizer();
     stdToken.setReader(new StringReader("Some stuff that is in need of analysis"));
     TokenStream tokenStream;

     //Your code starts here
     tokenStream = new StopFilter(new ASCIIFoldingFilter(new ClassicFilter(new LowerCaseFilter(stdToken))), EnglishAnalyzer.getDefaultStopSet());
     tokenStream.reset();
     //And ends here

     CharTermAttribute token = tokenStream.getAttribute(CharTermAttribute.class);
     while (tokenStream.incrementToken()) {
         System.out.println(token.toString());
     }
     tokenStream.close();
}

Results:

some
stuff
need
analysis

This has eliminated the four stop words ("that", "is", "in", "of") from my sample.
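Conceptually, all StopFilter does is drop every token whose term appears in the stop set. A minimal stdlib-only sketch of that behavior (a hypothetical stand-in, not Lucene's actual implementation, which works incrementally on a TokenStream):

```java
import java.util.*;
import java.util.stream.*;

public class StopFilterSketch {
    // Hypothetical helper mimicking StopFilter: lower-cases each token
    // (as LowerCaseFilter would) and drops those found in the stop set.
    static List<String> filterStopWords(List<String> tokens, Set<String> stopWords) {
        return tokens.stream()
                .map(t -> t.toLowerCase(Locale.ROOT))
                .filter(t -> !stopWords.contains(t))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // A tiny subset of the English default stop set, for illustration only.
        Set<String> stopWords = new HashSet<>(Arrays.asList("that", "is", "in", "of"));
        List<String> tokens = Arrays.asList(
                "Some", "stuff", "that", "is", "in", "need", "of", "analysis");
        System.out.println(filterStopWords(tokens, stopWords));
        // prints: [some, stuff, need, analysis]
    }
}
```

This reproduces the output above, which is why both approaches in the question already behave correctly.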




Answer 2:


The problem was that I expected Lucene's default stop words list to be much broader.

Here is code that first tries to load a customized stop words list and, if that fails, falls back to the standard one:

CharArraySet stopWordsSet;

try {
    // use customized stop words list
    String stopWordsDictionary = FileUtils.readFileToString(new File(%PATH_TO_FILE%));
    stopWordsSet = WordlistLoader.getWordSet(new StringReader(stopWordsDictionary));
} catch (FileNotFoundException e) {
    // use standard stop words list
    stopWordsSet = CharArraySet.copy(StandardAnalyzer.STOP_WORDS_SET);
}

tokenStream = new StopFilter(new ASCIIFoldingFilter(new ClassicFilter(new LowerCaseFilter(stdToken))), stopWordsSet);
tokenStream.reset();
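The same try-then-fall-back logic can be sketched without Lucene, using only the standard library (the file path and default words below are illustrative assumptions, not Lucene's defaults):

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;

public class StopWordsLoader {
    // Plain-Java sketch of the fallback above: try to read a custom word
    // list (one word per line); on any I/O failure, use the defaults.
    static Set<String> loadStopWords(Path customList, Set<String> defaults) {
        try {
            Set<String> words = new HashSet<>();
            for (String line : Files.readAllLines(customList)) {
                String w = line.trim().toLowerCase(Locale.ROOT);
                if (!w.isEmpty()) words.add(w);
            }
            return words;
        } catch (IOException e) {
            return new HashSet<>(defaults);
        }
    }

    public static void main(String[] args) {
        Set<String> defaults = new HashSet<>(Arrays.asList("a", "an", "the"));
        // Hypothetical nonexistent path: triggers the fallback branch.
        Set<String> words = loadStopWords(Paths.get("no-such-stopwords.txt"), defaults);
        System.out.println(words.equals(defaults));
        // prints: true
    }
}
```

In the Lucene version, note that `FileUtils.readFileToString` can throw IOExceptions other than FileNotFoundException, so catching the broader IOException may make the fallback more robust.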


Source: https://stackoverflow.com/questions/36241051/apache-lucene-doesnt-filter-stop-words-despite-the-usage-of-stopanalyzer-and-st
