Apache Lucene doesn't filter stop words despite the usage of StopAnalyzer and StopFilter

断了今生、忘了曾经 posted on 2019-12-21 21:34:10

Question


I have a keyword-extraction module based on Apache Lucene 5.5 / 6.0. Everything is working fine except one thing: Lucene doesn't filter stop words.

I tried to enable stop word filtering with two different approaches.

Approach #1:

tokenStream = new StopFilter(new ASCIIFoldingFilter(new ClassicFilter(new LowerCaseFilter(stdToken))), EnglishAnalyzer.getDefaultStopSet());
tokenStream.reset();

Approach #2:

tokenStream = new StopFilter(new ClassicFilter(new LowerCaseFilter(stdToken)), StopAnalyzer.ENGLISH_STOP_WORDS_SET);
tokenStream.reset();

The full code is available here:
https://stackoverflow.com/a/36237769/462347

My questions:

  1. Why doesn't Lucene filter the stop words?
  2. How can I enable stop word filtering in Lucene 5.5 / 6.0?

Answer 1:


I just tested both approach 1 and approach 2, and both filter out stop words just fine. Here is how I tested:

public static void main(String[] args) throws IOException, ParseException, org.apache.lucene.queryparser.surround.parser.ParseException 
{
     StandardTokenizer stdToken = new StandardTokenizer();
     stdToken.setReader(new StringReader("Some stuff that is in need of analysis"));
     TokenStream tokenStream;

     //Your code starts here
     tokenStream = new StopFilter(new ASCIIFoldingFilter(new ClassicFilter(new LowerCaseFilter(stdToken))), EnglishAnalyzer.getDefaultStopSet());
     tokenStream.reset();
     //And ends here

     CharTermAttribute token = tokenStream.getAttribute(CharTermAttribute.class);
     while (tokenStream.incrementToken()) {
         System.out.println(token.toString());
     }
     tokenStream.close();
}

Results:

some
stuff
need
analysis

This has eliminated the four stop words ("that", "is", "in", "of") from my sample.
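Conceptually, all StopFilter does is drop every token whose term appears in the stop set. A minimal stdlib-only sketch of that behavior (a hypothetical stand-in, not Lucene's actual implementation, which works incrementally on a TokenStream):

```java
import java.util.*;
import java.util.stream.*;

public class StopFilterSketch {
    // Hypothetical helper mimicking StopFilter: lower-cases each token
    // (as LowerCaseFilter would) and drops those found in the stop set.
    static List<String> filterStopWords(List<String> tokens, Set<String> stopWords) {
        return tokens.stream()
                .map(t -> t.toLowerCase(Locale.ROOT))
                .filter(t -> !stopWords.contains(t))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // A tiny subset of the English default stop set, for illustration only.
        Set<String> stopWords = new HashSet<>(Arrays.asList("that", "is", "in", "of"));
        List<String> tokens = Arrays.asList(
                "Some", "stuff", "that", "is", "in", "need", "of", "analysis");
        System.out.println(filterStopWords(tokens, stopWords));
        // prints: [some, stuff, need, analysis]
    }
}
```

This reproduces the output above, which is why both approaches in the question already behave correctly.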




Answer 2:


The problem was that I expected Lucene's default stop words list to be much broader.

Here is code that first tries to load a customized stop words list and, if that fails, falls back to the standard one:

CharArraySet stopWordsSet;

try {
    // use customized stop words list
    String stopWordsDictionary = FileUtils.readFileToString(new File(%PATH_TO_FILE%));
    stopWordsSet = WordlistLoader.getWordSet(new StringReader(stopWordsDictionary));
} catch (FileNotFoundException e) {
    // use standard stop words list
    stopWordsSet = CharArraySet.copy(StandardAnalyzer.STOP_WORDS_SET);
}

tokenStream = new StopFilter(new ASCIIFoldingFilter(new ClassicFilter(new LowerCaseFilter(stdToken))), stopWordsSet);
tokenStream.reset();
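The same try-then-fall-back logic can be sketched without Lucene, using only the standard library (the file path and default words below are illustrative assumptions, not Lucene's defaults):

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;

public class StopWordsLoader {
    // Plain-Java sketch of the fallback above: try to read a custom word
    // list (one word per line); on any I/O failure, use the defaults.
    static Set<String> loadStopWords(Path customList, Set<String> defaults) {
        try {
            Set<String> words = new HashSet<>();
            for (String line : Files.readAllLines(customList)) {
                String w = line.trim().toLowerCase(Locale.ROOT);
                if (!w.isEmpty()) words.add(w);
            }
            return words;
        } catch (IOException e) {
            return new HashSet<>(defaults);
        }
    }

    public static void main(String[] args) {
        Set<String> defaults = new HashSet<>(Arrays.asList("a", "an", "the"));
        // Hypothetical nonexistent path: triggers the fallback branch.
        Set<String> words = loadStopWords(Paths.get("no-such-stopwords.txt"), defaults);
        System.out.println(words.equals(defaults));
        // prints: true
    }
}
```

In the Lucene version, note that `FileUtils.readFileToString` can throw IOExceptions other than FileNotFoundException, so catching the broader IOException may make the fallback more robust.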


Source: https://stackoverflow.com/questions/36241051/apache-lucene-doesnt-filter-stop-words-despite-the-usage-of-stopanalyzer-and-st
