Apache Lucene doesn't filter stop words despite using StopAnalyzer and StopFilter


Just tested both approach 1 and approach 2, and they both seem to filter out stop words just fine. Here is how I tested it:

// Imports assume the Lucene 5.x / 6.x analyzers-common layout; package names differ in other versions
import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter;
import org.apache.lucene.analysis.standard.ClassicFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class StopWordDemo
{
    public static void main(String[] args) throws IOException
    {
        StandardTokenizer stdToken = new StandardTokenizer();
        stdToken.setReader(new StringReader("Some stuff that is in need of analysis"));
        TokenStream tokenStream;

        //Your code starts here
        tokenStream = new StopFilter(new ASCIIFoldingFilter(new ClassicFilter(new LowerCaseFilter(stdToken))), EnglishAnalyzer.getDefaultStopSet());
        tokenStream.reset();
        //And ends here

        // Print every token that survives the filter chain
        CharTermAttribute token = tokenStream.getAttribute(CharTermAttribute.class);
        while (tokenStream.incrementToken()) {
            System.out.println(token.toString());
        }
        tokenStream.close();
    }
}

Results:

some
stuff
need
analysis

Which has eliminated the four stop words in my sample ("that", "is", "in", and "of").

The problem was that I expected Lucene's default stop word list to be much broader; in fact, it contains only 33 very common English terms.
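
If you want to see exactly what the default list contains, you can iterate the CharArraySet directly. A minimal sketch, assuming the same Lucene version as above (the set's iterator yields char[] entries, and EnglishAnalyzer.getDefaultStopSet() is backed by the same 33-word English list as StandardAnalyzer.STOP_WORDS_SET):

// Dump the default English stop set, one word per line
for (Object entry : EnglishAnalyzer.getDefaultStopSet()) {
    System.out.println(new String((char[]) entry));
}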

Here is the code that tries to load a customized stop word list first and falls back to the standard one if the file is not found:

CharArraySet stopWordsSet;

try {
    // Use a customized stop words list (FileUtils comes from Apache Commons IO)
    String stopWordsDictionary = FileUtils.readFileToString(new File(%PATH_TO_FILE%));
    stopWordsSet = WordlistLoader.getWordSet(new StringReader(stopWordsDictionary));
} catch (FileNotFoundException e) {
    // Fall back to the standard stop words list
    stopWordsSet = CharArraySet.copy(StandardAnalyzer.STOP_WORDS_SET);
}

tokenStream = new StopFilter(new ASCIIFoldingFilter(new ClassicFilter(new LowerCaseFilter(stdToken))), stopWordsSet);
tokenStream.reset();
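
For reference, WordlistLoader.getWordSet expects a plain word list with one entry per line, so the customized file behind %PATH_TO_FILE% could look like this (hypothetical contents):

a
an
and
because
however
nevertheless
the

Note that StopFilter runs after LowerCaseFilter in this chain, so the entries in the file should be lowercase or they won't match.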