Using default and custom stop words with Apache's Lucene (weird output)

我寻月下人不归 2021-01-23 00:24

I'm removing stop words from a String, using Apache's Lucene (8.6.3) and the following Java 8 code:

private static final String CONTENTS = "contents";         


        
1 Answer
    野趣味 (original poster)
    2021-01-23 00:53

    I will tackle this in two parts:

    • stop-words
    • preserving original case

    Handling the Combined Stop Words

    To combine Lucene's English stop word list with your own custom list, you can create a merged set as follows:

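    import java.util.Arrays;
    import java.util.List;
    
    import org.apache.lucene.analysis.CharArraySet;
    import org.apache.lucene.analysis.en.EnglishAnalyzer;
    
    ...
    
    // Custom stop words; "true" makes the set ignore case.
    final List<String> stopWords = Arrays.asList("short", "test");
    final CharArraySet stopSet = new CharArraySet(stopWords, true);
    
    // Add Lucene's bundled English stop words to the custom set.
    CharArraySet enStopSet = EnglishAnalyzer.ENGLISH_STOP_WORDS_SET;
    stopSet.addAll(enStopSet);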
    

    The above code simply takes the English stop words bundled with Lucene and merges them with your list.

    Passing stopSet to the StandardAnalyzer from your question then gives the following output:

    [bla]
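
    For reference, here is a minimal sketch of how the merged set could be passed to a StandardAnalyzer and the resulting tokens collected. The class name MergedStopWordsDemo, its helper methods and the input sentence are assumptions made for illustration (the code in the question is cut off), and the sketch assumes lucene-core and lucene-analyzers-common 8.x are on the classpath:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Hypothetical demo class, not part of the original question.
public class MergedStopWordsDemo {

    // Builds the merged stop set: custom words plus Lucene's English defaults.
    static CharArraySet mergedStopSet() {
        List<String> stopWords = Arrays.asList("short", "test");
        CharArraySet stopSet = new CharArraySet(stopWords, true);
        stopSet.addAll(EnglishAnalyzer.ENGLISH_STOP_WORDS_SET);
        return stopSet;
    }

    // Runs the text through a StandardAnalyzer and returns the surviving tokens.
    static List<String> removeStopWords(String text) {
        List<String> tokens = new ArrayList<>();
        try (Analyzer analyzer = new StandardAnalyzer(mergedStopSet());
             TokenStream stream = analyzer.tokenStream("contents", text)) {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                tokens.add(term.toString());
            }
            stream.end();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Assumed sample input; the StandardAnalyzer lower-cases every token.
        System.out.println(removeStopWords("This is a short test Bla")); // prints [bla]
    }
}
```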
    

    Handling Word Case

    This is a bit more involved. As you have noticed, the StandardAnalyzer includes a step in which all words are converted to lower case - so we can't use that.

    Also, if you want to maintain your own custom list of stop words, and if that list is of any size, I would recommend storing it in its own text file, rather than embedding the list into your code.

    So, let's assume you have a file called stopwords.txt. In this file, there will be one word per line - and the file will already contain the merged list of your custom stop words and the official list of English stop words.

    You will need to prepare this file yourself, by hand (that is, the merging code in part 1 of this answer does not apply here).

    My test file is just this:

    short
    this
    is
    a
    test
    the
    him
    it
    

    I also prefer to use the CustomAnalyzer for something like this, as it lets me build an analyzer very simply.

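    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.custom.CustomAnalyzer;
    
    ...
    
    // Building the analyzer can throw IOException (for example, if stopwords.txt is not found).
    Analyzer analyzer = CustomAnalyzer.builder()
            .withTokenizer("icu")
            .addTokenFilter("stop",
                    "ignoreCase", "true",
                    "words", "stopwords.txt",
                    "format", "wordset")
            .build();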
    

    This does the following:

    1. It uses the "icu" tokenizer (org.apache.lucene.analysis.icu.segmentation.ICUTokenizer, from the separate lucene-analyzers-icu module), which takes care of tokenizing on Unicode whitespace and handling punctuation.

    2. It applies the stop word list. Note the use of true for the ignoreCase attribute, and the reference to the stop word file. The format of wordset means "one word per line" (other formats are also available).

    The key here is that there is nothing in the above chain which changes word case.
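
    To make that point concrete, here is a plain-Java illustration of the same idea: stop words are matched case-insensitively, but the tokens that survive are returned exactly as written. This is not Lucene's implementation. The class name CasePreservingStopFilter, the whitespace split (a crude stand-in for the ICU tokenizer) and the input sentence are all assumptions made for this sketch:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration class; it does not use Lucene at all.
public class CasePreservingStopFilter {

    // Drops stop words regardless of case; kept tokens are left untouched.
    static List<String> filter(String text, List<String> stopWords) {
        Set<String> lowered = new HashSet<>();
        for (String word : stopWords) {
            lowered.add(word.toLowerCase(Locale.ROOT));
        }

        List<String> kept = new ArrayList<>();
        // Naive whitespace split, standing in for a real tokenizer.
        for (String token : text.trim().split("\\s+")) {
            // Lower-case only for the lookup, never for the output.
            if (!token.isEmpty() && !lowered.contains(token.toLowerCase(Locale.ROOT))) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Same words as the sample stopwords.txt above.
        List<String> stopWords =
                Arrays.asList("short", "this", "is", "a", "test", "the", "him", "it");
        System.out.println(filter("This is a short test Bla", stopWords)); // prints [Bla]
    }
}
```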

    So, now, using this new analyzer, the output is as follows:

    [Bla]
    

    Final Notes

    Where do you put the stop list file? By default, Lucene expects to find it on the classpath of your application. So, for example, you can put it in the default package.

    But remember that the file needs to be handled by your build process, so that it ends up alongside the application's class files (not left behind with the source code).
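
    If the analyzer cannot find the file, a quick way to check whether it actually reached the classpath is to ask the class loader for it. The sketch below is only a debugging aid: the class name StopwordsOnClasspath is hypothetical, and it assumes the analyzer resolves resources through the same class loader as your application classes:

```java
// Hypothetical helper for checking that a resource is visible on the classpath.
public class StopwordsOnClasspath {

    // Returns true if the class loader can locate the named resource.
    static boolean isOnClasspath(String resourceName) {
        return StopwordsOnClasspath.class.getClassLoader().getResource(resourceName) != null;
    }

    public static void main(String[] args) {
        // Expect "true" once the build copies stopwords.txt next to the class files.
        System.out.println(isOnClasspath("stopwords.txt"));
    }
}
```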

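    I mostly use Maven - and therefore I have this in my POM to ensure the ".txt" file gets deployed as needed:

    <build>
        <resources>
            <resource>
                <directory>src/main/java</directory>
                <excludes>
                    <exclude>**/*.java</exclude>
                </excludes>
            </resource>
        </resources>
    </build>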
    

    This tells Maven to copy files (except Java source files) to the build target - thus ensuring the text file gets copied.

    Final note - I did not investigate why you were getting that truncated [thi] token. If I get a chance I will take a closer look.


    Follow-Up Questions

    After combining I have to use the StandardAnalyzer, right?

    Yes, that is correct. The notes I provided in part 1 of the answer relate directly to the code in your question, and to the StandardAnalyzer you use.

    I want to keep the stop word file on a specific non-imported path - how to do that?

    You can tell the CustomAnalyzer to look in a "resources" directory for the stop-words file. That directory can be anywhere on the file system (for easy maintenance, as you noted):

    import java.nio.file.Path;
    import java.nio.file.Paths;
    
    ...
    
    Path resources = Paths.get("/path/to/resources/directory");
    
    Analyzer analyzer = CustomAnalyzer.builder(resources)
            .withTokenizer("icu")
            .addTokenFilter("stop",
                    "ignoreCase", "true",
                    "words", "stopwords.txt",
                    "format", "wordset")
            .build();
    

    Instead of using .builder() we now use .builder(resources), so that stopwords.txt is resolved relative to that directory.
