Using default and custom stop words with Apache's Lucene (weird output)

Posted by 对着背影说爱祢 on 2020-12-26 10:21:37

Question


I'm removing stop words from a String, using Apache's Lucene (8.6.3) and the following Java 8 code:

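import java.io.IOException;
import java.io.StringReader;
import java.util.Arrays;
import java.util.List;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

...

private static final String CONTENTS = "contents";
final String text = "This is a short test! Bla!";
final List<String> stopWords = Arrays.asList("short","test");
final CharArraySet stopSet = new CharArraySet(stopWords, true); // true = ignore case

try {
    Analyzer analyzer = new StandardAnalyzer(stopSet);
    TokenStream tokenStream = analyzer.tokenStream(CONTENTS, new StringReader(text));
    CharTermAttribute term = tokenStream.addAttribute(CharTermAttribute.class);
    tokenStream.reset();

    while(tokenStream.incrementToken()) {
        System.out.print("[" + term.toString() + "] ");
    }

    tokenStream.end(); // signal end-of-stream before closing
    tokenStream.close();
    analyzer.close();
} catch (IOException e) {
    System.out.println("Exception:\n");
    e.printStackTrace();
}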

This outputs the desired result:

[this] [is] [a] [bla]

Now I want to use both the default English stop set, which should also remove "this", "is" and "a" (according to GitHub), and the custom stop set above (the actual one I'm going to use is a lot longer), so I tried this:

Analyzer analyzer = new EnglishAnalyzer(stopSet);

The output is:

[thi] [is] [a] [bla]

Yes, the "s" in "this" is missing. What's causing this? It also didn't use the default stop set.

The following changes remove both the default and the custom stop words:

Analyzer analyzer = new EnglishAnalyzer();
TokenStream tokenStream = analyzer.tokenStream(CONTENTS, new StringReader(text));
tokenStream = new StopFilter(tokenStream, stopSet);

Question: What is the "right" way to do this? Is using the tokenStream within itself (see code above) going to cause problems?

Bonus question: How do I output the remaining words with the right upper/lower case, i.e. as they appear in the original text?


Answer 1:


I will tackle this in two parts:

  • stop-words
  • preserving original case

Handling the Combined Stop Words

To handle the combination of Lucene's English stop word list, plus your own custom list, you can create a merged list as follows:

import org.apache.lucene.analysis.en.EnglishAnalyzer;

...

final List<String> stopWords = Arrays.asList("short", "test");
final CharArraySet stopSet = new CharArraySet(stopWords, true);

CharArraySet enStopSet = EnglishAnalyzer.ENGLISH_STOP_WORDS_SET;
stopSet.addAll(enStopSet);

The above code simply takes the English stop words bundled with Lucene and merges them with your list.

That gives the following output:

[bla]
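
For reference, here is a minimal end-to-end sketch of the merge above, plugged into the StandardAnalyzer from the question. It assumes Lucene 8.x with the lucene-analyzers-common module on the classpath (for EnglishAnalyzer). The class name MergedStopWordsDemo and the helper methods analyze and demo are made up for this illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Hypothetical demo class; not part of Lucene.
public class MergedStopWordsDemo {

    // Runs the analyzer over the text and collects the emitted terms.
    static List<String> analyze(Analyzer analyzer, String text) {
        List<String> terms = new ArrayList<>();
        try (TokenStream ts = analyzer.tokenStream("contents", text)) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                terms.add(term.toString());
            }
            ts.end();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return terms;
    }

    // Builds the merged stop set and analyzes the sample sentence.
    static List<String> demo() {
        // Custom words first (true = ignore case), then Lucene's English defaults.
        CharArraySet stopSet = new CharArraySet(Arrays.asList("short", "test"), true);
        stopSet.addAll(EnglishAnalyzer.ENGLISH_STOP_WORDS_SET);
        try (Analyzer analyzer = new StandardAnalyzer(stopSet)) {
            return analyze(analyzer, "This is a short test! Bla!");
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [bla]
    }
}
```

The merged set is a fresh CharArraySet, so adding to it does not touch Lucene's shared default set.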

Handling Word Case

This is a bit more involved. As you have noticed, the StandardAnalyzer includes a step in which all words are converted to lower case - so we can't use that.

Also, if you want to maintain your own custom list of stop words, and if that list is of any size, I would recommend storing it in its own text file, rather than embedding the list into your code.

So, let's assume you have a file called stopwords.txt. In this file, there will be one word per line - and the file will already contain the merged list of your custom stop words and the official list of English stop words.

You will need to prepare this file yourself (i.e. ignore the notes in part 1 of this answer).

My test file is just this:

short
this
is
a
test
the
him
it
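
If you would rather generate such a file than type it by hand, something along these lines might work. This is only a sketch using plain Java (no Lucene calls). The class StopWordFileWriter and its methods are hypothetical, and the "defaults" list in main is a hand-typed sample, not Lucene's full English list.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.Locale;
import java.util.TreeSet;

// Hypothetical helper; not part of Lucene.
public class StopWordFileWriter {

    // Merges both collections: trims and lower-cases each word,
    // drops blanks and duplicates, and returns the words sorted.
    static List<String> merge(Collection<String> defaults, Collection<String> custom) {
        TreeSet<String> merged = new TreeSet<>();
        addNormalized(merged, defaults);
        addNormalized(merged, custom);
        return new ArrayList<>(merged);
    }

    private static void addNormalized(TreeSet<String> target, Collection<String> words) {
        for (String word : words) {
            String normalized = word.trim().toLowerCase(Locale.ROOT);
            if (!normalized.isEmpty()) {
                target.add(normalized);
            }
        }
    }

    // Writes one word per line, matching the "wordset" layout described in this answer.
    static void write(Path target, List<String> words) {
        try {
            Files.write(target, words, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Sample input only: a few default words plus the custom ones.
        List<String> defaults = Arrays.asList("this", "is", "a", "the", "it");
        List<String> custom = Arrays.asList("short", "test", "him");
        write(Paths.get("stopwords.txt"), merge(defaults, custom));
    }
}
```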

I also prefer to use the CustomAnalyzer for something like this, as it lets me build an analyzer very simply.

import org.apache.lucene.analysis.custom.CustomAnalyzer;

...

Analyzer analyzer = CustomAnalyzer.builder()
        .withTokenizer("icu")
        .addTokenFilter("stop",
                "ignoreCase", "true",
                "words", "stopwords.txt",
                "format", "wordset")
        .build();

This does the following:

  1. It uses the "icu" tokenizer (org.apache.lucene.analysis.icu.segmentation.ICUTokenizer, from the separate lucene-analyzers-icu module), which takes care of tokenizing on Unicode whitespace and handling punctuation.

  2. It applies the stop-word list. Note the use of true for the ignoreCase attribute, and the reference to the stop-word file. The wordset format means "one word per line" (other formats are also available).

The key here is that there is nothing in the above chain which changes word case.

So, now, using this new analyzer, the output is as follows:

[Bla]
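
The same idea (case-insensitive stop-word matching with no lower-casing step) can also be sketched without a stop-word file or the ICU module, by wiring a tokenizer and a StopFilter together in a small Analyzer subclass. Note that this swaps in Lucene's StandardTokenizer for the "icu" tokenizer used above, and hard-codes the test word list. The class CasePreservingStopDemo and its methods are hypothetical names for this illustration, assuming Lucene 8.x.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Hypothetical demo class; not part of Lucene.
public class CasePreservingStopDemo {

    // Tokenizer + stop filter only: nothing in this chain changes letter case.
    static Analyzer casePreservingAnalyzer(CharArraySet stopSet) {
        return new Analyzer() {
            @Override
            protected TokenStreamComponents createComponents(String fieldName) {
                Tokenizer source = new StandardTokenizer();
                TokenStream filtered = new StopFilter(source, stopSet);
                return new TokenStreamComponents(source, filtered);
            }
        };
    }

    static List<String> demo() {
        // Same words as the test file above; true = match regardless of case.
        CharArraySet stopSet = new CharArraySet(
                Arrays.asList("short", "this", "is", "a", "test", "the", "him", "it"), true);
        List<String> terms = new ArrayList<>();
        try (Analyzer analyzer = casePreservingAnalyzer(stopSet);
             TokenStream ts = analyzer.tokenStream("contents", "This is a short test! Bla!")) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                terms.add(term.toString());
            }
            ts.end();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [Bla]
    }
}
```

"This" is still removed even though the list only contains "this", because the stop set itself ignores case. The surviving token keeps its original spelling.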

Final Notes

Where do you put the stop list file? By default, Lucene expects to find it on the classpath of your application. So, for example, you can put it in the default package.

But remember that the file needs to be handled by your build process, so that it ends up alongside the application's class files (not left behind with the source code).

I mostly use Maven - and therefore I have this in my POM to ensure the ".txt" file gets deployed as needed:

    <build>
        <resources>
            <resource>
                <directory>src/main/java</directory>
                <excludes>
                    <exclude>**/*.java</exclude>
                </excludes>
            </resource>
        </resources>
    </build>

This tells Maven to copy files (except Java source files) to the build target - thus ensuring the text file gets copied.
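
If the analyzer cannot find the file at runtime, a quick way to check whether it actually reached the classpath is a plain-Java lookup like the one below. This is a rough sketch: the class StopWordsClasspathCheck is hypothetical, and it queries its own class loader, which is only an approximation of the lookup Lucene performs internally.

```java
import java.net.URL;

// Hypothetical diagnostic helper; not part of Lucene.
public class StopWordsClasspathCheck {

    // Returns true if the named resource is visible to this class's class loader.
    // The name is relative to the classpath root, e.g. "stopwords.txt".
    static boolean isOnClasspath(String resourceName) {
        ClassLoader loader = StopWordsClasspathCheck.class.getClassLoader();
        URL url = (loader != null)
                ? loader.getResource(resourceName)
                : ClassLoader.getSystemResource(resourceName);
        return url != null;
    }

    public static void main(String[] args) {
        String name = "stopwords.txt";
        System.out.println(name + (isOnClasspath(name)
                ? " found on the classpath"
                : " NOT found on the classpath"));
    }
}
```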

Final note - I did not investigate why you were getting that truncated [thi] token. If I get a chance I will take a closer look.
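
A likely explanation, offered here as an assumption since it was not investigated above: EnglishAnalyzer applies a Porter stemmer after stop-word removal, which would reduce "this" to "thi". In addition, the EnglishAnalyzer(stopSet) constructor uses the given set instead of the default English set, not in addition to it, so "is" and "a" survive. The sketch below reproduces the output from the question and shows what happens with a merged set. The class EnglishAnalyzerStemDemo and its methods are hypothetical, assuming Lucene 8.x with lucene-analyzers-common.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Hypothetical demo class; not part of Lucene.
public class EnglishAnalyzerStemDemo {

    static final String TEXT = "This is a short test! Bla!";

    // Runs the analyzer over the text and collects the emitted terms.
    static List<String> analyze(Analyzer analyzer, String text) {
        List<String> terms = new ArrayList<>();
        try (TokenStream ts = analyzer.tokenStream("contents", text)) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                terms.add(term.toString());
            }
            ts.end();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return terms;
    }

    // Custom set only: it replaces the defaults, and surviving tokens are stemmed.
    static List<String> customOnly() {
        CharArraySet custom = new CharArraySet(Arrays.asList("short", "test"), true);
        try (Analyzer analyzer = new EnglishAnalyzer(custom)) {
            return analyze(analyzer, TEXT);
        }
    }

    // Custom set merged with the defaults: stop words are dropped before stemming.
    static List<String> merged() {
        CharArraySet stopSet = new CharArraySet(Arrays.asList("short", "test"), true);
        stopSet.addAll(EnglishAnalyzer.ENGLISH_STOP_WORDS_SET);
        try (Analyzer analyzer = new EnglishAnalyzer(stopSet)) {
            return analyze(analyzer, TEXT);
        }
    }

    public static void main(String[] args) {
        System.out.println(customOnly()); // prints [thi, is, a, bla], as in the question
        System.out.println(merged());     // prints [bla]
    }
}
```

Even with a merged set, EnglishAnalyzer still stems whatever tokens remain, so StandardAnalyzer is the safer choice when stemming is not wanted.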


Follow-Up Questions

After combining I have to use the StandardAnalyzer, right?

Yes, that is correct. The notes I provided in part 1 of the answer relate directly to the code in your question, and to the StandardAnalyzer you use.

I want to keep the stop word file on a specific non-imported path - how to do that?

You can tell the CustomAnalyzer to look in a "resources" directory for the stop-words file. That directory can be anywhere on the file system (for easy maintenance, as you noted):

import java.nio.file.Path;
import java.nio.file.Paths;

...

Path resources = Paths.get("/path/to/resources/directory");

Analyzer analyzer = CustomAnalyzer.builder(resources)
        .withTokenizer("icu")
        .addTokenFilter("stop",
                "ignoreCase", "true",
                "words", "stopwords.txt",
                "format", "wordset")
        .build();

Instead of using .builder() we now use .builder(resources).



Source: https://stackoverflow.com/questions/64321901/using-default-and-custom-stop-words-with-apaches-lucene-weird-output
