I'm removing stop words from a String, using Apache's Lucene (8.6.3) and the following Java 8 code:
private static final String CONTENTS = "contents";
I will tackle this in two parts:
Handling the Combined Stop Words
To handle the combination of Lucene's English stop word list, plus your own custom list, you can create a merged list as follows:
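import java.util.Arrays;
import java.util.List;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
...
// Your custom stop words; "true" makes the set ignore case.
final List<String> stopWords = Arrays.asList("short", "test");
final CharArraySet stopSet = new CharArraySet(stopWords, true);
// Add Lucene's built-in English stop words to the same set.
CharArraySet enStopSet = EnglishAnalyzer.ENGLISH_STOP_WORDS_SET;
stopSet.addAll(enStopSet);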
The above code simply takes the English stop words bundled with Lucene and merges them with your list. You can then pass stopSet to the StandardAnalyzer constructor.
That gives the following output:
[bla]
Handling Word Case
This is a bit more involved. As you have noticed, the StandardAnalyzer includes a step in which all words are converted to lower case - so we can't use that.
Also, if you want to maintain your own custom list of stop words, and if that list is of any size, I would recommend storing it in its own text file, rather than embedding the list into your code.
So, let's assume you have a file called stopwords.txt. In this file, there will be one word per line - and the file will already contain the merged list of your custom stop words and the official list of English stop words.
You will need to prepare this file manually yourself (i.e. ignore the notes in part 1 of this answer).
My test file is just this:
short
this
is
a
test
the
him
it
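If you would rather generate the merged file than type it by hand, here is a minimal sketch using only the JDK. The class name StopWordFileBuilder and its methods are my own illustration, not part of Lucene, and it assumes you already have the English list available as plain strings. It lower-cases every word, which is harmless here because the filter is configured with ignoreCase set to true.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Locale;
import java.util.TreeSet;

// Hypothetical helper (not a Lucene class): builds the content of stopwords.txt.
public class StopWordFileBuilder {

    // Merge two word lists into one sorted, de-duplicated, lower-cased list.
    static List<String> merge(Collection<String> custom, Collection<String> english) {
        TreeSet<String> merged = new TreeSet<>();
        for (String word : custom) {
            addNormalized(merged, word);
        }
        for (String word : english) {
            addNormalized(merged, word);
        }
        return new ArrayList<>(merged);
    }

    // Trim the word, skip blanks, and store it in lower case.
    private static void addNormalized(TreeSet<String> target, String word) {
        String trimmed = word.trim();
        if (!trimmed.isEmpty()) {
            target.add(trimmed.toLowerCase(Locale.ROOT));
        }
    }

    // Write one word per line, which is the layout the "wordset" format expects.
    static void write(Path file, List<String> words) throws IOException {
        Files.write(file, words, StandardCharsets.UTF_8);
    }
}
```

You would call merge(...) with your two lists and then write(...) with the path of your stopwords.txt file.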
I also prefer to use the CustomAnalyzer for something like this, as it lets me build an analyzer very simply.
import org.apache.lucene.analysis.custom.CustomAnalyzer;
...
Analyzer analyzer = CustomAnalyzer.builder()
.withTokenizer("icu")
.addTokenFilter("stop",
"ignoreCase", "true",
"words", "stopwords.txt",
"format", "wordset")
.build();
This does the following:
It uses the "icu" tokenizer org.apache.lucene.analysis.icu.segmentation.ICUTokenizer (from the separate lucene-analyzers-icu module), which takes care of tokenizing on Unicode whitespace, and handling punctuation.
It applies the stopword list. Note the use of true for the ignoreCase attribute, and the reference to the stop-word file. The format of wordset means "one word per line" (there are other formats, also).
The key here is that there is nothing in the above chain which changes word case.
So, now, using this new analyzer, the output is as follows:
[Bla]
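To make the idea concrete without Lucene, here is a rough plain-JDK sketch of the same behavior: stop words are matched ignoring case, and the tokens that survive keep their original case. This is not how Lucene implements it. It swaps in java.text.BreakIterator for the ICU tokenizer and a case-insensitive TreeSet for the stop filter. The class name and the sample sentence are my own assumptions.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

// Illustration only: a JDK stand-in for "tokenize, then stop-filter ignoring case".
public class CasePreservingStopFilter {

    static List<String> filter(String text, Iterable<String> stopWords) {
        // A case-insensitive set, so "This" matches the stop word "this".
        Set<String> stopSet = new TreeSet<>(String.CASE_INSENSITIVE_ORDER);
        for (String word : stopWords) {
            stopSet.add(word);
        }

        List<String> kept = new ArrayList<>();
        BreakIterator words = BreakIterator.getWordInstance(Locale.ROOT);
        words.setText(text);
        int start = words.first();
        int end = words.next();
        while (end != BreakIterator.DONE) {
            String token = text.substring(start, end);
            // Skip whitespace and punctuation segments; never change the token's case.
            boolean isWord = Character.isLetterOrDigit(token.codePointAt(0));
            if (isWord && !stopSet.contains(token)) {
                kept.add(token);
            }
            start = end;
            end = words.next();
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> stops = Arrays.asList("short", "this", "is", "a", "test", "the", "him", "it");
        System.out.println(filter("This is a short test. Bla!", stops));
    }
}
```

With the test stop list shown earlier and the sample sentence above, only Bla survives, with its capital letter intact.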
Final Notes
Where do you put the stop list file? By default, Lucene expects to find it on the classpath of your application. So, for example, you can put it in the default package.
But remember that the file needs to be handled by your build process, so that it ends up alongside the application's class files (not left behind with the source code).
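I mostly use Maven - and therefore I have this in my POM (inside the build/resources section) to ensure the ".txt" file gets deployed as needed:
<resource>
  <directory>src/main/java</directory>
  <excludes><exclude>**/*.java</exclude></excludes>
</resource>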
This tells Maven to copy files (except Java source files) to the build target - thus ensuring the text file gets copied.
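If you are unsure whether the file really made it into the build output, a quick check like the following can help. It is only a sketch: the class name ClasspathCheck is my own, and it assumes the file is looked up by name from the root of the classpath, as described above.

```java
import java.net.URL;

// Hypothetical diagnostic helper, not part of Lucene.
public class ClasspathCheck {

    // Returns true if the named resource is visible at the root of the classpath.
    static boolean isOnClasspath(String resourceName) {
        URL url = ClasspathCheck.class.getClassLoader().getResource(resourceName);
        return url != null;
    }

    public static void main(String[] args) {
        System.out.println("stopwords.txt on classpath: " + isOnClasspath("stopwords.txt"));
    }
}
```

If this prints false after a build, the file was probably left behind with the source code.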
Final note - I did not investigate why you were getting that truncated [thi] token. If I get a chance I will take a closer look.
Follow-Up Questions
After combining I have to use the StandardAnalyzer, right?
Yes, that is correct. The notes I provided in part 1 of the answer relate directly to the code in your question, and to the StandardAnalyzer you use.
I want to keep the stop word file on a specific non-imported path - how to do that?
You can tell the CustomAnalyzer to look in a "resources" directory for the stop-words file. That directory can be anywhere on the file system (for easy maintenance, as you noted):
import java.nio.file.Path;
import java.nio.file.Paths;
...
Path resources = Paths.get("/path/to/resources/directory");
Analyzer analyzer = CustomAnalyzer.builder(resources)
.withTokenizer("icu")
.addTokenFilter("stop",
"ignoreCase", "true",
"words", "stopwords.txt",
"format", "wordset")
.build();
Instead of using .builder() we now use .builder(resources).
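Because the file now lives outside your build, it can be worth checking that it is actually there before you build the analyzer. Here is a small sketch using only the JDK. The class and method names are my own, and it assumes the "words" value is resolved relative to the resources directory.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical pre-flight check, not part of Lucene.
public class ResourceDirCheck {

    // True if fileName exists under the resources directory as a readable regular file.
    static boolean stopFileIsReadable(Path resources, String fileName) {
        Path candidate = resources.resolve(fileName);
        return Files.isRegularFile(candidate) && Files.isReadable(candidate);
    }

    public static void main(String[] args) {
        Path resources = Paths.get("/path/to/resources/directory");
        System.out.println(stopFileIsReadable(resources, "stopwords.txt"));
    }
}
```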