I'm using Lucene in my project and I need a custom Analyzer.
Here is the code so far:
    public class MyCommentAnalyzer extends Analyzer {
        @Override
        protected TokenStreamComponents createComponents( String fieldName, Reader reader ) {
            Tokenizer source = new StandardTokenizer( Version.LUCENE_48, reader );
            TokenStream filter = new StandardFilter( Version.LUCENE_48, source );
            filter = new StopFilter( Version.LUCENE_48, filter, StandardAnalyzer.STOP_WORDS_SET );
            return new TokenStreamComponents( source, filter );
        }
    }
I've built it, but now I'm stuck. I need a filter that keeps only certain words. It's the opposite of using stopwords: instead of removing the terms that appear in a word list, it should keep only the terms that appear in the word list, like a prebuilt dictionary. So StopFilter doesn't do the job, and none of the filters Lucene provides seems to fit. I think I need to write my own filter, but I don't know how.
Any suggestions?
You're right to look to StopFilter for a starting point, so read the source!
Most of StopFilter's source is convenience methods for building the stop set. You can safely ignore all of that (unless you want to keep it around for building your keep set).
Cut all that, and StopFilter boils down to:
    public final class StopFilter extends FilteringTokenFilter {
        private final CharArraySet stopWords;
        private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

        public StopFilter(Version matchVersion, TokenStream in, CharArraySet stopWords) {
            super(matchVersion, in);
            this.stopWords = stopWords;
        }

        @Override
        protected boolean accept() {
            return !stopWords.contains(termAtt.buffer(), 0, termAtt.length());
        }
    }
FilteringTokenFilter is a pretty simple class to extend. The key is just the accept method. It's called for the current term: if it returns true, the term is passed on to the output stream; if it returns false, the term is discarded.
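To make that contract concrete, here is a minimal plain-Java sketch with no Lucene dependency. The names AcceptSketchFilter, MinLengthSketchFilter, and AcceptSketchDemo are hypothetical, and the minimum-length rule is just an assumed example of a decision accept could make. It models only the keep-or-drop decision, not the attribute handling a real TokenFilter does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for FilteringTokenFilter; this is not a Lucene class.
abstract class AcceptSketchFilter {
    // Decide whether the current term survives: true keeps it, false drops it.
    protected abstract boolean accept(String term);

    // Walk the input once and emit only the terms that accept() approves.
    final List<String> filter(List<String> terms) {
        List<String> out = new ArrayList<>();
        for (String term : terms) {
            if (accept(term)) {
                out.add(term);
            }
        }
        return out;
    }
}

// Assumed example rule: keep terms that are at least minLength characters long.
final class MinLengthSketchFilter extends AcceptSketchFilter {
    private final int minLength;

    MinLengthSketchFilter(int minLength) {
        this.minLength = minLength;
    }

    @Override
    protected boolean accept(String term) {
        return term.length() >= minLength;
    }
}

final class AcceptSketchDemo {
    public static void main(String[] args) {
        List<String> terms = Arrays.asList("a", "custom", "to", "analyzer");
        // prints [custom, analyzer]
        System.out.println(new MinLengthSketchFilter(3).filter(terms));
    }
}
```

The subclass supplies nothing but the rule; the base class owns the loop. That split is why a new filter built this way needs so little code.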
So the only thing you really need to change in StopFilter is to delete a single character (the !), to make accept return the opposite of what it currently does. It wouldn't hurt to change a few names here and there as well:
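    // Package locations as of Lucene 4.8.
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.util.CharArraySet;
    import org.apache.lucene.analysis.util.FilteringTokenFilter;
    import org.apache.lucene.util.Version;

    public final class KeepOnlyFilter extends FilteringTokenFilter {
        private final CharArraySet keepWords;
        private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

        public KeepOnlyFilter(Version matchVersion, TokenStream in, CharArraySet keepWords) {
            super(matchVersion, in);
            this.keepWords = keepWords;
        }

        // Keep the current term only if it is in the keep set.
        @Override
        protected boolean accept() {
            return keepWords.contains(termAtt.buffer(), 0, termAtt.length());
        }
    }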
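If you want to check the effect of flipping that one negation without setting up a Lucene index, here is a small plain-Java sketch that puts the two accept rules side by side. KeepVersusStopSketch and its methods are hypothetical names, and a java.util.Set stands in for CharArraySet, so treat it as an illustration of the logic rather than of the Lucene API.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

final class KeepVersusStopSketch {
    // Mirrors StopFilter.accept(): a term passes when it is NOT in the set.
    static List<String> stop(List<String> terms, Set<String> words) {
        return terms.stream()
                .filter(term -> !words.contains(term))
                .collect(Collectors.toList());
    }

    // Mirrors KeepOnlyFilter.accept(): a term passes only when it IS in the set.
    static List<String> keepOnly(List<String> terms, Set<String> words) {
        return terms.stream()
                .filter(term -> words.contains(term))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("lucene", "is", "a", "search", "library");
        Set<String> words = new HashSet<>(Arrays.asList("lucene", "search"));
        System.out.println(stop(terms, words));     // prints [is, a, library]
        System.out.println(keepOnly(terms, words)); // prints [lucene, search]
    }
}
```

Matching in this sketch is exact, so "Lucene" would not match "lucene". In Lucene itself, CharArraySet takes an ignoreCase flag at construction. That matters here because the analyzer in the question has no lowercasing step.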
Source: https://stackoverflow.com/questions/24145688/how-to-tokenize-only-certain-words-in-lucene