Is there a way to integrate PorterStemFilter into StandardAnalyzer in Lucene, or do I have to copy/paste StandardAnalyzer's source code and add the filter, since StandardAnalyzer is defined as a final class? Is there a smarter way?
Also, if I would like to ignore numbers, how can I achieve that?
Thanks
If you want to use this combination for English text analysis, then you should use Lucene's EnglishAnalyzer. Otherwise, you could create a new Analyzer that extends AnalyzerWrapper, as shown below.
    import java.io.IOException;
    import java.io.StringReader;
    import java.util.HashSet;
    import java.util.Set;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.AnalyzerWrapper;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.core.TypeTokenFilter;
    import org.apache.lucene.analysis.en.PorterStemFilter;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.util.Version;

    public class PorterAnalyzer extends AnalyzerWrapper {

        private final Analyzer baseAnalyzer;

        public PorterAnalyzer(Analyzer baseAnalyzer) {
            this.baseAnalyzer = baseAnalyzer;
        }

        @Override
        public void close() {
            baseAnalyzer.close();
            super.close();
        }

        @Override
        protected Analyzer getWrappedAnalyzer(String fieldName) {
            return baseAnalyzer;
        }

        @Override
        protected TokenStreamComponents wrapComponents(String fieldName, TokenStreamComponents components) {
            // Token stream produced by the wrapped analyzer.
            TokenStream ts = components.getTokenStream();
            // Drop tokens that the tokenizer typed as numbers.
            Set<String> filteredTypes = new HashSet<>();
            filteredTypes.add("<NUM>");
            TypeTokenFilter numberFilter = new TypeTokenFilter(Version.LUCENE_46, ts, filteredTypes);
            // Stem the remaining tokens.
            PorterStemFilter porterStem = new PorterStemFilter(numberFilter);
            return new TokenStreamComponents(components.getTokenizer(), porterStem);
        }

        public static void main(String[] args) throws IOException {
            //Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_46);
            PorterAnalyzer analyzer = new PorterAnalyzer(new StandardAnalyzer(Version.LUCENE_46));
            String text = "This is a testing example. It should tests the Porter stemmer version 111";
            TokenStream ts = analyzer.tokenStream("fieldName", new StringReader(text));
            CharTermAttribute ca = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.println(ca.toString());
            }
            // Finish and release the stream before closing the analyzer.
            ts.end();
            ts.close();
            analyzer.close();
        }
    }
The code above is based on a Lucene forum thread. The main work is done in the wrapComponents method: you first get the TokenStream from the wrapped analyzer, then apply a type filter to drop numeric tokens, and finally apply the Porter stem filter. I hope that is clear.
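If the ordering of the steps is hard to picture, here is a rough, Lucene-free sketch of the same idea in plain Java: tokens carry a type, the numeric ones are dropped first, and only the survivors are stemmed. Everything in it is made up for illustration (the FilterChainSketch class, its Token type, and the toyStem method, which is a crude suffix stripper and not the Porter algorithm). It also skips stop-word removal, so it shows the shape of the chain rather than what StandardAnalyzer really emits.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; none of these names come from Lucene.
final class FilterChainSketch {

    /** A term plus a type label, loosely mirroring what a tokenizer attaches to each token. */
    static final class Token {
        final String text;
        final String type;

        Token(String text, String type) {
            this.text = text;
            this.type = type;
        }
    }

    /** Stage 1: split on non-alphanumerics, lowercase, and tag all-digit tokens as "<NUM>". */
    static List<Token> tokenize(String input) {
        List<Token> tokens = new ArrayList<>();
        for (String piece : input.split("[^A-Za-z0-9]+")) {
            if (piece.isEmpty()) {
                continue; // split() can yield a leading empty string
            }
            String type = piece.matches("[0-9]+") ? "<NUM>" : "<ALPHANUM>";
            tokens.add(new Token(piece.toLowerCase(Locale.ROOT), type));
        }
        return tokens;
    }

    /** Stage 2: drop every token whose type equals unwantedType. */
    static List<Token> dropType(List<Token> in, String unwantedType) {
        List<Token> out = new ArrayList<>();
        for (Token t : in) {
            if (!t.type.equals(unwantedType)) {
                out.add(t);
            }
        }
        return out;
    }

    /** Stage 3: toy suffix stripper (assumption: a stand-in for a real stemmer, NOT Porter). */
    static String toyStem(String term) {
        if (term.length() > 5 && term.endsWith("ing")) {
            return term.substring(0, term.length() - 3);
        }
        if (term.length() > 3 && term.endsWith("s") && !term.endsWith("ss")) {
            return term.substring(0, term.length() - 1);
        }
        return term;
    }

    /** Chains the stages in the same order as wrapComponents: type filter first, then stemming. */
    public static List<String> analyze(String input) {
        List<String> terms = new ArrayList<>();
        for (Token t : dropType(tokenize(input), "<NUM>")) {
            terms.add(toyStem(t.text));
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(analyze("It tests the stemmer version 111"));
    }
}
```

Filtering by type before stemming keeps the stemmer from ever seeing the numeric tokens, which is the same reason the number filter is wrapped inside the stem filter in the answer's code.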
Source: https://stackoverflow.com/questions/25714455/standardanalyzer-with-stemming