How to tokenize only certain words in Lucene

I'm using Lucene for my project and I need a custom Analyzer.

Code is:

public class MyCommentAnalyzer extends Analyzer {

@Override
    protected To


                      
1 answer

逝去的感伤, 2021-01-16 13:28

You're right to look to StopFilter for a starting point, so read the source!

Most of StopFilter's source consists of convenience methods for building the stop set. You can safely ignore all of that (unless you want to keep it around for building your keep set).

Cut all that, and StopFilter boils down to:

public final class StopFilter extends FilteringTokenFilter {

    private final CharArraySet stopWords;
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

    public StopFilter(Version matchVersion, TokenStream in, CharArraySet stopWords) {
        super(matchVersion, in);
        this.stopWords = stopWords;
    }

    @Override
    protected boolean accept() {
        return !stopWords.contains(termAtt.buffer(), 0, termAtt.length());
    }
}


FilteringTokenFilter is a pretty simple class to extend. The key is the accept method. It is called for the current term: if it returns true, the term is passed on to the output stream; if it returns false, the term is discarded.
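
To make the accept contract concrete, here is a minimal plain-Java sketch of the idea. It is a toy model, not Lucene's implementation: ToyFilteringStream, AcceptSketch and longerThan are hypothetical names invented for illustration, and the real FilteringTokenFilter also handles bookkeeping such as position increments, which is left out here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Toy stand-in for the accept() contract; this is NOT Lucene's FilteringTokenFilter.
abstract class ToyFilteringStream {
    private final Iterator<String> input;
    protected String term; // the "current term", visible to accept()

    ToyFilteringStream(Iterator<String> input) {
        this.input = input;
    }

    // Subclasses decide whether the current term survives.
    protected abstract boolean accept();

    // Advance to the next accepted term; return false once the input is exhausted.
    final boolean incrementToken() {
        while (input.hasNext()) {
            term = input.next();
            if (accept()) {
                return true;
            }
            // Rejected: skip this term and keep reading.
        }
        return false;
    }

    // Drain the stream into a list, for demonstration only.
    final List<String> drain() {
        List<String> out = new ArrayList<>();
        while (incrementToken()) {
            out.add(term);
        }
        return out;
    }
}

class AcceptSketch {
    // Hypothetical helper: keep only terms longer than minLength characters.
    static List<String> longerThan(List<String> tokens, int minLength) {
        return new ToyFilteringStream(tokens.iterator()) {
            @Override
            protected boolean accept() {
                return term.length() > minLength;
            }
        }.drain();
    }

    public static void main(String[] args) {
        // prints [quick]
        System.out.println(longerThan(Arrays.asList("a", "quick", "fox"), 3));
    }
}
```

The only part a subclass has to supply is accept; the loop that skips rejected terms lives in the base class.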

So the only thing you really need to change in StopFilter is to delete a single character, the ! in accept, so that it returns the opposite of what it currently does. It wouldn't hurt to change a few names here and there as well.

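// Package names below assume Lucene 4.4-4.10, where FilteringTokenFilter
// takes a Version argument; other releases differ.
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.util.CharArraySet;
import org.apache.lucene.analysis.util.FilteringTokenFilter;
import org.apache.lucene.util.Version;

public final class KeepOnlyFilter extends FilteringTokenFilter {

    // Terms allowed through; everything else is dropped.
    private final CharArraySet keepWords;
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

    public KeepOnlyFilter(Version matchVersion, TokenStream in, CharArraySet keepWords) {
        super(matchVersion, in);
        this.keepWords = keepWords;
    }

    @Override
    protected boolean accept() {
        // Same test as StopFilter.accept(), minus the leading '!'.
        return keepWords.contains(termAtt.buffer(), 0, termAtt.length());
    }
}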
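
As a quick sanity check of the inversion, the following plain-Java sketch runs the stop-style and keep-style tests over the same tokens. It does not use Lucene at all: KeepVsStopSketch, its methods, and the sample words are made up for illustration, and a HashSet stands in for CharArraySet.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

class KeepVsStopSketch {
    // Stop-style: drop every token found in the set.
    static List<String> stop(List<String> tokens, Set<String> words) {
        return tokens.stream()
                .filter(t -> !words.contains(t))
                .collect(Collectors.toList());
    }

    // Keep-style: the same test with the negation removed.
    static List<String> keepOnly(List<String> tokens, Set<String> words) {
        return tokens.stream()
                .filter(t -> words.contains(t))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("lucene", "is", "a", "search", "library");
        Set<String> words = new HashSet<>(Arrays.asList("lucene", "search"));
        System.out.println(stop(tokens, words));     // prints [is, a, library]
        System.out.println(keepOnly(tokens, words)); // prints [lucene, search]
    }
}
```

Both methods preserve the original token order; they differ only in the sign of the membership test, which mirrors the one-character change between the two accept methods.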

                                                                        
                                                        
            
            
              
                