Hibernate Search: How to use wildcards correctly?

Asked by 后悔当初 on 2020-11-27 22:50

I have the following query to search patients by full name, for a specific medical center:

MustJunction mj = qb.bool().must(qb.keyword()
    .onField("medi…

1 Answer
  • 2020-11-27 23:20

    Short answer: don't use wildcard queries; use a custom analyzer with an EdgeNGramFilterFactory. Also, don't try to analyze the query yourself (which is what you did by splitting the query into terms): Lucene will do it much better (with a WhitespaceTokenizerFactory, an ASCIIFoldingFilterFactory and a LowerCaseFilterFactory in particular).

    Long answer:

    Wildcard queries are useful as quick and easy solutions to one-time problems, but they are not very flexible and reach their limits quite quickly. In particular, as @femtoRgon mentioned, these queries are not analyzed, so an uppercase query won't match a lowercase name, for instance.
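
    For reference, this is roughly what the wildcard approach under discussion looks like with the Hibernate Search 5 query DSL (a sketch only: the "name" field is an assumption, and qb is the QueryBuilder from the question):

    // Sketch of a wildcard query; as noted above, the pattern is not analyzed,
    // so an uppercase "Mau*" won't match the lowercased tokens in the index.
    org.apache.lucene.search.Query wildcardQuery = qb.keyword()
            .wildcard()
            .onField("name")
            .matching("Mau*")
            .createQuery();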

    The classic solution to most problems in the Lucene world is to use specially-crafted analyzers at index time and query time (not necessarily the same). In your case, you will want to use this kind of analyzer when indexing:

    @AnalyzerDef(name = "edgeNgram",
        tokenizer = @TokenizerDef(factory = WhitespaceTokenizerFactory.class),
        filters = {
                @TokenFilterDef(factory = ASCIIFoldingFilterFactory.class), // Replace accented characters by their simpler counterpart (è => e, etc.)
                @TokenFilterDef(factory = LowerCaseFilterFactory.class), // Lowercase all characters
                @TokenFilterDef(
                        factory = EdgeNGramFilterFactory.class, // Generate prefix tokens
                        params = {
                                @Parameter(name = "minGramSize", value = "1"),
                                @Parameter(name = "maxGramSize", value = "10")
                        }
                )
        })
    

    And this kind when querying:

    @AnalyzerDef(name = "edgeNGram_query",
        tokenizer = @TokenizerDef(factory = WhitespaceTokenizerFactory.class),
        filters = {
                @TokenFilterDef(factory = ASCIIFoldingFilterFactory.class), // Replace accented characters by their simpler counterpart (è => e, etc.)
                @TokenFilterDef(factory = LowerCaseFilterFactory.class) // Lowercase all characters
        })
    

    The index analyzer will transform "Mauricio Ubilla Carvajal" to this list of tokens:

    • m
    • ma
    • mau
    • maur
    • mauri
    • mauric
    • maurici
    • mauricio
    • u
    • ub
    • ...
    • ubilla
    • c
    • ca
    • ...
    • carvajal

    And the query analyzer will turn the query "mau UB" into ["mau", "ub"], which will match the indexed name (both tokens are present in the index).
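
    If you want to see these tokens for yourself outside of Hibernate Search, a small standalone sketch with Lucene's CustomAnalyzer (available in the Lucene 5.x versions that Hibernate Search 5 uses) can reproduce the index-time analysis chain; the class name here is just a placeholder:

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.core.LowerCaseFilterFactory;
    import org.apache.lucene.analysis.core.WhitespaceTokenizerFactory;
    import org.apache.lucene.analysis.custom.CustomAnalyzer;
    import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilterFactory;
    import org.apache.lucene.analysis.ngram.EdgeNGramFilterFactory;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.util.Version;

    public class EdgeNGramDemo {
        public static void main(String[] args) throws Exception {
            // Same chain as the "edgeNgram" @AnalyzerDef above
            Analyzer analyzer = CustomAnalyzer.builder()
                    .withDefaultMatchVersion(Version.LATEST)
                    .withTokenizer(WhitespaceTokenizerFactory.class)
                    .addTokenFilter(ASCIIFoldingFilterFactory.class)
                    .addTokenFilter(LowerCaseFilterFactory.class)
                    .addTokenFilter(EdgeNGramFilterFactory.class,
                            "minGramSize", "1", "maxGramSize", "10")
                    .build();

            // Print the tokens produced for the example name
            try (TokenStream stream = analyzer.tokenStream("name", "Mauricio Ubilla Carvajal")) {
                CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
                stream.reset();
                while (stream.incrementToken()) {
                    System.out.println(term.toString()); // m, ma, mau, ..., ubilla, ..., carvajal
                }
                stream.end();
            }
        }
    }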

    Note that you'll obviously have to assign the analyzer to the field. For the indexing part, it's done using the @Analyzer annotation (see the mapping sketch after the query example below). For the query part, you'll have to use overridesForField on the query builder, as shown below:

    QueryBuilder queryBuilder = fullTextEntityManager.getSearchFactory().buildQueryBuilder().forEntity(Hospital.class)
        .overridesForField( "name", "edgeNGram_query" )
        .get();
    // Then it's business as usual
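
    For completeness, the indexing side might look like the sketch below, reusing the Hospital entity and the "name" field from the query example (the rest of the entity is assumed; the @AnalyzerDef blocks above would typically be declared on this class or in package-info.java):

    import javax.persistence.Entity;
    import javax.persistence.Id;

    import org.hibernate.search.annotations.Analyzer;
    import org.hibernate.search.annotations.Field;
    import org.hibernate.search.annotations.Indexed;

    @Entity
    @Indexed
    public class Hospital {

        @Id
        private Long id;

        // Indexed with the "edgeNgram" analyzer defined above,
        // so prefix tokens are generated at indexing time
        @Field(analyzer = @Analyzer(definition = "edgeNgram"))
        private String name;

        // getters and setters...
    }

    With the field mapped like this, the keyword query built from the QueryBuilder above (with the "edgeNGram_query" override) can be passed to fullTextEntityManager.createFullTextQuery(...) and executed as usual.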
    

    Also note that, in Hibernate Search 5, Elasticsearch analyzer definitions are only generated by Hibernate Search if they are actually assigned to an index. So the query analyzer definition will not, by default, be generated, and Elasticsearch will complain that it does not know the analyzer. Here is a workaround: https://discourse.hibernate.org/t/cannot-find-the-overridden-analyzer-when-using-overridesforfield/1043/4?u=yrodiere
