Search with various combinations of space, hyphen, casing and punctuation

迷失自我 2021-01-04 08:00
My schema:

      
      
        
4 answers

        
                    
            
            
                         
                
              
              
                
伪装坚强ぢ (original poster) 2021-01-04 08:54
              

            
            
                        

  Why does "WalMart" not match "Walmart" with my initial schema?


Because the mm (minimum should match) parameter of your DisMax/eDisMax handler is set too high. I experimented with it: when mm is set to 100%, you get no match. But why?

Because you are using the same analyzer at query time and at index time. Your search term "WalMart" is split into three tokens: "wal", "mart" and "walmart". Solr then counts each token individually towards the 100%*.

By the way, I was able to reproduce your problem, but only in one direction: it occurs when indexing "Walmart" and querying with "WalMart". The other way around (indexing "WalMart", querying with "Walmart") works fine.
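The asymmetry can be illustrated with a toy model of the explanation above. This is only a sketch: `analyze` and `matches` are hypothetical helpers that imitate a word-delimiter style analyzer and a percentage-based mm check, not Solr's actual implementation.

```python
import re


def analyze(text):
    """Toy analyzer (an assumption, not Solr code): split each word on
    lower-to-upper case changes, lowercase the parts, and also emit the
    joined form when a split happened."""
    tokens = []
    for word in text.split():
        parts = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", word).split()
        tokens.extend(part.lower() for part in parts)
        if len(parts) > 1:
            tokens.append(word.lower())
    return tokens


def matches(indexed_text, query_text, mm_percent):
    """Return True if enough query tokens are found among the indexed tokens.
    The required count is the percentage of query tokens, rounded down."""
    indexed = set(analyze(indexed_text))
    query = analyze(query_text)
    required = (len(query) * mm_percent) // 100
    hits = sum(1 for token in query if token in indexed)
    return hits >= required


print(analyze("WalMart"))                   # three tokens at query time
print(matches("Walmart", "WalMart", 100))   # indexed "Walmart", queried "WalMart"
print(matches("WalMart", "Walmart", 100))   # the other way around
```

In this model the query "WalMart" needs all three of its tokens at 100%, but a document indexed as "Walmart" only contains one of them. The reverse direction produces a single query token, which the indexed document does contain.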

You can override this with LocalParams. For example, you could rephrase your query as {!mm=1}WalMart.
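When sending such a query over HTTP, the LocalParams prefix has to be URL-encoded. A minimal sketch, assuming a local Solr instance and a core called `mycore` (both the URL and the `build_query_url` helper are hypothetical, chosen only for illustration):

```python
from urllib.parse import urlencode

# Hypothetical endpoint; adjust host, port and core name to your setup.
SOLR_SELECT_URL = "http://localhost:8983/solr/mycore/select"


def build_query_url(user_query, mm="1"):
    """Prefix the user query with a LocalParams mm override and
    return the URL-encoded select URL."""
    q = "{!mm=%s}%s" % (mm, user_query)
    return SOLR_SELECT_URL + "?" + urlencode({"q": q})


print(build_query_url("WalMart"))
```

The braces, exclamation mark and equals sign are percent-encoded by `urlencode`, so the prefix arrives at Solr unchanged.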


  There are more slightly complex ones like [ ... ] "Mc Donald's" [ to match ] Words with different punctuations: "Mc-Donald Engineering Company, Inc."


Here, too, adjusting the mm parameter helps.


  In general, what's the best way to go around modeling the schema with this kind of requirement?


Here I agree with Sujit Pal: you should implement your own copy of the SynonymFilter. Why? Because it works differently from the other filters and tokenizers. It creates tokens in place, at the offset of the indexed words.

What does "in place" mean? It will not increase the token count of your query. And you can perform the back hyphenation (joining two words that are separated by a blank).
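The idea of in-place tokens can be sketched with a toy model. The code below is not the Lucene API: tokens are plain `(term, position)` pairs, and `with_joined_neighbors` is a hypothetical helper that adds the joined form of two neighbouring words at the position of the first word, so the number of positions stays the same.

```python
def with_joined_neighbors(words):
    """Emit each word at its own position, plus the join of the word and
    its right neighbour stacked on the same position (toy model only)."""
    tokens = []
    for pos, word in enumerate(words):
        tokens.append((word.lower(), pos))
        if pos + 1 < len(words):
            joined = (word + words[pos + 1]).lower()
            tokens.append((joined, pos))  # same position: added "in place"
    return tokens


def position_count(tokens):
    """Number of distinct positions, regardless of how many terms share one."""
    return len({pos for _, pos in tokens})


tokens = with_joined_neighbors(["Mc", "Donald"])
print(tokens)
print(position_count(tokens))
```

Here "mcdonald" shares position 0 with "mc", so two input words still occupy two positions even though three terms are emitted.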


  But we are lacking a good synonyms.txt and cannot keep it up-to-date.


When extending or copying the SynonymFilter, ignore the static mapping. You can remove the code that maps the words; you only need the offset handling.

Update: I think you can also try the PatternCaptureGroupTokenFilter, but tackling company names with regular expressions may soon reach its limits. I will look into this later.



* You can find this in your solrconfig.xml, have a look for your 
    
             
                                                        
            
            
              
                