SOLR and accented characters

生来不讨喜 2021-01-27 07:12

I have an index for occupations (identifier + occupation):




        
3 Answers
  • 2021-01-27 07:32

    I don't think MySQL or your JVM settings have anything to do with this. I suspect one query works and the other does not, probably because of the SpanishLightStemFilterFactory.

    The right way to get matches regardless of diacritics is to use the following:

      <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
    

    Put that before your tokenizer in both the index and query analyzer chains, and any diacritic will be converted to its ASCII version. That should make matching work consistently.
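
    For illustration, here is a minimal sketch of a fieldType with the char filter placed before the tokenizer in both analyzers (the fieldType name and the downstream filters are assumptions; adapt them to your existing schema):

      <fieldType name="text_es_folded" class="solr.TextField" positionIncrementGap="100">
        <analyzer type="index">
          <!-- Fold accented characters to ASCII before tokenization -->
          <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
          <tokenizer class="solr.StandardTokenizerFactory"/>
          <filter class="solr.LowerCaseFilterFactory"/>
          <filter class="solr.SpanishLightStemFilterFactory"/>
        </analyzer>
        <analyzer type="query">
          <!-- Apply the same mapping at query time so both sides see unaccented text -->
          <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
          <tokenizer class="solr.StandardTokenizerFactory"/>
          <filter class="solr.LowerCaseFilterFactory"/>
          <filter class="solr.SpanishLightStemFilterFactory"/>
        </analyzer>
      </fieldType>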

  • 2021-01-27 07:34

    OK, I have found the source of the problem. I opened my SQL load script in vi, in hex mode.

    This is the hex content for 'Agrónomo' in an INSERT statement: 41 67 72 6f cc 81 6e 6f 6d 6f.

    6f cc 81!!! That is a plain "o" followed by cc 81, the UTF-8 encoding of U+0301 COMBINING ACUTE ACCENT (the decomposed form).
    

    So that's the problem... It should be "c3 b3", the precomposed "ó" (U+00F3). I got the literals by copy/pasting from a web page, so the source characters were the problem.
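
    For reference, this kind of decomposed input can also be normalized inside Solr. A hedged sketch, assuming the ICU analysis module (analysis-extras) is available: an ICU normalizer char filter can compose "o" + combining acute into the precomposed "ó" before tokenization, so both encodings index and match identically.

      <analyzer>
        <!-- Compose combining marks (decomposed/NFD input) into precomposed characters (NFC) -->
        <charFilter class="solr.ICUNormalizer2CharFilterFactory" name="nfc" mode="compose"/>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>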

    Thanks to both of you; I have learned more about SOLR's soul.

    Regards.

  • 2021-01-27 07:36

    Just add solr.ASCIIFoldingFilterFactory to your analyzer's filter chain, or even better, create a new fieldType:

    <!-- Spanish -->
    <fieldType name="text_es_ascii_folding" class="solr.TextField" positionIncrementGap="100">
      <analyzer> 
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.ASCIIFoldingFilterFactory" />
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_es.txt" format="snowball" />
        <filter class="solr.SpanishLightStemFilterFactory"/>
      </analyzer>
    </fieldType>
    

    This filter converts alphabetic, numeric, and symbolic Unicode characters which are not in the Basic Latin Unicode block (the first 127 ASCII characters) to their ASCII equivalents, if one exists.

    This should let the search match even when the accented character is missing. The downside is that words like "cañón" and "canon" become equivalent and, IIRC, both hit the same documents.
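
    To use it, the schema would point the relevant field at the new type; a sketch, with field names assumed from the question:

      <field name="id" type="string" indexed="true" stored="true" required="true"/>
      <field name="occupation" type="text_es_ascii_folding" indexed="true" stored="true"/>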
