Case-insensitive storage and unicode compatibility

After I heard of someone at my work using String.toLowerCase() to store case-insensitive codes in a database for searchability, I had an epic fail moment thinki

3 answers

小鲜肉

2020-12-08 23:18

Specify a locale for toLowerCase() instead of using the system default. This protects against changes to the system locale.
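A minimal sketch of what an explicit locale changes, assuming Locale.ROOT is an acceptable locale-neutral choice for stored codes. The class and method names are made up for illustration:

```java
import java.util.Locale;

class LocaleLowercase {
    // Lowercases with locale-neutral rules, independent of the JVM default locale.
    static String lowerRoot(String s) {
        return s.toLowerCase(Locale.ROOT);
    }

    // Lowercases with Turkish rules, where 'I' maps to dotless 'ı' (U+0131).
    static String lowerTurkish(String s) {
        return s.toLowerCase(Locale.forLanguageTag("tr-TR"));
    }

    public static void main(String[] args) {
        // Same input, different stored value depending on the locale in effect.
        System.out.println(lowerRoot("TITLE").equals("title"));          // true
        System.out.println(lowerTurkish("TITLE").equals("t\u0131tle"));  // true
    }
}
```

If the no-argument toLowerCase() were used instead, the stored value would depend on whichever of these behaviours the default locale of the machine happened to select.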

As for possible Unicode changes in future versions of Java, I don't think it's worth writing code to handle this. Document that the product supports Java 6 and move on to a feature that your customers actually want.

萌比男神i

2020-12-08 23:21

You do not want to store the lowercase version of a string "for searchability"!!

That is the wrong approach altogether. You are making unjustified and incorrect assumptions about how Unicode casing works.

This is why Unicode defines a separate thing called a casefold for a string, distinct from the three different cases (lowercase, titlecase, and uppercase).

Here are ten different examples where you will do the wrong thing if you use the lowercase instead of the casefold:

ORIGINAL        CASEFOLD        LOWERCASE       TITLECASE       UPPERCASE
=========================================================================
eﬃcient         efficient       eﬃcient         Eﬃcient         EFFICIENT
ﬂour            flour           ﬂour            Flour           FLOUR
poſt            post            poſt            Poſt            POST
poﬅ             post            poﬅ             Poﬅ             POST
ﬅop             stop            ﬅop             Stop            STOP
tschüß          tschüss         tschüß          Tschüß          TSCHÜSS
weiß            weiss           weiß            Weiß            WEISS
WEIẞ            weiss           weiß            Weiß            WEIẞ
στιγμας         στιγμασ         στιγμας         Στιγμας         ΣΤΙΓΜΑΣ
ᾲ στο διάολο    ὰι στο διάολο   ᾲ στο διάολο    Ὰͅ Στο Διάολο   ᾺΙ ΣΤΟ ΔΙΆΟΛΟ

And yes, I know the plural of stigma is stigmata not stigmas; I am trying to show the final sigma issue. Both ς and σ are valid lowercase versions of the uppercase sigma, Σ. If you store “just the lowercase”, then you will get the wrong thing.
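The JDK itself has no casefold method, so here is a hedged sketch of a comparison key built only from the JDK's full case mappings. It is an approximation, not the Unicode casefold, and it returns a different string for some inputs (for example it keeps the final sigma that the casefold column above replaces). The class and method names are invented for illustration. If adding a library is an option, ICU4J's UCharacter.foldCase is the real thing.

```java
import java.util.Locale;

class ApproxFold {
    // NOT a true Unicode casefold: lower -> upper -> lower using full case mappings.
    // The first lowercase step turns U+1E9E (capital sharp s) into U+00DF,
    // so that the uppercase step can expand it to "SS".
    static String key(String s) {
        return s.toLowerCase(Locale.ROOT)
                .toUpperCase(Locale.ROOT)
                .toLowerCase(Locale.ROOT);
    }

    static boolean sameIgnoringCase(String a, String b) {
        return key(a).equals(key(b));
    }

    public static void main(String[] args) {
        String weissSharp = "wei\u00DF";   // weiß
        String weissUpper = "WEISS";

        // Plain lowercasing keeps the two spellings apart.
        System.out.println(weissSharp.toLowerCase(Locale.ROOT)
                .equals(weissUpper.toLowerCase(Locale.ROOT)));        // false

        // The approximate key brings them together.
        System.out.println(sameIgnoringCase(weissSharp, weissUpper)); // true
    }
}
```

Whatever key function is chosen, it has to be applied both when storing and when querying, and the original string should be stored alongside the key rather than replaced by it.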

If you are using Java’s Pattern class, you must specify both CASE_INSENSITIVE and UNICODE_CASE, and you still will not get these right, because while Java uses full casemapping, it uses only simple casefolding. This is a problem.
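To illustrate the limitation, here is a small sketch (class and method names are invented) that matches a literal string with both flags set. One-to-one case pairs such as é/É match, but a pair that needs a one-to-many fold, such as ß against SS, does not:

```java
import java.util.regex.Pattern;

class RegexFolding {
    // Matches the whole input against a literal, case-insensitively, with Unicode-aware case.
    static boolean matchesIgnoreCase(String literal, String input) {
        int flags = Pattern.LITERAL | Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE;
        return Pattern.compile(literal, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        // é vs É: a simple one-to-one case pair, so this matches.
        System.out.println(matchesIgnoreCase("\u00E9t\u00E9", "\u00C9T\u00C9")); // true

        // ß vs SS: needs a full (one-to-many) casefold, which Pattern does not apply.
        System.out.println(matchesIgnoreCase("wei\u00DF", "WEISS"));             // false
    }
}
```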

As for the Turkic languages, yes, it is true that there is a special casefold for Turkic. For example, İstanbul has a Turkic casefold of just istanbul instead of the i̇stanbul that you are supposed to get. Since I am sure those will not look right to you, I’ll spell it out with named characters for the non-ASCII; in plainer terms, "\N{LATIN CAPITAL LETTER I WITH DOT ABOVE}stanbul" has a Turkic casefold of plain "istanbul" rather than the "i\N{COMBINING DOT ABOVE}stanbul" that you normally get. (The dotless ı belongs to the other Turkic rule: there, a plain capital I folds to "\N{LATIN SMALL LETTER DOTLESS I}".)

Here are a couple more table rows if you’re writing a regression testing suite:

[ "Henry Ⅷ", "henry ⅷ", "henry ⅷ", "Henry Ⅷ", "HENRY Ⅷ" ],
[ "I Work At Ⓚ", "i work at ⓚ", "i work at ⓚ", "I Work At Ⓚ", "I WORK AT Ⓚ" ],
[ "ʀᴀʀᴇ", "ʀᴀʀᴇ", "ʀᴀʀᴇ", "Ʀᴀʀᴇ", "ƦᴀƦᴇ" ],
[ "Ԧԧ", "ԧԧ", "ԧԧ", "Ԧԧ", "ԦԦ" ],
[ "

感动是毒

2020-12-08 23:33

I think the most durable long-term solution is to:

1. Record the current default locale and the technology stack version (in my case, the Java version) in configuration.
2. If either has changed (since the last startup, or while running, depending on how the technology stack loads the locale), lock the store and re-index all affected data sets.

Obviously, this needs to occur at the primary interface level; if I'm making these changes in Java, I had better hope that it's my only data interface mechanism (e.g. that other technologies are not querying the underlying table store).
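A rough sketch of that check, assuming the recorded values live in a java.util.Properties object that the application persists somewhere. The property key names and the class name are hypothetical, and the locking and re-indexing themselves are left out:

```java
import java.util.Locale;
import java.util.Properties;

class EnvironmentFingerprint {
    // Hypothetical key names; any stable names would do.
    static final String LOCALE_KEY = "index.locale";
    static final String JAVA_KEY = "index.javaVersion";

    // Captures the values the stored lowercase/casefolded data depends on.
    static Properties current() {
        Properties p = new Properties();
        p.setProperty(LOCALE_KEY, Locale.getDefault().toLanguageTag());
        p.setProperty(JAVA_KEY, System.getProperty("java.version"));
        return p;
    }

    // True when the recorded fingerprint is missing or differs from the current one.
    static boolean needsReindex(Properties stored, Properties now) {
        return !now.getProperty(LOCALE_KEY).equals(stored.getProperty(LOCALE_KEY))
            || !now.getProperty(JAVA_KEY).equals(stored.getProperty(JAVA_KEY));
    }

    public static void main(String[] args) {
        Properties now = current();
        Properties firstRun = new Properties(); // nothing recorded yet
        System.out.println(needsReindex(firstRun, now)); // true
        System.out.println(needsReindex(now, now));      // false
    }
}
```

At startup the application would load the stored properties, call needsReindex, and only after a successful re-index write the current fingerprint back.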
                                                                        
                                                        
            
            
              
                