I know there is String#length, and the various methods in Character that more or less work on code units/code points. What is the suggested way to determine the length of a String in terms of the characters a user actually sees (graphemes)?
java.text.BreakIterator is able to iterate over text and can report on "character", word, sentence and line boundaries.
Consider this code:
def length(text: String, locale: java.util.Locale = java.util.Locale.ENGLISH) = {
  val charIterator = java.text.BreakIterator.getCharacterInstance(locale)
  charIterator.setText(text)
  var result = 0
  // next() advances to the next grapheme boundary; DONE marks the end of the text
  while (charIterator.next() != java.text.BreakIterator.DONE) result += 1
  result
}
Running it:
scala> val text = "Thîs lóo̰ks we̐ird!"
text: java.lang.String = Thîs lóo̰ks we̐ird!
scala> val length = length(text)
length: Int = 17
scala> val codepoints = text.codePointCount(0, text.length)
codepoints: Int = 21
With surrogate pairs:
scala> val parens = "\uDBFF\uDFFCsurpi\u0301se!\uDBFF\uDFFD"
parens: java.lang.String =
It depends on exactly what you mean by "length of [the] String":

- String.length() is the number of 16-bit char values (UTF-16 code units) needed to encode the String.
- String.codePointCount(int, int) is the number of Unicode code points in the String. This is normally only useful for programming related tasks that require looking at a String as a series of Unicode code points without needing to worry about multi-byte encoding interfering.
- BreakIterator.getCharacterInstance(Locale) finds the boundary of the next grapheme in a String for the given Locale. Using this multiple times allows you to count the number of graphemes in a String. Since graphemes are basically letters (in most circumstances), this method is useful for getting the number of writable characters the String contains. Essentially it returns approximately the same number you would get by manually counting the letters in the String, making it useful for things like sizing user interfaces and splitting Strings without corrupting the data.

To give you an idea of how the different methods can return different lengths for the exact same data, I created this class to quickly generate the lengths of the Unicode text contained within this page, which is designed to offer a comprehensive test of many different languages with non-English characters. Here are the results of executing that code after normalizing the input file in three different ways (no normalizing, NFC, NFD):
Input UTF-8 String
>> String.length() = 3431
>> String.codePointCount(int,int) = 3431
>> BreakIterator.getCharacterInstance(Locale) = 3386
NFC Normalized UTF-8 String
>> String.length() = 3431
>> String.codePointCount(int,int) = 3431
>> BreakIterator.getCharacterInstance(Locale) = 3386
NFD Normalized UTF-8 String
>> String.length() = 3554
>> String.codePointCount(int,int) = 3554
>> BreakIterator.getCharacterInstance(Locale) = 3386
As you can see, even the "same-looking" String can give different results for the length if you use either String.length() or String.codePointCount(int,int).
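The same comparison can be reproduced on a small scale. Here is a minimal, self-contained sketch (the sample string and class name are my own, chosen for illustration): a single combining accent makes the three counts diverge.

```java
import java.text.BreakIterator;
import java.util.Locale;

public class LengthComparison {
    public static void main(String[] args) {
        // "cafe" followed by U+0301 COMBINING ACUTE ACCENT:
        // renders as "café", but the accent is a separate code point
        String s = "cafe\u0301";

        System.out.println(s.length());                      // code units: 5
        System.out.println(s.codePointCount(0, s.length())); // code points: 5

        // Count graphemes by stepping through character boundaries
        BreakIterator it = BreakIterator.getCharacterInstance(Locale.ENGLISH);
        it.setText(s);
        int graphemes = 0;
        while (it.next() != BreakIterator.DONE) graphemes++;
        System.out.println(graphemes);                       // graphemes: 4
    }
}
```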
For more information on this topic and other similar topics you should read this blog post that covers a variety of basics on using Java to properly handle Unicode.
String.length() is specified as returning the number of char values ("code units") in the String. That is the most generally useful definition of the length of a Java String; see below.
Your description1 of the semantics of length based on the size of the backing array/array slice is incorrect. The fact that the value returned by length() is also the size of the backing array or array slice is merely an implementation detail of typical Java class libraries. String does not need to be implemented that way. Indeed, I think I've seen Java String implementations where it WASN'T implemented that way.
To get the number of Unicode code points in a String, use str.codePointCount(0, str.length()) -- see the javadoc.
To get the size (in bytes) of a String in some other encoding, use str.getBytes(charset).length.
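For instance (the sample string below is an assumption for illustration), the byte size depends on the charset and is generally not the same as the char count:

```java
import java.nio.charset.StandardCharsets;

public class ByteLength {
    public static void main(String[] args) {
        String s = "héllo"; // 'é' needs 2 bytes in UTF-8, but only 1 in ISO-8859-1

        System.out.println(s.length());                                     // 5 chars
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length);      // 6 bytes
        System.out.println(s.getBytes(StandardCharsets.ISO_8859_1).length); // 5 bytes
    }
}
```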
To deal with locale-specific issues, you can use Normalizer to normalize the String to whatever form is most appropriate to your use-case, and then use codePointCount as above.
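To illustrate why normalization matters here (a sketch; the sample string is mine): the same accented letter yields a different code point count depending on its normalization form.

```java
import java.text.Normalizer;

public class NormalizedCount {
    public static void main(String[] args) {
        String decomposed = "e\u0301"; // 'e' + U+0301 COMBINING ACUTE ACCENT
        // NFC composes the pair into the single code point U+00E9 ("é")
        String composed = Normalizer.normalize(decomposed, Normalizer.Form.NFC);

        System.out.println(decomposed.codePointCount(0, decomposed.length())); // 2
        System.out.println(composed.codePointCount(0, composed.length()));     // 1
    }
}
```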
But in some cases, even this won't work; e.g. the Hungarian letter counting rules which the Unicode standard apparently doesn't cater for.
The reason that most applications use String.length() is that most applications are not concerned with counting the number of characters in words, texts, etcetera in a human-centric way. For instance, if I do this:
String s = "hi mum how are you";
int pos = s.indexOf("mum");
String textAfterMum = s.substring(pos + "mum".length());
it really doesn't matter that "mum".length() is not returning code points, or that it is not a linguistically correct character count. It is measuring the length of the string using the model that is appropriate to the task at hand. And it works.
Obviously, things get a bit more complicated when you do multilingual text analysis; e.g. searching for words. But even then, if you normalize your text and parameters before you start, you can safely code in terms of "code units" rather than "code points" most of the time; i.e. length() still works.
1 - This description was on some versions of the question. See the edit history ... if you have sufficient rep points.
String.length() does not return the size of the array backing the string, but the actual length of the string, defined as "the number of Unicode code units in the string" (see API docs).
(As pointed out by Stephen C in the comments, Unicode code units == Java chars)
If this is not what you are looking for, then perhaps you should elaborate the question a bit more.
If you mean, counting the length of a string according to the grammatical rules of a language, then the answer is no, there's no such algorithm in Java, nor anywhere else.
Not unless the algorithm also does a full semantic analysis of the text.
In Hungarian, for example, sz and zs can count as one letter or two, depending on the composition of the word they appear in. (E.g.: ország is 5 letters, whereas torzság is 7.)
Update: If all you want is the Unicode standard character count (which, as I pointed out, isn't accurate), transforming your string to the NFKC form with java.text.Normalizer could be a solution.
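As a sketch of what NFKC does (the ligature example is my own, not from the answer): compatibility normalization expands presentation forms, which changes the resulting length.

```java
import java.text.Normalizer;

public class NfkcDemo {
    public static void main(String[] args) {
        String s = "\uFB01le"; // U+FB01 LATIN SMALL LIGATURE FI + "le"; renders as "file"
        // NFKC replaces the ligature with the plain letters "fi"
        String nfkc = Normalizer.normalize(s, Normalizer.Form.NFKC);

        System.out.println(s.length());    // 3: the ligature is a single char
        System.out.println(nfkc.length()); // 4: NFKC expanded it to "fi"
        System.out.println(nfkc);          // prints "file"
    }
}
```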