Truncating Strings by Bytes

后端未结

关注

 13  1722

I create the following for truncating a string in java to a new string with a given number of bytes.

        String truncatedValue = \"\";
        String curren


                      
              相关标签:


      
      
        
          13条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  旧时难觅i        
                
              
                            
                2021-02-06 04:42
              
            
            
                                                                       
This one could not be the more efficient solution but works


public static String substring(String s, int byteLimit) {
    if (s.getBytes().length <= byteLimit) {
        return s;
    }

    int n = Math.min(byteLimit-1, s.length()-1);
    do {
        s = s.substring(0, n--);
    } while (s.getBytes().length > byteLimit);

    return s;
}

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  春和景丽        
                
              
                            
                2021-02-06 04:44
              
            
            
                                                                       
As noted, Peter Lawrey solution has major performance disadvantage (~3,500msc for 10,000 times), Rex Kerr was much better (~500msc for 10,000 times) but the result not was accurate - it cut much more than it needed (instead of remaining 4000 bytes it remainds 3500 for some example). attached here my solution (~250msc for 10,000 times) assuming that UTF-8 max length char in bytes is 4 (thanks WikiPedia):

public static String cutWord (String word, int dbLimit) throws UnsupportedEncodingException{
    double MAX_UTF8_CHAR_LENGTH = 4.0;
    if(word.length()>dbLimit){
        word = word.substring(0, dbLimit);
    }
    if(word.length() > dbLimit/MAX_UTF8_CHAR_LENGTH){
        int residual=word.getBytes("UTF-8").length-dbLimit;
        if(residual>0){
            int tempResidual = residual,start, end = word.length();
            while(tempResidual > 0){
                start = end-((int) Math.ceil((double)tempResidual/MAX_UTF8_CHAR_LENGTH));
                tempResidual = tempResidual - word.substring(start,end).getBytes("UTF-8").length;
                end=start;
            }
            word = word.substring(0, end);
        }
    }
    return word;
}

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  慢半拍i        
                
              
                            
                2021-02-06 04:47
              
            
            
                                                                       
I think Rex Kerr's solution has 2 bugs.


First, it will truncate to limit+1 if a non-ASCII character is just before the limit. Truncating "123456789á1" will result in "123456789á" which is represented in 11 characters in UTF-8.
Second, I think he misinterpreted the UTF standard. https://en.wikipedia.org/wiki/UTF-8#Description shows that a 110xxxxx at the beginning of a UTF sequence tells us that the representation is 2 characters long (as opposed to 3). That's the reason his implementation usually doesn't use up all available space (as Nissim Avitan noted).


Please find my corrected version below:

public String cut(String s, int charLimit) throws UnsupportedEncodingException {
    byte[] utf8 = s.getBytes("UTF-8");
    if (utf8.length <= charLimit) {
        return s;
    }
    int n16 = 0;
    boolean extraLong = false;
    int i = 0;
    while (i < charLimit) {
        // Unicode characters above U+FFFF need 2 words in utf16
        extraLong = ((utf8[i] & 0xF0) == 0xF0);
        if ((utf8[i] & 0x80) == 0) {
            i += 1;
        } else {
            int b = utf8[i];
            while ((b & 0x80) > 0) {
                ++i;
                b = b << 1;
            }
        }
        if (i <= charLimit) {
            n16 += (extraLong) ? 2 : 1;
        }
    }
    return s.substring(0, n16);
}


I still thought this was far from effective. So if you don't really need the String representation of the result and the byte array will do, you can use this:

private byte[] cutToBytes(String s, int charLimit) throws UnsupportedEncodingException {
    byte[] utf8 = s.getBytes("UTF-8");
    if (utf8.length <= charLimit) {
        return utf8;
    }
    if ((utf8[charLimit] & 0x80) == 0) {
        // the limit doesn't cut an UTF-8 sequence
        return Arrays.copyOf(utf8, charLimit);
    }
    int i = 0;
    while ((utf8[charLimit-i-1] & 0x80) > 0 && (utf8[charLimit-i-1] & 0x40) == 0) {
        ++i;
    }
    if ((utf8[charLimit-i-1] & 0x80) > 0) {
        // we have to skip the starter UTF-8 byte
        return Arrays.copyOf(utf8, charLimit-i-1);
    } else {
        // we passed all UTF-8 bytes
        return Arrays.copyOf(utf8, charLimit-i);
    }
}


Funny thing is that with a realistic 20-500 byte limit they perform pretty much the same IF you create a string from the byte array again.

Please note that both methods assume a valid utf-8 input which is a valid assumption after using Java's getBytes() function.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  灰色年华        
                
              
                            
                2021-02-06 04:50
              
            
            
                                                                       
By using below Regular Expression also you can remove leading and trailing white space of double byte character.

stringtoConvert = stringtoConvert.replaceAll("^[\\s　]*", "").replaceAll("[\\s　]*$", "");

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  一生所求        
                
              
                            
                2021-02-06 04:50
              
            
            
                                                                       
Binary search approach in scala:

private def bytes(s: String) = s.getBytes("UTF-8")

def truncateToByteLength(string: String, length: Int): String =
  if (length <= 0 || string.isEmpty) ""
  else {
    @tailrec
    def loop(badLen: Int, goodLen: Int, good: String): String = {
      assert(badLen > goodLen, s"""badLen is $badLen but goodLen is $goodLen ("$good")""")
      if (badLen == goodLen + 1) good
      else {
        val mid = goodLen + (badLen - goodLen) / 2
        val midStr = string.take(mid)
        if (bytes(midStr).length > length)
          loop(mid, goodLen, good)
        else
          loop(badLen, mid, midStr)
      }
    }

    loop(string.length * 2, 0, "")
  }

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  春和景丽        
                
              
                            
                2021-02-06 04:52
              
            
            
                                                                       
Why not convert to bytes and walk forward--obeying UTF8 character boundaries as you do it--until you've got the max number, then convert those bytes back into a string?

Or you could just cut the original string if you keep track of where the cut should occur:

// Assuming that Java will always produce valid UTF8 from a string, so no error checking!
// (Is this always true, I wonder?)
public class UTF8Cutter {
  public static String cut(String s, int n) {
    byte[] utf8 = s.getBytes();
    if (utf8.length < n) n = utf8.length;
    int n16 = 0;
    int advance = 1;
    int i = 0;
    while (i < n) {
      advance = 1;
      if ((utf8[i] & 0x80) == 0) i += 1;
      else if ((utf8[i] & 0xE0) == 0xC0) i += 2;
      else if ((utf8[i] & 0xF0) == 0xE0) i += 3;
      else { i += 4; advance = 2; }
      if (i <= n) n16 += advance;
    }
    return s.substring(0,n16);
  }
}


^{Note: edited to fix bugs on 2014-08-25}
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
   
          
     上一页
1
2
3
下一页
           
           
        
                                  
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复