Truncating Strings by Bytes

醉酒成梦 2021-02-06 04:21

I created the following to truncate a string in Java to a new string with a given number of bytes.

        String truncatedValue = "";
        String curren         


        
13 Answers
  •  慢半拍i (OP)
    2021-02-06 04:47

    I think Rex Kerr's solution has 2 bugs.

    • First, it will truncate to limit+1 bytes if a non-ASCII character sits just before the limit. Truncating "123456789á1" to 10 bytes results in "123456789á", which takes 11 bytes in UTF-8.
    • Second, I think he misinterpreted the UTF-8 standard. https://en.wikipedia.org/wiki/UTF-8#Description shows that a 110xxxxx byte at the beginning of a UTF-8 sequence means the sequence is 2 bytes long (as opposed to 3). That's why his implementation usually doesn't use up all the available space (as Nissim Avitan noted).
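
    To make the byte counts in the two points above concrete, here is a minimal sketch that maps a lead byte to the length of its UTF-8 sequence and checks the "123456789á1" example. The class name Utf8LeadByteDemo and the helper sequenceLength are hypothetical, introduced only for illustration; they are not part of either implementation discussed here.

```java
import java.nio.charset.StandardCharsets;

final class Utf8LeadByteDemo {
    /** Hypothetical helper: byte length of the UTF-8 sequence starting with this lead byte. */
    static int sequenceLength(byte lead) {
        int b = lead & 0xFF;
        if ((b & 0x80) == 0x00) return 1; // 0xxxxxxx: ASCII
        if ((b & 0xE0) == 0xC0) return 2; // 110xxxxx: 2-byte sequence
        if ((b & 0xF0) == 0xE0) return 3; // 1110xxxx: 3-byte sequence
        if ((b & 0xF8) == 0xF0) return 4; // 11110xxx: 4-byte sequence
        throw new IllegalArgumentException("not a lead byte: 0x" + Integer.toHexString(b));
    }

    public static void main(String[] args) {
        // \u00E1 is the letter a with acute accent, written as an escape
        // so the source file encoding does not matter
        byte[] utf8 = "123456789\u00E11".getBytes(StandardCharsets.UTF_8);
        System.out.println(utf8.length);             // 12
        System.out.println(sequenceLength(utf8[9])); // 2 (lead byte 0xC3)
    }
}
```

    The first nine bytes are ASCII, bytes 9 and 10 hold the accented letter, and byte 11 is the trailing "1", so a 10-byte limit falls in the middle of the 2-byte sequence.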

    Please find my corrected version below:

    public String cut(String s, int charLimit) throws UnsupportedEncodingException {
        byte[] utf8 = s.getBytes("UTF-8");
        if (utf8.length <= charLimit) {
            return s;
        }
        int n16 = 0;
        boolean extraLong = false;
        int i = 0;
        while (i < charLimit) {
            // Code points above U+FFFF (4-byte UTF-8 sequences) need 2 chars (a surrogate pair) in UTF-16
            extraLong = ((utf8[i] & 0xF0) == 0xF0);
            if ((utf8[i] & 0x80) == 0) {
                i += 1;
            } else {
                int b = utf8[i];
                while ((b & 0x80) > 0) {
                    ++i;
                    b = b << 1;
                }
            }
            if (i <= charLimit) {
                n16 += (extraLong) ? 2 : 1;
            }
        }
        return s.substring(0, n16);
    }
    

    I still thought this was far from efficient. So if you don't really need the String representation of the result and the byte array will do, you can use this:

    private byte[] cutToBytes(String s, int charLimit) throws UnsupportedEncodingException {
        byte[] utf8 = s.getBytes("UTF-8");
        if (utf8.length <= charLimit) {
            return utf8;
        }
        if ((utf8[charLimit] & 0xC0) != 0x80) {
            // the first dropped byte is not a continuation byte (10xxxxxx),
            // so the limit doesn't cut a UTF-8 sequence
            return Arrays.copyOf(utf8, charLimit);
        }
        int i = 0;
        while ((utf8[charLimit-i-1] & 0x80) > 0 && (utf8[charLimit-i-1] & 0x40) == 0) {
            ++i;
        }
        if ((utf8[charLimit-i-1] & 0x80) > 0) {
            // we have to skip the starter UTF-8 byte
            return Arrays.copyOf(utf8, charLimit-i-1);
        } else {
            // we passed all UTF-8 bytes
            return Arrays.copyOf(utf8, charLimit-i);
        }
    }
    

    Funny thing is that with a realistic 20-500 byte limit they perform pretty much the same IF you create a string from the byte array again.
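
    For the case where a String is needed after all, the following is a minimal sketch of cutting the bytes and decoding them again. It is not either of the methods above: it uses a simpler variant that backs up over continuation bytes (10xxxxxx) from the limit. The names Utf8TruncateDemo, truncateToBytes and maxBytes are hypothetical, and it assumes StandardCharsets.UTF_8 instead of the "UTF-8" charset name so that no checked exception is needed.

```java
import java.nio.charset.StandardCharsets;

final class Utf8TruncateDemo {
    /** Hypothetical helper: longest prefix of s whose UTF-8 encoding fits in maxBytes. */
    static String truncateToBytes(String s, int maxBytes) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        if (utf8.length <= maxBytes) {
            return s;
        }
        int end = Math.max(maxBytes, 0);
        // utf8[end] is the first byte that would be dropped. If it is a
        // continuation byte, the cut is inside a sequence, so move the cut
        // back until it sits on the lead byte of that sequence.
        while (end > 0 && (utf8[end] & 0xC0) == 0x80) {
            end--;
        }
        // Decode only the kept prefix back into a String
        return new String(utf8, 0, end, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The accented letter occupies bytes 9 and 10, so a 10-byte limit drops it
        System.out.println(truncateToBytes("123456789\u00E11", 10)); // 123456789
    }
}
```

    With a limit of 11 the same input keeps the accented letter, because the first dropped byte is then the ASCII "1".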

    Please note that both methods assume valid UTF-8 input, which is a safe assumption after using Java's getBytes() function.
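
    If you want to check that a truncated byte array is still well-formed, one option is a strict decoder. The sketch below is only an illustration: the class name Utf8ValidityCheck and the helper isValidUtf8 are hypothetical, and it relies on the standard CharsetDecoder with CodingErrorAction.REPORT, which throws instead of silently substituting replacement characters.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class Utf8ValidityCheck {
    /** Hypothetical helper: true if the bytes decode as UTF-8 without malformed input. */
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            // Thrown for malformed input, including a sequence cut off at the end
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] whole = "\u00E1".getBytes(StandardCharsets.UTF_8); // C3 A1
        byte[] cut = {whole[0]};                                  // C3 only
        System.out.println(isValidUtf8(whole)); // true
        System.out.println(isValidUtf8(cut));   // false
    }
}
```

    A check like this is handy in unit tests for either truncation method, since a cut in the middle of a sequence shows up as a failed decode.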
