Efficient way to calculate byte length of a character, depending on the encoding

予麋鹿 2021-02-06 00:48

What's the most efficient way to calculate the byte length of a character, taking the character encoding into account? The encoding is known only at runtime. In UTF-8

4 Answers
  • 2021-02-06 01:10

    Try Charset.forName("UTF-8").encode("string").limit(). It might be a bit more efficient than String.getBytes, or it might not; measure before relying on it.
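
A minimal sketch of that suggestion, assuming the charset name arrives as a string at runtime. The class and method names (EncodeLimitExample, byteLength) are hypothetical. Note that Charset.encode silently substitutes a replacement byte for characters the charset cannot represent, so the result is the length of what would actually be written, not a validity check.

```java
import java.nio.charset.Charset;

final class EncodeLimitExample {
    /** Byte length of text in the charset named at runtime (hypothetical helper). */
    static int byteLength(String text, String charsetName) {
        Charset charset = Charset.forName(charsetName); // resolved at runtime
        // encode() returns a buffer positioned at 0 whose limit is the number of bytes written
        return charset.encode(text).limit();
    }
}
```

For example, byteLength("\u20ac", "UTF-8") measures the euro sign in UTF-8, and the same call with "ISO-8859-1" measures "\u00e9" as a single byte.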

  • 2021-02-06 01:12

    Use a CharsetEncoder and reuse a CharBuffer as input and a ByteBuffer as output.

    On my system, the following code takes 25 seconds to encode 100,000,000 single characters (10,000 repetitions of 10,000 characters):

    Charset utf8 = Charset.forName("UTF-8");
    char[] array = new char[1];
    for (int reps = 0; reps < 10000; reps++) {
        for (array[0] = 0; array[0] < 10000; array[0]++) {
            int len = new String(array).getBytes(utf8).length;
        }
    }
    

    However, the following code does the same thing in under 4 seconds:

    Charset utf8 = Charset.forName("UTF-8");
    CharsetEncoder encoder = utf8.newEncoder();
    char[] array = new char[1];
    CharBuffer input = CharBuffer.wrap(array);
    ByteBuffer output = ByteBuffer.allocate(10);
    for (int reps = 0; reps < 10000; reps++) {
        for (array[0] = 0; array[0] < 10000; array[0]++) {
            output.clear();
            input.clear();
            encoder.encode(input, output, false);
            int len = output.position();
        }
    }
    

    Edit: Why do haters gotta hate?

    Here's a solution that reads from a CharBuffer and keeps track of surrogate pairs:

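    Charset utf8 = Charset.forName("UTF-8");
    CharsetEncoder encoder = utf8.newEncoder();
    CharBuffer input = //allocate in some way, or pass as parameter
    ByteBuffer output = ByteBuffer.allocate(10);
    
    int limit = input.limit();
    while(input.position() < limit) {
        output.clear();
        input.mark();
        // Expose at most two chars (one possible surrogate pair), never past the original limit
        input.limit(Math.min(input.position() + 2, limit));
        // A high surrogate must be followed by a low surrogate; check hasRemaining() so a
        // trailing high surrogate does not cause a BufferUnderflowException
        if (Character.isHighSurrogate(input.get())
                && !(input.hasRemaining() && Character.isLowSurrogate(input.get()))) {
            //Malformed surrogate pair; do something! (e.g. throw: the encoder reports
            //an error for it instead of consuming it, so the loop would not advance)
        }
        // Shrink the window to exactly the chars just examined, then rewind to the mark
        input.limit(input.position());
        input.reset();
        encoder.encode(input, output, false);
        int encodedLen = output.position();
    }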
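
The same idea can be wrapped in a small reusable object, sketched here under some assumptions: the class name CodePointByteLength and its method of are hypothetical, the charset is passed in at runtime, and the input is a single code point held in an int. Returning -1 for unencodable input is an arbitrary choice for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;

/** Hypothetical helper: byte length of one code point in a charset chosen at runtime. */
final class CodePointByteLength {
    private final CharsetEncoder encoder;
    private final char[] chars = new char[2]; // room for one surrogate pair
    private final ByteBuffer output;

    CodePointByteLength(Charset charset) {
        this.encoder = charset.newEncoder();
        // maxBytesPerChar() is an upper bound per char; two chars cover a surrogate pair
        this.output = ByteBuffer.allocate((int) Math.ceil(2 * encoder.maxBytesPerChar()));
    }

    /** Encoded length of codePoint, or -1 if the charset cannot encode it. */
    int of(int codePoint) {
        int n = Character.toChars(codePoint, chars, 0); // writes 1 or 2 UTF-16 units
        output.clear();
        encoder.reset(); // each call is measured as a fresh, standalone encoding
        CoderResult result = encoder.encode(CharBuffer.wrap(chars, 0, n), output, true);
        if (result.isError()) {
            return -1; // malformed (lone surrogate) or unmappable in this charset
        }
        encoder.flush(output); // lets stateful encoders write any trailing bytes
        return output.position();
    }
}
```

Because the encoder is reset on every call, the result is the length of the code point encoded in isolation; for charsets that emit a byte-order mark or shift sequences, that includes those extra bytes.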
    
  • 2021-02-06 01:12

    If you can guarantee that the input is well-formed UTF-8, then there's no reason to decode code points at all. One of the strengths of UTF-8 is that you can detect the start of a code point from any position in the string. Simply search backwards until you find a byte such that (b & 0xc0) != 0x80, and you've found the start of a character. Since a UTF-8 encoded code point is at most 4 bytes under the current definition (RFC 3629; the original design allowed up to 6), you can copy the intermediate bytes into a fixed-length buffer.
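
A short sketch of that backward scan, assuming the bytes are already known to be well-formed UTF-8. The class and method names are hypothetical. The second method reads the encoded length directly from the lead byte, which is the other half of the same property.

```java
final class Utf8Boundary {
    /** Index of the first byte of the code point that contains bytes[index]. */
    static int startOfCodePoint(byte[] bytes, int index) {
        int i = index;
        // Continuation bytes have the bit pattern 10xxxxxx; step back over them
        while (i > 0 && (bytes[i] & 0xC0) == 0x80) {
            i--;
        }
        return i;
    }

    /** Number of bytes in the code point that starts with the given lead byte. */
    static int lengthFromLeadByte(byte lead) {
        int b = lead & 0xFF;
        if (b < 0x80) return 1;           // 0xxxxxxx
        if ((b & 0xE0) == 0xC0) return 2; // 110xxxxx
        if ((b & 0xF0) == 0xE0) return 3; // 1110xxxx
        if ((b & 0xF8) == 0xF0) return 4; // 11110xxx
        throw new IllegalArgumentException("not a UTF-8 lead byte: " + b);
    }
}
```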

    Edit: I forgot to mention, even if you don't go with this strategy, it is not sufficient to use a Java "char" to store arbitrary code points since code point values can exceed 0xffff. You need to store code points in an "int".
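
To illustrate the point about int code points, here is a sketch that walks a String by code point rather than by char. The names are hypothetical, and the length arithmetic is specific to UTF-8 and assumes the string contains no unpaired surrogates.

```java
final class CodePointsExample {
    /** UTF-8 length of a single code point, computed from its value. */
    static int utf8Length(int codePoint) {
        if (codePoint < 0x80) return 1;
        if (codePoint < 0x800) return 2;
        if (codePoint < 0x10000) return 3;
        return 4; // supplementary code points, which need two Java chars
    }

    /** UTF-8 length of each code point in s, in order. */
    static int[] utf8Lengths(String s) {
        // codePoints() yields ints and merges surrogate pairs into one value
        return s.codePoints().map(CodePointsExample::utf8Length).toArray();
    }
}
```

A string such as "\uD83D\uDE00" has length() 2 but contains one code point (U+1F600), which is why iterating with char alone gives the wrong unit of measurement.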

  • 2021-02-06 01:18

    It is possible that an encoding scheme could encode a given character as a variable number of bytes, depending on what comes before and after it in the character sequence. The byte length you get from encoding a single character String is therefore not the whole answer.

    (For example, you could theoretically receive Baudot / teletype characters packed as 4 characters in every 3 bytes, or you could theoretically treat UTF-16 plus a stream compressor as an encoding scheme. Yes, it is all a bit implausible, but ...)
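
A less exotic case where the single-character measurement misleads is Java's own "UTF-16" charset, which writes a byte-order mark at the start of every encoding. This sketch (hypothetical class and method names) shows that the byte length of a string is then not the sum of its characters encoded one at a time.

```java
import java.nio.charset.StandardCharsets;

final class ContextDependentLength {
    /** Byte length of s in Java's UTF-16 charset, which prepends a 2-byte BOM. */
    static int utf16Length(String s) {
        return s.getBytes(StandardCharsets.UTF_16).length;
    }
}
```

Here utf16Length("a") is 4 (BOM plus one 2-byte unit) while utf16Length("aa") is 6, not 8, so per-character results cannot simply be added up for this charset.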
