Efficient way to calculate byte length of a character, depending on the encoding

予麋鹿 2021-02-06 00:48

What's the most efficient way to calculate the byte length of a character, taking the character encoding into account? The encoding is known only at runtime. In UTF-8

4 answers
  •  渐次进展
    2021-02-06 01:12

    Use a CharsetEncoder and reuse a CharBuffer as input and a ByteBuffer as output.

    On my system, the following code takes 25 seconds to encode 100,000,000 single characters (10,000 passes over 10,000 characters):

    Charset utf8 = Charset.forName("UTF-8");
    char[] array = new char[1];
    for (int reps = 0; reps < 10000; reps++) {
        for (array[0] = 0; array[0] < 10000; array[0]++) {
            int len = new String(array).getBytes(utf8).length;
        }
    }
    

    However, the following code does the same thing in under 4 seconds:

    Charset utf8 = Charset.forName("UTF-8");
    CharsetEncoder encoder = utf8.newEncoder();
    char[] array = new char[1];
    CharBuffer input = CharBuffer.wrap(array);
    ByteBuffer output = ByteBuffer.allocate(10);
    for (int reps = 0; reps < 10000; reps++) {
        for (array[0] = 0; array[0] < 10000; array[0]++) {
            output.clear();
            input.clear();
            encoder.encode(input, output, false);
            int len = output.position();
        }
    }
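
To make the reuse idea concrete for an encoding that is only known at runtime, here is a minimal sketch. The class name CharByteLength and the -1 return convention are assumptions made for illustration, not part of the answer above. It keeps one CharsetEncoder and two small buffers per charset, so measuring a char allocates nothing per call.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;

// Hypothetical helper: one instance per charset, reused for many chars.
// Not thread-safe, because the encoder and buffers are shared state.
final class CharByteLength {
    private final CharsetEncoder encoder;
    private final char[] array = new char[1];
    private final CharBuffer input = CharBuffer.wrap(array);
    private final ByteBuffer output;

    // charsetName is resolved at runtime, e.g. "UTF-8" or "ISO-8859-1".
    CharByteLength(String charsetName) {
        this.encoder = Charset.forName(charsetName).newEncoder();
        // Leave some headroom above the encoder's advertised per-char maximum.
        this.output = ByteBuffer.allocate((int) Math.ceil(encoder.maxBytesPerChar()) * 2);
    }

    // Returns the number of bytes c encodes to, or -1 (an assumed convention)
    // if c cannot be encoded on its own: a lone surrogate, or a char the
    // charset cannot represent.
    int byteLength(char c) {
        array[0] = c;
        input.clear();
        output.clear();
        encoder.reset();
        CoderResult result = encoder.encode(input, output, true);
        if (result.isError()) {
            return -1;
        }
        encoder.flush(output); // lets stateful encoders write any trailing bytes
        return output.position();
    }
}
```

Unlike new String(array).getBytes(charset), which substitutes a replacement byte sequence for unencodable input, this sketch reports such characters with -1, so the two approaches can disagree on those inputs.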
    

    Edit: Why do haters gotta hate?

    Here's a solution that reads from a CharBuffer and keeps track of surrogate pairs:

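    // Needs java.nio.ByteBuffer, java.nio.CharBuffer,
    // java.nio.charset.Charset and java.nio.charset.CharsetEncoder
    Charset utf8 = Charset.forName("UTF-8");
    CharsetEncoder encoder = utf8.newEncoder();
    CharBuffer input = //allocate in some way, or pass as parameter
    ByteBuffer output = ByteBuffer.allocate(10);
    
    int limit = input.limit();
    while (input.position() < limit) {
        output.clear();
        input.mark();
        // Expose at most two chars, never going past the original limit
        // (Math.max here could set a limit beyond the buffer's capacity)
        input.limit(Math.min(input.position() + 2, limit));
        // hasRemaining() guards against a high surrogate at the very end
        if (Character.isHighSurrogate(input.get())
                && !(input.hasRemaining() && Character.isLowSurrogate(input.get()))) {
            //Malformed surrogate pair; do something (e.g. throw or skip it),
            //because encode() below will not consume it and the loop would stall
        }
        // Narrow the window to the one or two chars just read, then rewind
        input.limit(input.position());
        input.reset();
        encoder.encode(input, output, false);
        int encodedLen = output.position();
    }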
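
As an alternative to tracking surrogates by hand, the same windowing can be sketched with Character.codePointAt and Character.charCount, which determine whether the next code point spans one or two chars. The class and method names below are hypothetical, and throwing IllegalArgumentException for malformed or unmappable input is an assumed policy rather than anything the answer prescribes.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;

// Hypothetical helper, shown only to illustrate the per-code-point loop.
final class CodePointByteLengths {
    // Returns one entry per code point of text: its encoded length in the
    // charset named by charsetName (resolved at runtime).
    static int[] byteLengths(CharSequence text, String charsetName) {
        CharsetEncoder encoder = Charset.forName(charsetName).newEncoder();
        CharBuffer input = CharBuffer.wrap(text);
        ByteBuffer output = ByteBuffer.allocate((int) Math.ceil(encoder.maxBytesPerChar()) * 2);
        int[] lengths = new int[Character.codePointCount(text, 0, text.length())];

        int end = input.limit();
        int n = 0;
        while (input.position() < end) {
            int start = input.position();
            // 2 for a valid surrogate pair, otherwise 1 (including lone surrogates)
            int width = Character.charCount(Character.codePointAt(text, start));
            input.limit(start + width); // window covers exactly one code point
            output.clear();
            encoder.reset();
            CoderResult result = encoder.encode(input, output, true);
            if (result.isError()) {
                // Assumed policy: fail fast on lone surrogates or unmappable input
                throw new IllegalArgumentException(
                        "Cannot encode code point at index " + start + ": " + result);
            }
            lengths[n++] = output.position();
            input.limit(end); // restore the full window for the next iteration
        }
        return lengths;
    }
}
```

Because the encoder is reset for every code point, charsets that write a byte-order mark at the start of output (plain "UTF-16", for example) would probably include it in every length; naming the variant explicitly, such as "UTF-16BE", avoids that.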
    
