Efficient way to calculate byte length of a character, depending on the encoding

后端未结

关注

 4  884

What\'s the most efficient way to calculate the byte length of a character, taking the character encoding into account? The encoding would be only known during runtime. In UTF-8

相关标签:

4条回答

庸人自扰

2021-02-06 01:10

Try Charset.forName("UTF-8").encode("string").limit(); Might be a bit more efficient, maybe not.

0 讨论(0)
发布评论:

提交评论
- 加载中...

渐次进展

2021-02-06 01:12

Use a CharsetEncoder and reuse a CharBuffer as input and a ByteBuffer as output.

On my system, the following code takes 25 seconds to encode 100,000 single characters:

Charset utf8 = Charset.forName("UTF-8");
char[] array = new char[1];
for (int reps = 0; reps < 10000; reps++) {
    for (array[0] = 0; array[0] < 10000; array[0]++) {
        int len = new String(array).getBytes(utf8).length;
    }
}

However, the following code does the same thing in under 4 seconds:

Charset utf8 = Charset.forName("UTF-8");
CharsetEncoder encoder = utf8.newEncoder();
char[] array = new char[1];
CharBuffer input = CharBuffer.wrap(array);
ByteBuffer output = ByteBuffer.allocate(10);
for (int reps = 0; reps < 10000; reps++) {
    for (array[0] = 0; array[0] < 10000; array[0]++) {
        output.clear();
        input.clear();
        encoder.encode(input, output, false);
        int len = output.position();
    }
}

Edit: Why do haters gotta hate?

Here's a solution that reads from a CharBuffer and keeps track of surrogate pairs:

Charset utf8 = Charset.forName("UTF-8");
CharsetEncoder encoder = utf8.newEncoder();
CharBuffer input = //allocate in some way, or pass as parameter
ByteBuffer output = ByteBuffer.allocate(10);

int limit = input.limit();
while(input.position() < limit) {
    output.clear();
    input.mark();
    input.limit(Math.max(input.position() + 2, input.capacity()));
    if (Character.isHighSurrogate(input.get()) && !Character.isLowSurrogate(input.get())) {
        //Malformed surrogate pair; do something!
    }
    input.limit(input.position());
    input.reset();
    encoder.encode(input, output, false);
    int encodedLen = output.position();
}

0 讨论(0)

走了就别回头了

2021-02-06 01:12

If you can guarantee that the input is well-formed UTF-8, then there's no reason to find code points at all. One of the strengths of UTF-8 is that you can detect the start of a code point from any position in the string. Simply search backwards until you find a byte such that (b & 0xc0) != 0x80, and you've found another character. Since a UTF-8 encoded code point is always 6 bytes or less, you can copy the intermediate bytes into a fixed-length buffer.

Edit: I forgot to mention, even if you don't go with this strategy, it is not sufficient to use a Java "char" to store arbitrary code points since code point values can exceed 0xffff. You need to store code points in an "int".

0 讨论(0)
发布评论:

提交评论
- 加载中...
轻奢々

2021-02-06 01:18

It is possible that an encoding scheme could encode a given character as a variable number of bytes, depending on what comes before and after it in the character sequence. The byte length you get from encoding a single character String is therefore not the whole answer.

(For example, you could theoretically receive a baudot / teletype characters encoded as 4 characters every 3 bytes, or you could theoretically treat a UTF-16 + a stream compressor as an encoding scheme. Yes, it is all a bit implausible, but ...)

0 讨论(0)
发布评论:

提交评论
- 加载中...