What's the most efficient way to calculate the byte length of a character, taking the character encoding into account? The encoding is known only at runtime. In UTF-8
Use a CharsetEncoder and reuse a CharBuffer as input and a ByteBuffer as output.
On my system, the following code takes 25 seconds to encode 100 million single characters (10,000 passes over 10,000 characters each):
    import java.nio.charset.Charset;

    Charset utf8 = Charset.forName("UTF-8");
    char[] array = new char[1];
    for (int reps = 0; reps < 10000; reps++) {
        for (array[0] = 0; array[0] < 10000; array[0]++) {
            // Allocates a new String and a new byte[] on every iteration.
            int len = new String(array).getBytes(utf8).length;
        }
    }
However, the following code does the same thing in under 4 seconds:
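    import java.nio.ByteBuffer;
    import java.nio.CharBuffer;
    import java.nio.charset.Charset;
    import java.nio.charset.CharsetEncoder;

    Charset utf8 = Charset.forName("UTF-8");
    CharsetEncoder encoder = utf8.newEncoder();
    char[] array = new char[1];
    // The CharBuffer wraps the array, so writing array[0] changes what gets encoded.
    CharBuffer input = CharBuffer.wrap(array);
    ByteBuffer output = ByteBuffer.allocate(10);
    for (int reps = 0; reps < 10000; reps++) {
        for (array[0] = 0; array[0] < 10000; array[0]++) {
            // Rewind both buffers instead of allocating new ones.
            output.clear();
            input.clear();
            encoder.encode(input, output, false);
            // The number of bytes written is the encoded length of the char.
            int len = output.position();
        }
    }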
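Since the question says the encoding is only known at runtime, the same idea can be wrapped in a small helper that takes the charset name as a parameter. This is only a minimal sketch: the class name `EncodedLength` and the method `byteLength` are made up for illustration, and it assumes the encoder's default behaviour of reporting malformed or unmappable input, which is mapped to -1 here. Note that for charsets whose encoder writes a byte-order mark (such as "UTF-16"), the mark is counted on every call because the encoder is reset each time.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;

// Hypothetical helper (not a JDK class): measures the encoded size of a
// single char while reusing one encoder and one pair of buffers.
final class EncodedLength {
    private final CharsetEncoder encoder;
    private final char[] array = new char[1];
    private final CharBuffer input = CharBuffer.wrap(array);
    private final ByteBuffer output;

    EncodedLength(String charsetName) {
        // The charset is looked up by name, so it can be chosen at runtime.
        encoder = Charset.forName(charsetName).newEncoder();
        // Size the output from the encoder's own worst-case estimate.
        output = ByteBuffer.allocate((int) Math.ceil(encoder.maxBytesPerChar() * 2));
    }

    // Returns the number of bytes c encodes to, or -1 if c cannot be
    // encoded on its own (unpaired surrogate or unmappable character).
    int byteLength(char c) {
        array[0] = c;
        input.clear();
        output.clear();
        encoder.reset();
        CoderResult result = encoder.encode(input, output, true);
        return result.isError() ? -1 : output.position();
    }
}
```

Usage would look like `new EncodedLength(name).byteLength(ch)`, keeping one instance per charset so the buffers are actually reused.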
Here's a solution that reads from a CharBuffer and keeps track of surrogate pairs:
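    import java.nio.ByteBuffer;
    import java.nio.CharBuffer;
    import java.nio.charset.Charset;
    import java.nio.charset.CharsetEncoder;

    Charset utf8 = Charset.forName("UTF-8");
    CharsetEncoder encoder = utf8.newEncoder();
    CharBuffer input = //allocate in some way, or pass as parameter
    ByteBuffer output = ByteBuffer.allocate(10);
    int limit = input.limit();
    while (input.position() < limit) {
        output.clear();
        input.mark();
        // Expose at most two chars (one possible surrogate pair), never past the original limit.
        input.limit(Math.min(input.position() + 2, limit));
        // Only read a second char if the first is a high surrogate and one is available.
        if (Character.isHighSurrogate(input.get())
                && !(input.hasRemaining() && Character.isLowSurrogate(input.get()))) {
            //Malformed surrogate pair; do something!
        }
        // Narrow the window to the chars just read, rewind to the mark, and encode them.
        input.limit(input.position());
        input.reset();
        encoder.encode(input, output, false);
        int encodedLen = output.position();
    }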
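For comparison, here is a self-contained sketch of the same per-code-point measurement that can be compiled as-is. It swaps the manual surrogate checks for `Character.codePointAt` and `Character.charCount`, and the class name `CodePointLengths` and method `byteLengths` are hypothetical. It assumes the encoder's default error action (report), so an unpaired surrogate or unmappable character yields -1, and it does not call `flush`, so stateful encodings may need extra handling.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;

// Hypothetical helper (not a JDK class).
final class CodePointLengths {
    // Returns the encoded byte length of each code point in text, or -1
    // for an element that cannot be encoded (e.g. an unpaired surrogate).
    static int[] byteLengths(CharSequence text, Charset charset) {
        CharsetEncoder encoder = charset.newEncoder();
        CharBuffer input = CharBuffer.wrap(text);
        // Room for one surrogate pair at the encoder's worst-case size.
        ByteBuffer output =
                ByteBuffer.allocate((int) Math.ceil(encoder.maxBytesPerChar() * 2));
        int[] lengths = new int[Character.codePointCount(text, 0, text.length())];
        int limit = input.limit();
        int index = 0;
        int pos = 0;
        while (pos < limit) {
            // 2 for a valid surrogate pair, otherwise 1.
            int width = Character.charCount(Character.codePointAt(text, pos));
            // Restrict the buffer to exactly this code point.
            input.limit(pos + width);
            input.position(pos);
            output.clear();
            encoder.reset();
            CoderResult result = encoder.encode(input, output, true);
            lengths[index++] = result.isError() ? -1 : output.position();
            pos += width;
        }
        return lengths;
    }
}
```

Setting the position explicitly on each pass means the loop still advances when the encoder refuses to consume a malformed char.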