I need to encode/decode UTF-16 byte arrays to and from java.lang.String. The byte arrays are given to me with a Byte Order Mark (BOM), and I need to encode byte arrays with a BOM as well.
The "UTF-16" charset name will always encode with a BOM and will decode data using either big/little endianness, but "UnicodeBig" and "UnicodeLittle" are useful for encoding in a specific byte order. Use UTF-16LE or UTF-16BE for no BOM - see this post for how to use "\uFEFF" to handle BOMs manually. See here for canonical naming of charset string names or (preferably) the Charset class. Also take note that only a limited subset of encodings are absolutely required to be supported.
This is an old question, but I still couldn't find an acceptable answer for my situation. Basically, Java doesn't have a built-in encoder for UTF-16LE with a BOM, so you have to roll your own implementation.
Here's what I ended up with:
// Requires java.nio.ByteBuffer and java.nio.charset.Charset.
private byte[] encodeUTF16LEWithBOM(final String s) {
    // Encode without a BOM, then prepend the little-endian BOM bytes.
    ByteBuffer content = Charset.forName("UTF-16LE").encode(s);
    byte[] bom = { (byte) 0xff, (byte) 0xfe };
    // Use remaining(), not capacity(): the encoder may return a buffer larger than the
    // encoded content, and capacity() would leave trailing zero bytes in the result.
    return ByteBuffer.allocate(bom.length + content.remaining()).put(bom).put(content).array();
}
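A quick (hypothetical) round-trip check: because the "UTF-16" decoder detects and consumes the BOM, the helper above can be verified like this:

byte[] bytes = encodeUTF16LEWithBOM("héllo");
String back  = new String(bytes, Charset.forName("UTF-16"));
// back.equals("héllo") is true; the leading FF FE is treated as a BOM, not as text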
First off, for decoding you can use the character set "UTF-16"; it automatically detects an initial BOM. For encoding UTF-16BE you can also use the "UTF-16" character set: it will write a proper BOM and then output big-endian data.
For encoding to little endian with a BOM, I don't think your current code is too bad, even with the double allocation (unless your strings are truly monstrous). If they are, you might want to work with a java.nio ByteBuffer rather than a byte array, and use the java.nio.charset.CharsetEncoder class (which you can get from Charset.forName("UTF-16LE").newEncoder()); a sketch of that follows.
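Here's a minimal sketch of that CharsetEncoder idea (my own illustration, assuming the input String is s; the CoderResult return values, which carry error information, are ignored for brevity). Everything goes into one pre-sized ByteBuffer, so no intermediate byte[] is created:

CharsetEncoder encoder = Charset.forName("UTF-16LE").newEncoder();
ByteBuffer out = ByteBuffer.allocate(2 + s.length() * 2); // BOM + 2 bytes per char
out.put((byte) 0xFF).put((byte) 0xFE);                    // little-endian BOM
encoder.encode(CharBuffer.wrap(s), out, true);            // encode directly into the buffer
encoder.flush(out);
out.flip(); // out now spans the BOM plus the UTF-16LE bytes

If a byte[] is still required at the end, you are back to one copy, but the encoding itself stays in a single buffer.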
ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream(string.length() * 2 + 2);
byteArrayOutputStream.write(new byte[]{(byte) 0xFF, (byte) 0xFE}, 0, 2); // little-endian BOM
byte[] utf16le = string.getBytes(StandardCharsets.UTF_16LE);             // UTF-16LE adds no BOM of its own
byteArrayOutputStream.write(utf16le, 0, utf16le.length);
return byteArrayOutputStream.toByteArray();
EDIT: Rereading your question, I see you would rather avoid the double array allocation altogether. Unfortunately the API doesn't give you that, as far as I know. (The deprecated String.getBytes(int, int, byte[], int) writes into an existing array, but it doesn't let you specify an encoding.)
I wrote the above before I saw your comment. I think the suggestion to use the nio classes is on the right track; I was looking at that, but I'm not familiar enough with the API to know offhand how to get it done.
This is how you do it in nio:
ByteBuffer content = Charset.forName("UTF-16LE").encode(message);
// Prepend the BOM in a fresh buffer; writing it at positions 0 and 1 of the encoded
// buffer would overwrite the first character's bytes instead of adding the BOM.
ByteBuffer withBom = ByteBuffer.allocate(2 + content.remaining());
withBom.put((byte) 0xFF).put((byte) 0xFE).put(content);
return withBom.array();
It is supposed to be faster; I don't know how many arrays it creates under the covers, but my understanding is that minimizing that is the point of the API.