Will String.getBytes("UTF-16") return the same result on all platforms?

执念已碎 2021-01-18 17:46

I need to create a hash from a String containing a user's password. To create the hash, I use a byte array which I get by calling String.getBytes(). But when I call getBytes("UTF-16"), can I rely on the result being the same on all platforms?

3 Answers
  •  迷失自我
    2021-01-18 18:09

    It is true, Java uses Unicode internally, so it can combine any script/language. String and char use UTF-16 (.class files store their String constants in UTF-8). In general it is irrelevant what String does internally, because the conversion to bytes specifies the encoding the bytes have to be in.
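    Because the target encoding is given explicitly, the resulting bytes are the same on every JVM; only the no-argument getBytes() depends on the platform default charset. A minimal sketch (the password value is just an example):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class EncodingDemo {
    public static void main(String[] args) {
        String password = "geheim";

        // Platform-dependent: uses the default charset (file.encoding),
        // so these bytes can differ between machines.
        byte[] platformBytes = password.getBytes();

        // Platform-independent: the charset is fixed, so these bytes
        // are identical on every JVM.
        byte[] utf8Bytes = password.getBytes(StandardCharsets.UTF_8);
        byte[] utf16Bytes = password.getBytes(StandardCharsets.UTF_16);

        System.out.println(Arrays.toString(utf8Bytes));
        // "UTF-16" prepends a 2-byte BOM: 2 + 6 chars * 2 bytes = 14
        System.out.println(utf16Bytes.length);
    }
}
```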

    If the target encoding cannot represent some of the Unicode characters, a placeholder character (typically a question mark) is substituted. Also, fonts might not contain all Unicode characters; 35 MB is a normal size for a full Unicode font. For missing code points you might then see a square with 2x2 hex digits, or on Linux another font might substitute the glyph.
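    This substitution is silent, which is exactly why a lossy encoding is dangerous for password hashing: two different passwords can map to the same bytes. A small demonstration using the euro sign, which ISO-8859-1 cannot represent:

```java
import java.nio.charset.StandardCharsets;

public class LossyEncoding {
    public static void main(String[] args) {
        // U+20AC (euro sign) does not exist in ISO-8859-1, so getBytes()
        // silently substitutes '?' (0x3F) for it.
        String s = "price: 5\u20AC";
        byte[] latin1 = s.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(new String(latin1, StandardCharsets.ISO_8859_1)); // price: 5?

        // UTF-8 can represent it: the euro sign becomes 3 bytes.
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        System.out.println(utf8.length - latin1.length); // 2 extra bytes
    }
}
```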

    Hence UTF-8 is a perfectly fine choice.

    String s = ...;
    if (!s.startsWith("\uFEFF")) { // prepend a Unicode BOM if not already present
        s = "\uFEFF" + s;
    }
    byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
    

    Both UTF-16 (in either byte order) and UTF-8 are always present in the JRE, whereas some other charsets are not. Hence you can use a constant from StandardCharsets and never have to handle an UnsupportedEncodingException.

    Above I added a BOM specifically so that Windows Notepad recognizes the text as UTF-8. It is certainly not good practice, but it is a small help here.

    There is no real disadvantage to UTF-16LE or UTF-16BE either. I think UTF-8 is a bit more universally used, as UTF-16 also cannot store all Unicode code points in 16 bits (code points above U+FFFF need surrogate pairs). Text in Asian scripts would be more compact in UTF-16, but HTML pages are usually smaller in UTF-8 because of the HTML tags and other Latin-script content.

    For Windows UTF-16LE might be more native.
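    To answer the title question directly: all three UTF-16 charsets are deterministic across platforms, but they produce different byte sequences from one another. The "UTF-16" charset prepends a big-endian BOM on encoding, while the explicit LE/BE variants write no BOM, so for hashing you should pick one variant and stick with it:

```java
import java.nio.charset.StandardCharsets;

public class Utf16Variants {
    public static void main(String[] args) {
        String s = "A"; // U+0041

        // "UTF-16" writes a big-endian BOM (FE FF) before the data.
        byte[] utf16 = s.getBytes(StandardCharsets.UTF_16);   // FE FF 00 41
        // The explicit variants write no BOM, just the code units.
        byte[] be = s.getBytes(StandardCharsets.UTF_16BE);    // 00 41
        byte[] le = s.getBytes(StandardCharsets.UTF_16LE);    // 41 00

        System.out.printf("%d %d %d%n", utf16.length, be.length, le.length); // 4 2 2
    }
}
```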

    Problems with placeholder characters can still occur on platforms whose default charset is not Unicode-based, especially Windows, if you forget to pass an explicit charset.
