Difference between UTF-8 and UTF-16? Why do we need these?
MessageDigest md = MessageDigest.getInstance(\"SHA-256\");
String text = \"This is some text\";
m
They're simply different schemes for representing Unicode characters.
Both are variable-length - UTF-16 uses 2 bytes for all characters in the basic multilingual plane (BMP) which contains most characters in common use.
UTF-8 uses between 1 and 3 bytes for characters in the BMP, up to 4 for characters in the current Unicode range of U+0000 to U+1FFFFF, and is extensible up to U+7FFFFFFF if that ever becomes necessary... but notably all ASCII characters are represented in a single byte each.
For the purposes of a message digest it won't matter which of these you pick, so long as everyone who tries to recreate the digest uses the same option.
See this page for more about UTF-8 and Unicode.
(Note that all Java characters are UTF-16 code points within the BMP; to represent characters above U+FFFF you need to use surrogate pairs in Java.)
Simple way to differentiate UTF-8 and UTF-16 is to identify commonalities between them.
Other than sharing same unicode number for given character, each one is their own format.
UTF-8 try to represent, every unicode number given to character with one byte(If it is ASCII), else 2 two bytes, else 4 bytes and so on...
UTF-16 try to represent, every unicode number given to character with two byte to start with. If two bytes are not sufficient, then uses 4 bytes. IF that is also not sufficient, then uses 6 bytes.
Theoretically, UTF-16 is more space efficient, but in practical UTF-8 is more space efficient as most of the characters(98% of data) for processing are ASCII and UTF-8 try to represent them with single byte and UTF-16 try to represent them with 2 bytes.
Also, UTF-8 is superset of ASCII encoding. So every app that expects ASCII data would also accepted by UTF-8 processor. This is not true for UTF-16. UTF-16 could not understand ASCII, and this is big hurdle for UTF-16 adoption.
Another point to note is, all UNICODE as of now could be fit in 4 bytes of UTF-8 maximum(Considering all languages of world). This is same as UTF-16 and no real saving in space compared to UTF-8 ( https://stackoverflow.com/a/8505038/3343801 )
So, people use UTF-8 where ever possible.
Difference between UTF-8 and UTF-16? Why do we need these?
There have been at least a couple of security vulnerabilities in implementations of UTF-16. See Wikipedia for details.
WHATWG and W3C have now declared that only UTF-8 is to be used on the Web.
The [security] problems outlined here go away when exclusively using UTF-8, which is one of the many reasons that is now the mandatory encoding for all things.
Other groups are saying the same.
So while UTF-16 may continue being used internally by some systems such as Java and Windows, what little use of UTF-16 you may have seen in the past for data files, data exchange, and such, will likely fade away entirely.
I believe there are a lot of good articles about this around the Web, but here is a short summary.
Both UTF-8 and UTF-16 are variable length encodings. However, in UTF-8 a character may occupy a minimum of 8 bits, while in UTF-16 character length starts with 16 bits.
Main UTF-8 pros:
Main UTF-8 cons:
Main UTF-16 pros:
char
as the primitive component of the string.Main UTF-16 cons:
In general, UTF-16 is usually better for in-memory representation because BE/LE is irrelevant there (just use native order) and indexing is faster (just don't forget to handle surrogate pairs properly). UTF-8, on the other hand, is extremely good for text files and network protocols because there is no BE/LE issue and null-termination often comes in handy, as well as ASCII-compatibility.
This is unrelated to UTF-8/16 (in general, although it does convert to UTF16 and the BE/LE part can be set w/ a single line), yet below is the fastest way to convert String to byte[]. For instance: good exactly for the case provided (hash code). String.getBytes(enc) is relatively slow.
static byte[] toBytes(String s){
byte[] b=new byte[s.length()*2];
ByteBuffer.wrap(b).asCharBuffer().put(s);
return b;
}