What occurs when a string is converted to a byte array

Posted by 风格不统一 on 2019-12-06 14:24:28

In Java, strings are stored as an array of 16-bit char values. Each Unicode character in the string is stored as one or (rarely) two char values in the array.
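To make the "one or (rarely) two char values" point concrete, here is a minimal sketch. The class and method names are made up for illustration; only `String.length()`, `String.codePointCount` and `Character.toChars` are standard JDK calls.

```java
class CharUnits {
    // Number of 16-bit char values used to store the string.
    static int charCount(String s) {
        return s.length();
    }

    // Number of Unicode characters (code points) in the string.
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String plain = "A";
        // U+1F600 lies outside the Basic Multilingual Plane,
        // so Java stores it as a surrogate pair of two char values.
        String emoji = new String(Character.toChars(0x1F600));

        System.out.println(charCount(plain) + " char(s), " + codePointCount(plain) + " character(s)");
        System.out.println(charCount(emoji) + " char(s), " + codePointCount(emoji) + " character(s)");
    }
}
```

The first line reports 1 char for 1 character; the second reports 2 chars for 1 character.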

If you want to store some string data in a byte array, you will need to be able to convert the string's Unicode characters into a sequence of bytes. This process is called encoding and there are several ways to do it, each with different rules and results. If two pieces of code want to share string data using byte arrays, they need to agree on which encoding is being used.

For example, suppose we have a string s that we want to encode using the UTF-8 encoding. UTF-8 has the convenient property that if you use it to encode a string that contains only ASCII characters, every character in the input gets converted to a single byte with that character's ASCII value. We might convert our Java string to a Java byte array as follows:

byte[] bytes = s.getBytes("UTF-8");

The byte array bytes now contains the string data from s, encoded into bytes using the UTF-8 encoding.

Now, we store or transmit the bytes somewhere, and the code on the other end wants to decode the bytes back into a Java String. It will do something like the following:

String t = new String(bytes, "UTF-8");

Assuming nothing went wrong, the string t now contains the same string data as the original string s.
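Putting the two halves together, a round trip might look like the sketch below. It uses `StandardCharsets.UTF_8` instead of the string name "UTF-8"; that is a substitution on my part, chosen because the name-based overloads throw the checked `UnsupportedEncodingException` and the constant-based ones do not. The class and method names are illustrative only.

```java
import java.nio.charset.StandardCharsets;

class RoundTrip {
    // Encode the string's characters into UTF-8 bytes.
    static byte[] encode(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Decode UTF-8 bytes back into a string.
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String s = "h\u00e9llo"; // "héllo", written with an escape so the source file stays ASCII
        byte[] bytes = encode(s);

        // 'é' (U+00E9) is not ASCII, so UTF-8 needs two bytes for it:
        // 5 characters become 6 bytes.
        System.out.println(s.length() + " chars -> " + bytes.length + " bytes");

        String t = decode(bytes);
        System.out.println(s.equals(t)); // true
    }
}
```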

Note that both pieces of code had to agree on what encoding was being used. If they disagreed, the resulting string might end up containing garbage, or might even fail to decode at all.
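Both failure modes can be reproduced in a few lines. This is a sketch with made-up class and method names. One detail worth knowing: `new String(bytes, charset)` never throws and silently substitutes U+FFFD for malformed input, so to actually see a decoding failure you need a `CharsetDecoder` configured to report errors.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class EncodingMismatch {
    // Lenient decoding: malformed input is replaced with U+FFFD, no exception.
    static String lenientUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Strict decoding: malformed or unmappable input raises CharacterCodingException.
    static String strictUtf8(byte[] bytes) throws CharacterCodingException {
        return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    public static void main(String[] args) {
        String s = "caf\u00e9"; // "café"

        // Case 1: written as UTF-8, read as ISO-8859-1 -> garbage ("cafÃ©").
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        String garbled = new String(utf8, StandardCharsets.ISO_8859_1);
        System.out.println(garbled.equals(s)); // false
        System.out.println(garbled.length());  // 5, one char more than the original

        // Case 2: written as ISO-8859-1, read as UTF-8 -> not valid UTF-8 at all.
        byte[] latin1 = s.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(lenientUtf8(latin1).contains("\uFFFD")); // true
        try {
            strictUtf8(latin1);
            System.out.println("decoded");
        } catch (CharacterCodingException e) {
            System.out.println("failed to decode");
        }
    }
}
```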

A string is encoded into a byte array according to a charset. Depending on the charset, each character may be encoded using more or fewer bits, which are then packed into bytes.

For example, if you only need to represent digits (10 different characters), you could use a charset that defines 4 bits per character, giving a representation with 2 characters per byte. String-to-byte-array encoders often default to the charset of the operating system. To get the string back, you have to decode the bytes with the same charset.
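The 4-bits-per-digit idea can be sketched by hand. Note that this is not a real `java.nio.charset.Charset`; it is a hypothetical packing scheme written out for illustration, and the choice of 0xF as a filler nibble for odd-length input is an arbitrary assumption of this sketch.

```java
class DigitPacking {
    // Packs ASCII digits two per byte; the high nibble holds the first digit.
    // For odd-length input, the last low nibble is left as the filler 0xF.
    static byte[] pack(String digits) {
        byte[] out = new byte[(digits.length() + 1) / 2];
        for (int i = 0; i < digits.length(); i++) {
            char c = digits.charAt(i);
            if (c < '0' || c > '9') {
                throw new IllegalArgumentException("not a digit: " + c);
            }
            int nibble = c - '0';
            if (i % 2 == 0) {
                // First digit of the pair: high nibble, filler in the low nibble for now.
                out[i / 2] = (byte) ((nibble << 4) | 0x0F);
            } else {
                // Second digit of the pair: overwrite the filler in the low nibble.
                out[i / 2] = (byte) ((out[i / 2] & 0xF0) | nibble);
            }
        }
        return out;
    }

    // Reverses pack(): reads two digits per byte, skipping a trailing filler nibble.
    static String unpack(byte[] packed) {
        StringBuilder sb = new StringBuilder();
        for (byte b : packed) {
            int hi = (b >> 4) & 0x0F;
            int lo = b & 0x0F;
            sb.append((char) ('0' + hi));
            if (lo != 0x0F) {
                sb.append((char) ('0' + lo));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] packed = pack("20191206");
        System.out.println(packed.length);  // 4 bytes for 8 digits
        System.out.println(unpack(packed)); // 20191206
    }
}
```

As with any encoding, the bytes are only meaningful to a reader that applies the same scheme: decoding them as UTF-8 or ASCII would not give back the digits.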

You are not barking mad. The key thing to remember in all matters String is that, to the computer, characters do not exist; only numbers exist. There is no such thing as a character, String, text or similar that isn't actually implemented through storing numbers (actually, that goes for all data types: booleans are really numbers with a very small range, enums are internally numbers, etc.). This is why it is meaningless to say that a piece of data represents "A" or any other character; you must know what character encoding the surrounding code assumes.
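A small sketch of the "only numbers exist" point, with illustrative class and method names: the same character "A" is the number 65 as a Java char, one byte under US-ASCII, and two bytes under UTF-16BE.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class CharsAreNumbers {
    // A Java char is just a 16-bit number; widening it to int exposes the value.
    static int codeOf(char c) {
        return c;
    }

    // The bytes that represent a string depend entirely on the charset used.
    static byte[] encode(String s, Charset charset) {
        return s.getBytes(charset);
    }

    public static void main(String[] args) {
        System.out.println(codeOf('A')); // 65

        System.out.println(Arrays.toString(encode("A", StandardCharsets.US_ASCII))); // [65]
        System.out.println(Arrays.toString(encode("A", StandardCharsets.UTF_16BE))); // [0, 65]

        // Going the other way, the number 65 only "means" A once a charset is assumed.
        byte[] raw = {65};
        System.out.println(new String(raw, StandardCharsets.US_ASCII)); // A
    }
}
```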

Converting Strings into byte arrays occurs precisely at this boundary between the intentional perspective ("This should print as 'A'") and the internal perspective ("This memory cell contains a 65"). Therefore, to get the right result, you must convert between them according to one of several possible character sets, and choose the right one. Note that the JDK offers convenience methods that do not require a charset name and always use the default charset deduced from your platform and environment variables; but it is almost always a better idea to know what you're doing and state the charset explicitly, rather than code something that works today and mysteriously fails when you execute it on another machine.
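The difference between the convenience method and the explicit form can be seen in a sketch like the following (class and method names are illustrative). The first method's result depends on the machine it runs on, so no particular output is claimed for it; the second always produces UTF-8.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ExplicitCharset {
    // Convenience form: uses whatever Charset.defaultCharset() is on this JVM.
    static byte[] encodeWithDefault(String s) {
        return s.getBytes();
    }

    // Explicit form: the result is the same on every machine.
    static byte[] encodeExplicit(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String s = "\u00e9"; // "é"

        // Platform-dependent: both the charset name and the byte count may vary.
        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("default encoding: " + encodeWithDefault(s).length + " byte(s)");

        // Always 2 bytes, because the charset is stated.
        System.out.println("UTF-8 encoding: " + encodeExplicit(s).length + " byte(s)");
    }
}
```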
