I call a web service that returns an XML response encoded in UTF-8. I verified the encoding in Java using the getAllHeaders() method.
Now, in my java cod
This website provides UTF-to-UTF conversion:
http://www.fileformat.info/convert/text/utf2utf.htm
UTF-32 is arguably the most human-readable of the Unicode encoding forms, because its big-endian hexadecimal representation is simply the Unicode scalar value without the "U+" prefix, zero-padded to eight digits. While a UTF-32 representation does make the programming model somewhat simpler, the increased average storage size has real drawbacks, making a complete transition to UTF-32 less compelling.
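To make the "scalar value padded to eight digits" point concrete, here is a minimal Java sketch. The class and method names are made up for illustration, and it assumes the JVM ships a "UTF-32BE" charset (common JDKs do, but it is not one of the charsets every Java platform is required to support).

```java
import java.nio.charset.Charset;

class Utf32Hex {
    // Encodes one code point as UTF-32BE and returns the bytes as upper-case hex.
    static String utf32BeHex(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        // Assumption: "UTF-32BE" is available on this JVM (it is not a guaranteed charset).
        byte[] bytes = s.getBytes(Charset.forName("UTF-32BE"));
        StringBuilder hex = new StringBuilder();
        for (byte b : bytes) {
            hex.append(String.format("%02X", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        System.out.println(utf32BeHex(0x41));    // U+0041  -> 00000041
        System.out.println(utf32BeHex(0x1F600)); // U+1F600 -> 0001F600
    }
}
```

Each code point always takes four bytes here, which is the storage cost mentioned above.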
However, UTF-32 is the same as the old UCS-4 encoding and remains fixed width. Why can it remain fixed width? Because UTF-16 is now the format that can encode the fewest characters, it sets the limit for all formats. It was defined that 1,112,064 is the total number of code points that will ever be defined by either Unicode or ISO 10646 (the 1,114,112 values from 0 to 10FFFF minus the 2,048 surrogates). Since Unicode is now only defined from 0 to 10FFFF, UTF-32 sounds a bit like a pointless encoding: it is 32 bits wide, but only about 21 bits are ever used, which makes it very wasteful.
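The two numbers in that paragraph can be checked with a few lines of Java, using only constants from java.lang.Character. This is just a sanity-check sketch; the class and method names are illustrative.

```java
class CodePointRange {
    // All code points 0..10FFFF minus the surrogate range D800..DFFF.
    static int scalarValueCount() {
        int allCodePoints = Character.MAX_CODE_POINT + 1;                        // 0x110000
        int surrogates = Character.MAX_SURROGATE - Character.MIN_SURROGATE + 1;  // 0x800
        return allCodePoints - surrogates;
    }

    // Number of bits needed to hold the largest code point, 0x10FFFF.
    static int bitsNeeded() {
        return Integer.SIZE - Integer.numberOfLeadingZeros(Character.MAX_CODE_POINT);
    }

    public static void main(String[] args) {
        System.out.println(scalarValueCount()); // 1112064
        System.out.println(bitsNeeded());       // 21
    }
}
```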
There are two things:
You should not be preoccupied with the second point ;) The thing is to use the appropriate methods to convert from your data (byte arrays) to Strings (char arrays ultimately), and to convert from Strings to your data.
The most basic classes you can think of are CharsetDecoder and CharsetEncoder, but there are plenty of others: String.getBytes() and all Readers and Writers are but two possible methods. And there are all the static methods of Character as well.
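As a rough sketch of the two directions, the snippet below decodes bytes with a CharsetDecoder and encodes with String.getBytes(Charset). The class and method names are hypothetical, and wrapping the checked exception in an IllegalArgumentException is just one possible choice.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class DecodeEncode {
    // bytes -> String: strict decoding, so malformed input fails loudly
    // instead of silently turning into replacement characters.
    static String decodeUtf8(byte[] data) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data))
                    .toString();
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("input is not valid UTF-8", e);
        }
    }

    // String -> bytes: always name the charset explicitly.
    static byte[] encodeUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] bytes = encodeUtf8("h\u00e9llo");                   // 'é' takes two bytes in UTF-8
        System.out.println(bytes.length);                          // 6
        System.out.println(decodeUtf8(bytes).equals("h\u00e9llo")); // true
    }
}
```

Note that nothing here depends on how Java stores the String internally; only the charset named at the byte boundary matters.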
If you see gibberish at some point, it means you failed to decode or encode from the original byte data to Java strings. But again, the fact that Java strings use UTF-16 is not relevant here.
In particular, you should be aware that when you create a Reader or Writer, you should specify the encoding; if you fail to do so, the default JVM encoding will be used, and it may or may not be UTF-8.
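For example, a Reader with an explicit charset might look like the sketch below. The helper name readAll is made up, and the in-memory stream stands in for whatever your real input is. The second call shows the kind of gibberish you get when the same bytes are decoded with the wrong charset.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ExplicitCharsetReader {
    // Reads the whole stream as text, decoding with the charset passed in
    // rather than the JVM default.
    static String readAll(InputStream in, Charset charset) {
        try (Reader reader = new InputStreamReader(in, charset)) {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = {(byte) 0xC3, (byte) 0xA9}; // UTF-8 bytes for U+00E9 (é)
        String right = readAll(new ByteArrayInputStream(data), StandardCharsets.UTF_8);
        String wrong = readAll(new ByteArrayInputStream(data), StandardCharsets.ISO_8859_1);
        System.out.println(right.equals("\u00e9"));       // true: one character, é
        System.out.println(wrong.equals("\u00c3\u00a9")); // true: two characters, the classic mojibake
    }
}
```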
Both UTF-8 and UTF-16 are variable-length encodings. However, in UTF-8 a character occupies a minimum of 8 bits, while in UTF-16 a character occupies at least 16 bits.
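A quick way to see the variable lengths is to encode a few code points and count the bytes. This is only an illustrative sketch with made-up names; UTF-16BE is used so that no byte-order mark is added to the count.

```java
import java.nio.charset.StandardCharsets;

class EncodedLengths {
    // Bytes needed for one code point in UTF-8.
    static int utf8Length(int codePoint) {
        return new String(Character.toChars(codePoint)).getBytes(StandardCharsets.UTF_8).length;
    }

    // Bytes needed for one code point in UTF-16 (big-endian, no BOM).
    static int utf16Length(int codePoint) {
        return new String(Character.toChars(codePoint)).getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        int[] samples = {0x41, 0xE9, 0x20AC, 0x1F600};
        for (int cp : samples) {
            System.out.printf("U+%04X: UTF-8 = %d bytes, UTF-16 = %d bytes%n",
                    cp, utf8Length(cp), utf16Length(cp));
        }
        // U+0041:  UTF-8 = 1, UTF-16 = 2
        // U+00E9:  UTF-8 = 2, UTF-16 = 2
        // U+20AC:  UTF-8 = 3, UTF-16 = 2
        // U+1F600: UTF-8 = 4, UTF-16 = 4
    }
}
```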
Main UTF-8 pros:
Main UTF-8 cons:
Main UTF-16 pros:
Main UTF-16 cons:
In general, UTF-16 is usually better for in-memory representation, while UTF-8 is extremely good for text files and network protocols.