Java Unicode Confusion


无人及你

Hey all, I have only just started learning Java and have run into something that is really confusing!

I was typing out an example from the book I am using

5 Answers

时光取名叫无心

2020-12-18 05:38

The \u00ab character is not the ½ character; it is « (LEFT-POINTING DOUBLE ANGLE QUOTATION MARK). See the definitive code charts on the Unicode.org website.

What you are seeing is (I think) a consequence of using the System.out PrintStream on a platform where the default character encoding is not UTF-8 or Latin-1. Maybe it is some Windows character set, as suggested by @axtavt's answer? (That answer also has a plausible explanation of why \u00ab is displayed as ½ and not as some "splat" character.)

(In Unicode and Latin-1, \u00BD is the code point for the ½ character.)
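
To make the point about System.out's encoding concrete, here is a minimal sketch showing that the same character becomes different bytes depending on the encoding a PrintStream is given. The class and method names (`ExplicitEncodingDemo`, `withEncoding`, `encode`) are made up for illustration, and wrapping System.out like this only helps if the terminal actually decodes the bytes with the same charset.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

class ExplicitEncodingDemo {
    // Hypothetical helper: wraps a stream in a PrintStream with an explicit encoding.
    static PrintStream withEncoding(OutputStream target, String charsetName) {
        try {
            return new PrintStream(target, true, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unsupported charset: " + charsetName, e);
        }
    }

    // Returns the bytes that a PrintStream with the given encoding emits for the text.
    static byte[] encode(String text, String charsetName) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        PrintStream out = withEncoding(buffer, charsetName);
        out.print(text);
        out.flush();
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        // U+00AB is one byte (0xAB) in Latin-1 but two bytes (0xC2 0xAB) in UTF-8.
        System.out.println(encode("\u00AB", "ISO-8859-1").length);
        System.out.println(encode("\u00AB", "UTF-8").length);

        // Assumption: the terminal is configured to decode UTF-8.
        PrintStream utf8Out = withEncoding(System.out, "UTF-8");
        utf8Out.println("\u00AB \u00BD");
    }
}
```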

有刺的猬

2020-12-18 05:57

Well, when I use that code I get « for \u00AB and ½ for \u00BD, as expected.

http://www.unicode.org/charts/

感情败类

2020-12-18 05:59
One great thing about Java is that it is Unicode-based. That means you can use characters from writing systems other than the English alphabet (e.g. Chinese characters or math symbols), not just in string data but in class, method, and variable names too.

Here's an example that uses Unicode characters in class names and variable names.
```
class 方 {
    String 北 = "north";
    double π = 3.14159;
}

class UnicodeTest {
    public static void main(String[] arg) {
        方 x1 = new 方();
        System.out.println( x1.北 );
        System.out.println( x1.π );
    }
}
```
Java was created at a time when the Unicode standard defined a much smaller set of characters. Back then it was felt that 16 bits would be more than enough to encode every character that would ever be needed. With that in mind, Java was designed around 16-bit characters, which today means UTF-16. In fact, the char data type was originally intended to represent a complete 16-bit Unicode code point.
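
A small sketch of what that history means in practice today: a char holds one 16-bit UTF-16 code unit, so a character outside the original 16-bit range takes two chars. The class and helper names here are made up for illustration.

```java
class CodeUnitDemo {
    // Builds a String containing exactly one Unicode code point.
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        String quote = fromCodePoint(0x00AB);   // «, fits in a single char
        String emoji = fromCodePoint(0x1F600);  // beyond 16 bits, stored as a surrogate pair

        System.out.println(quote.length());                          // chars used by «
        System.out.println(emoji.length());                          // chars used by U+1F600
        System.out.println(emoji.codePointCount(0, emoji.length())); // actual code points
        System.out.println(Character.isHighSurrogate(emoji.charAt(0)));
    }
}
```

String.length() counts chars (code units), so use codePointCount when you need the number of actual characters.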

The UTF-8 charset is specified by RFC 2279 (since obsoleted by RFC 3629).

The UTF-16 charsets are specified by RFC 2781.

The UTF-16 charsets use sixteen-bit quantities and are therefore sensitive to byte order. In these encodings the byte order of a stream may be indicated by an initial byte-order mark represented by the Unicode character '\uFEFF'. Byte-order marks are handled as follows:
- When decoding, the UTF-16BE and UTF-16LE charsets ignore byte-order marks; when encoding, they do not write byte-order marks.

- When decoding, the UTF-16 charset interprets a byte-order mark to indicate the byte order of the stream but defaults to big-endian if there is no byte-order mark; when encoding, it uses big-endian byte order and writes a big-endian byte-order mark.
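
The byte-order-mark rules above can be observed directly. This is only a sketch; the class name `BomDemo` and the `hex` helper are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

class BomDemo {
    // Formats bytes as space-separated two-digit hex values.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // Encoding with UTF-16 writes a big-endian BOM (FE FF) first.
        System.out.println(hex("A".getBytes(StandardCharsets.UTF_16)));
        // The explicit-endian charsets write no BOM.
        System.out.println(hex("A".getBytes(StandardCharsets.UTF_16BE)));
        System.out.println(hex("A".getBytes(StandardCharsets.UTF_16LE)));

        // Decoding with UTF-16 honors a little-endian BOM (FF FE) when one is present.
        byte[] littleEndianWithBom = {(byte) 0xFF, (byte) 0xFE, 0x41, 0x00};
        System.out.println(new String(littleEndianWithBom, StandardCharsets.UTF_16));
    }
}
```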
谎友^

2020-12-18 06:01
0xAB is ½ in good old code page 437, which is what Windows terminals will use by default, no matter what code page you actually set.

So, in fact, the char value represents the "«" character to a Java program, and if you render that char in a GUI or run it on a sane operating system, you will get that character. If you want to see proper output in Windows as well, switch your Font settings in CMD away from "Raster Fonts" (click top-left icon, Properties, Font tab). For example, with Lucida Console, I can do this:
```
C:\Users\Documents>java CharDemo
131
a + b is AB
y is K and half is ½    

C:\Users\Documents>chcp 1252
Active code page: 1252

C:\Users\Documents>java CharDemo
131
a + b is AB
y is K and half is «

C:\Users\Documents>chcp 437
Active code page: 437
```
遥遥无期

2020-12-18 06:02

It's a well-known problem with console encoding mismatch on Windows platforms.

The Java runtime assumes that the system console uses the same encoding as the system default encoding. However, Windows uses two separate encodings: the ANSI code page (the system default encoding) and the OEM code page (the console encoding).

So, when you write the Unicode character U+00AB LEFT-POINTING DOUBLE ANGLE QUOTATION MARK to the console, the Java runtime assumes that the console encoding is the ANSI encoding (Windows-1252 in your case), where this character is represented as the byte 0xAB. However, the actual console encoding is the OEM encoding (CP437 in your case), where 0xAB means ½.

Therefore, printing data to the Windows console with System.out.println() produces the wrong result.
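
The mismatch can be reproduced on any platform by encoding with one charset and decoding with the other. This is a rough sketch: the class and method names are made up, and it assumes the JDK in use ships the "windows-1252" and "IBM437" (CP437) charsets, which a stripped-down runtime might not.

```java
import java.nio.charset.Charset;

class CodePageMismatchDemo {
    // Encodes text as the writer would, then decodes the bytes as the reader would.
    static String reinterpret(String text, String writerCharset, String readerCharset) {
        byte[] bytes = text.getBytes(Charset.forName(writerCharset));
        return new String(bytes, Charset.forName(readerCharset));
    }

    public static void main(String[] args) {
        // Java writes « as the ANSI byte 0xAB; the console reads that byte as OEM CP437.
        String seen = reinterpret("\u00AB", "windows-1252", "IBM437");

        // Print the code point in hex so the result does not depend on this console.
        System.out.println(Integer.toHexString(seen.codePointAt(0)));
        System.out.println(seen.equals("\u00BD"));
    }
}
```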

To get correct results you can use System.console().writer().println() instead.
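
A minimal sketch of that suggestion follows. System.console() can return null when the program is not attached to an interactive console (for example, when output is redirected), so this version falls back to System.out in that case. The class and helper names are made up for illustration.

```java
import java.io.Console;
import java.io.PrintWriter;

class ConsoleWriterDemo {
    // Prefers the console's own writer; falls back to System.out when there is no console.
    static PrintWriter consoleOrStdout() {
        Console console = System.console();
        if (console != null) {
            return console.writer();
        }
        return new PrintWriter(System.out, true);
    }

    public static void main(String[] args) {
        PrintWriter out = consoleOrStdout();
        out.println("\u00AB and \u00BD");
        out.flush();
    }
}
```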
