How does Java handle Unicode characters?

Question

I read this blog entry about Perl and how it handles Unicode and Unicode normalization. The short version, as I understand it, is that there are several ways to write the identifier "é" in Unicode: either as a single Unicode character or as a combination of two characters. A Perl program may not be able to distinguish between them, which causes strange errors.

So that got me thinking: how does the Java editor in Eclipse handle Unicode? Or Java in general, since I guess that's the same question.

On one hand the specification says:

Two identifiers are the same only if they are identical, that is, have the same Unicode character for each letter or digit.

But on the other hand, Unicode characters are translated:

This translation step allows any program to be expressed using only ASCII characters.

These two statements seem to contradict each other?

Answer 1:

The translation step refers to the first step of the lexical translation process:

A translation of Unicode escapes (§3.3) in the raw stream of Unicode characters to the corresponding Unicode character. A Unicode escape of the form \uxxxx, where xxxx is a hexadecimal value, represents the Unicode character whose encoding is xxxx. This translation step allows any program to be expressed using only ASCII characters.

The lexical translation process allows Unicode characters to be written in your source code as escape sequences that consist of ASCII characters alone. It is therefore possible to name an identifier using any valid Unicode characters while keeping the source file pure ASCII, by writing those characters as Unicode escapes.

The translation of Unicode escapes is the first step of lexical translation, so it happens before the compiler splits the input into tokens and checks identifiers. Identifiers are therefore compared on the translated sequence of characters, irrespective of how they were written in the source. As a result, the following code will not compile: the two identifiers have the same name, even though one of them is written as an escape:

package info.example.i18n;

public class UnicodeEscape
{
    int a;
    int \u0061; // Hex(61) = Dec(97) = 'a' in ASCII-7
}
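For contrast, here is a minimal sketch of a version that does compile, because the name is declared only once (as an escape) and then referenced in its plain form. The class name EscapeDemo and the value 42 are made up for illustration:

```java
public class EscapeDemo {
    // The escape below is translated to the letter a before tokenization,
    // so this declares a static field named a.
    static int \u0061 = 42;

    public static void main(String[] args) {
        // The plain spelling refers to the very same field.
        System.out.println(a);
    }
}
```

Running it should print 42, which shows that the escaped and plain spellings name one and the same identifier.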

Answer 2:

Expressing characters as Unicode escapes is distinct from Unicode combining characters.

as I understand it, is that there are several ways to write the identifier "é" in Unicode: either as a single Unicode character or as a combination of two characters.

Specifically, é can be represented either by the single codepoint U+00E9 or the combining sequence U+0065 U+0301. These forms are NFC and NFD respectively and you can normalize between them.
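As a rough sketch of normalizing between the two forms at runtime, the standard library class java.text.Normalizer can be used. The class and method names here (NormalizeDemo, sameAfterNfc) are hypothetical and chosen only for this example:

```java
import java.text.Normalizer;

public class NormalizeDemo {
    // Returns true when the two strings are equal after both are
    // converted to NFC (the composed form).
    static boolean sameAfterNfc(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFC)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFC));
    }

    public static void main(String[] args) {
        String composed = "\u00E9";     // single code point U+00E9
        String decomposed = "e\u0301";  // U+0065 followed by U+0301

        // Plain equals compares UTF-16 code units, so the forms differ.
        System.out.println(composed.equals(decomposed));        // false
        // After normalization they compare equal.
        System.out.println(sameAfterNfc(composed, decomposed)); // true
        // Decomposing the single code point yields two chars.
        System.out.println(
            Normalizer.normalize(composed, Normalizer.Form.NFD).length()); // 2
    }
}
```

This is something a program has to do explicitly on its own strings; it says nothing about what the compiler does with identifiers.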

The Java compiler does not perform normalization, so this is legal:

public class EAcute {
  int \u00E9;
  int \u0065\u0301; 
}

...even though, when written as literal graphemes, the two declarations appear to conflict:

public class EAcute {
  int é;
  int é; 
}

Here is a hex dump of the latter form encoded as UTF-8:

0000000: 7075 626c 6963 2063 6c61 7373 2045 4163  public class EAc
0000010: 7574 6520 7b0a 2020 696e 7420 c3a9 3b0a  ute {.  int ..;.
0000020: 2020 696e 7420 65cc 813b 200a 7d0a         int e..; .}.

So, while é (C3A9) and \u00E9 are treated as equivalent by the compiler, as are é (65CC81) and \u0065\u0301, the composed and decomposed forms are not treated as equivalent to each other.
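The byte sequences in the hex dump can be reproduced with a short sketch that encodes each form as UTF-8 and prints the bytes in hex. The class name Utf8Bytes and the helper hex are invented for this illustration:

```java
import java.nio.charset.StandardCharsets;

public class Utf8Bytes {
    // Encodes the string as UTF-8 and returns the bytes as uppercase hex.
    static String hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex("\u00E9"));   // C3A9   (composed form)
        System.out.println(hex("e\u0301"));  // 65CC81 (decomposed form)
    }
}
```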

Answer 3:

The specification is saying that any Unicode character can be represented in ASCII in the form:

\uxxxx

Here the characters "\" and "u" are ASCII characters, and "xxxx" stands for four hexadecimal digits (which can likewise be written in ASCII).

This means the translation between Unicode and ASCII has been formalised for the Java programming language. Implementations of the language can therefore support both ASCII and Unicode editors, and can similarly output stack traces, etc., to either ASCII or Unicode systems.
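To illustrate the direction from Unicode to ASCII, here is a small sketch that rewrites a string so that every non-ASCII UTF-16 code unit becomes an escape of that form. The class AsciiEscaper and its method toAsciiEscapes are hypothetical names, not part of any standard API:

```java
public class AsciiEscaper {
    // Replaces every char outside the 7-bit ASCII range with a
    // backslash-u escape made of four lowercase hex digits.
    static String toAsciiEscapes(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 128) {
                sb.append(c);
            } else {
                sb.append(String.format("\\u%04x", (int) c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The identifier below contains U+00E9; the output is pure ASCII.
        System.out.println(toAsciiEscapes("int caf\u00E9;"));
    }
}
```

Because the loop works on UTF-16 code units, a character outside the Basic Multilingual Plane comes out as two escapes (its surrogate pair), which is also how such a character has to be written with escapes in Java source.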

Source: https://stackoverflow.com/questions/7482914/how-does-java-handle-unicode-characters

Tags

java

eclipse

unicode