Recently I encountered a file character encoding issue that I cannot remember ever having faced before. It's quite common to have to be aware of the character encoding of text files.
I've had similar issues when using variable names that aren't ASCII (Σ, σ, Δ, etc.) in math formulas. On Linux, the compiler interpreted the source as UTF-8; on Windows it complained about invalid names because it assumed ISO-LATIN-1. The solution was to specify the encoding in the ant script I used to compile these files.
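For illustration, a minimal sketch (class and method names invented here) of the kind of source that trips this up; under a UTF-8-aware compiler it is legal Java, while a compiler assuming a single-byte encoding mangles the identifier bytes:

```java
public class Formulas {
    static double Σ(double[] values) { // Unicode letters are legal in Java identifiers
        double σ = 0;
        for (double v : values) {
            σ += v;
        }
        return σ;
    }
}
```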
There is no such thing as a String that is encoded as ISO-8859-1 in memory. Java Strings in memory are always Unicode strings. (As of 2011 they were encoded in UTF-16 internally; I think this changed with later Java versions, but you don't really need to know this.)
The encoding comes into play only when you input or output the string; if no explicit encoding is given, the system default is used (which on some systems depends on user settings).
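A small sketch of where the encoding actually enters the picture (the string content is arbitrary):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DefaultEncodingDemo {
    public static void main(String[] args) {
        String s = "h\u00E9llo"; // in memory: always Unicode, however it was written

        // The default encoding only matters at the I/O boundary:
        System.out.println(Charset.defaultCharset()); // e.g. UTF-8 on Linux

        byte[] implicitBytes = s.getBytes();                       // uses the system default
        byte[] explicitBytes = s.getBytes(StandardCharsets.UTF_8); // same on every machine

        System.out.println(implicitBytes.length + " / " + explicitBytes.length);
    }
}
```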
As said by McDowell, the actual encoding of your source file should match the encoding your compiler assumes about your source file, otherwise you get problems like the ones you observed. You can achieve this by several means (the corresponding commands are sketched below):

- Use the -encoding option of the compiler, giving the encoding of your source file. (With ant, you set the encoding= parameter of the javac task.)
- Use some recoding tool (such as recode) to change the encoding of your file to the compiler default.
- Use native2ascii (with the right -encoding option) to translate your source file to ASCII with \uXXXX escapes.

In the last case, you can later compile this file anywhere with any default encoding, so this may be the way to go if you give the source code to encoding-unaware people to compile somewhere.
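Roughly, the three options look like this on the command line (file names are placeholders, and note that newer JDKs no longer ship native2ascii):

```
# 1. Tell the compiler the source encoding directly:
javac -encoding UTF-8 MyClass.java

# 2. Convert the file to the compiler's default encoding (here: to ISO-8859-1):
recode UTF-8..ISO-8859-1 MyClass.java

# 3. Rewrite non-ASCII characters as \uXXXX escapes, then compile anywhere:
native2ascii -encoding UTF-8 MyClass.java MyClassAscii.java
```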
If you have a bigger project consisting of more than one file, they should all have the same encoding, since the compiler has only one such switch, not several.
In all the projects I've done in recent years, I encode all my files in UTF-8 and set the encoding="utf-8" parameter on the javac task in my ant buildfile. (My editor is smart enough to recognize the encoding automatically, but I set the default to UTF-8.)
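The relevant line of such a buildfile might look roughly like this (paths are placeholders):

```xml
<javac srcdir="src" destdir="build" encoding="utf-8" includeantruntime="false"/>
```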
The encoding matters to other source-code handling tools too, like javadoc. (There you should additionally set the -charset and -docencoding options for the output; they should match each other, but can differ from the source -encoding.)
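For example, a javadoc invocation with all three options set consistently might look like this (the package name is a placeholder):

```
javadoc -encoding UTF-8 -charset UTF-8 -docencoding UTF-8 -d docs com.example.mypackage
```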
I'd hazard a guess that there is a transcoding issue during the compilation stage and the compiler lacks direction as to the encoding of a source file (e.g. see the javac -encoding switch).
Compilers generally use the system default encoding if you aren't specific, which can lead to string and char literals being corrupted (internally, Java bytecode uses a modified UTF-8 form, so the binaries themselves are portable). This is the only way I can imagine problems being introduced at compile time.
I've written a bit about this here.
Always use escape codes (e.g. \uXXXX) in your source files and this will not be a problem. @Paulo mentioned this, but I wanted to call it out explicitly.
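A tiny sketch of what that looks like in practice; the escaped form is pure ASCII on disk, so it means the same thing no matter what encoding the compiler assumes:

```java
String sigma = "\u03C3"; // exactly the same string as "σ"
char delta = '\u0394';   // Δ
```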