I have a very simple bit of Scala code:

```scala
var str = "≤"
for( ch <- str ) { printf("%d, %x", ch.toInt, ch.toInt) ; println }
println
str = "\u2264"
```
To answer my own questions:
Does the Scala compiler work with UTF-8 encoded files?

Yes, but only if it knows they are UTF-8 encoded. In the absence of any other evidence, it uses Java's `file.encoding` property. (Thanks to @AndreasNeumann for this part of the answer.)
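You can see which encoding your JVM (and therefore a default `scalac` run) will fall back on; a minimal sketch:

```scala
import java.nio.charset.Charset

object ShowEncoding extends App {
  // file.encoding is the system property scalac falls back on
  // when no -encoding option is given
  println(System.getProperty("file.encoding"))
  // Charset.defaultCharset reflects the same setting at JVM startup
  println(Charset.defaultCharset())
}
```

On a Mac with an older JVM this typically prints `MacRoman`, which is exactly the situation described below.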
Why did my program not behave as I expected?

Because my `file.encoding` property was set to MacRoman. Even though I had told Eclipse that the file is UTF-8, this information was not communicated to the Scala compiler. Thus the compiler interpreted the 3-byte sequence E2 89 A4 as a three-character sequence according to the MacRoman encoding: a lower single quote (which looks a lot like a comma), an "a" circumflex, and a section symbol. The Unicode for this 3-character sequence is U+201A U+00E2 U+00A7, which explains the output of my program.
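The misinterpretation can be reproduced by decoding those same three bytes with each charset (this assumes the JVM ships the `x-MacRoman` charset, as HotSpot-based JVMs do):

```scala
import java.nio.charset.Charset

object EncodingDemo extends App {
  // The UTF-8 encoding of '≤' (U+2264) is the 3-byte sequence E2 89 A4
  val bytes = Array(0xE2, 0x89, 0xA4).map(_.toByte)

  val asUtf8     = new String(bytes, Charset.forName("UTF-8"))
  val asMacRoman = new String(bytes, Charset.forName("x-MacRoman"))

  def codePoints(s: String) = s.map(c => f"U+${c.toInt}%04X").mkString(" ")

  println(s"UTF-8:    $asUtf8 (${codePoints(asUtf8)})")        // U+2264
  println(s"MacRoman: $asMacRoman (${codePoints(asMacRoman)})") // U+201A U+00E2 U+00A7
}
```

One byte sequence, two charsets: UTF-8 yields the single character ≤, while MacRoman yields the three characters described above.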
How do you fix the problem?

On the command line for scalac, use the option `-encoding UTF-8`. In Eclipse you can use the preferences (options) for the Scala plugin to add this option. (Thanks to @Jesper for this part of the answer.) You can also set the `file.encoding` property with a `-D` option, either on the `scalac` command line or via the `JAVA_OPTS` environment variable. (See the answer of @AndreasNeumann for details.)
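For example, either of the following should work on the command line (a sketch; `Main.scala` is a placeholder for your own source file):

```shell
# Tell the compiler directly which encoding the source files use (preferred)
scalac -encoding UTF-8 Main.scala

# Or set the JVM's file.encoding property for the compiler process
JAVA_OPTS="-Dfile.encoding=UTF-8" scalac Main.scala
```

The `-encoding` flag is preferable because it affects only how source files are read, without changing the default charset for everything else the JVM does.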
If you use the Scala IDE for Eclipse, there are at least three things you can do.

- In the file's properties (right click >> Properties), under the Resource preferences, select UTF-8 as the Text file encoding.
- Add `-encoding UTF-8` under "additional command line parameters" under Compiler >> Scala in the global preferences (or options).
- Add the same `-encoding UTF-8` option as a project specific property setting under Compiler >> Scala.