The Haskell 2010 Language Report says:
Haskell uses the Unicode [2] character set. However, source programs are currently biased toward the ASCII character set
Unicode is a character set. UTF-8, UTF-16, etc. are the concrete physical encodings of Unicode code points. Try reading here; the difference is explained pretty well there.
The cited part of the report just states that Haskell sources use the Unicode character set. It doesn't say which encoding should be used at all. In other words, it says which characters may appear in the sources, but not how they are written in terms of plain bytes.
While the Haskell standard simply says Unicode is the set of possible characters (as opposed to e.g. ASCII or Latin-1), it doesn't specify which of the several different encodings (UTF-8, UTF-16, UTF-32, byte order) to use.
Alex, the lexer that comes with the Haskell Platform, requires its input to be UTF-8 encoded*, which is why you see the code you mention. In practice I think all the major implementations of Haskell require source to be in UTF-8.
* This is actually a real problem, as GHC stores strings, and more importantly Data.Text, internally as UTF-16. It would be nice to be able to lex these directly rather than converting back and forth.
There was a proposal to standardize on UTF-8 as the standard encoding of Haskell source files, but I'm not sure if it was accepted or not.
In practice, GHC assumes all input files are UTF-8, but it ignores malformed byte sequences in comments.
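The practical consequence is that the encoding lives at the file boundary, not in the language. A minimal sketch of this (the file name `Example.hs` is hypothetical) writes a source file containing a non-ASCII character with an explicit UTF-8 encoding, then reads it back the same way, which is essentially what GHC does with its input files:

```haskell
import System.IO

main :: IO ()
main = do
  hSetEncoding stdout utf8
  -- Write a source file, explicitly encoding the text as UTF-8
  -- (the encoding GHC assumes for input files).
  withFile "Example.hs" WriteMode $ \h -> do
    hSetEncoding h utf8
    hPutStrLn h "lambda = 'λ'"
  -- Read it back with the same explicit encoding; the resulting String
  -- is a sequence of abstract code points, not bytes.
  withFile "Example.hs" ReadMode $ \h -> do
    hSetEncoding h utf8
    src <- hGetContents h
    putStr src
```

Note that `hSetEncoding` only controls how bytes are decoded at the handle; once the text is in memory, the program works with `Char` values and the original bytes are gone.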
There is an important distinction between the data type (i.e. what “abstract” data you can work with) and its representation (i.e. how it is stored in the computer memory or on disk).
The Haskell Report says two things related to Unicode:
That the Char data type in Haskell represents a Unicode character (also known as a code point). You should think of it as an abstract data type that provides a certain interface (e.g. you can call isDigit or toLower on it), but you are not allowed to know how exactly it is represented internally. A specific implementation of Haskell (e.g. GHC) is free to represent it in memory in whatever way it wants, and it doesn't matter at all, as you can't access the underlying raw bits anyway.
That a Haskell program is text, consisting of (abstract) Unicode code points, that is, essentially, a String. And then it goes on to explain how to parse this String. Once again, it is important to stress that it defines the syntax of Haskell in terms of sequences of abstract Unicode code points.
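To make the same point concretely: a fragment of source text is just a String, and each Char in it is an abstract code point, regardless of how the file holding it happened to be encoded. (The source line below is a made-up example.)

```haskell
import Data.Char (ord)

main :: IO ()
main = do
  let src = "λx = x"   -- a (hypothetical) line of Haskell source
  -- The report defines lexing over these code points, not over bytes:
  print (map ord src)  -- [955,120,32,61,32,120]
```

Whether the file stored 'λ' as two UTF-8 bytes or two UTF-16 bytes is invisible at this level; the lexer only ever sees the code point 955.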
Now, to your question about Haskell source code. The Haskell Report does not specify how this Unicode text is encoded into zeroes and ones when stored in a file.
In fact, the Haskell Report does not specify how Haskell programs are stored at all! It doesn’t mention that Haskell source code is stored in files, that files have to be named after modules, or that the directory structure should follow the structure of module names – these are all considered to be compiler implementation details, and the idea is that this allows each compiler to store Haskell programs wherever and however it wants: in files, in database tables, as jpeg photos of a blackboard with a program written on it in chalk. For this reason it does not specify the encoding either (it would make no sense to specify the encoding for a program written out on a blackboard).