I've been using "unicode strings" in Windows for as long as... I've learned about Unicode (e.g. after graduating). However, it always mystified me that the Win32AP
what normalization form is used by default for user input
Depends on your keyboard layout/IME. It's possible to generate normal form C, D, or a crazy mixture of both if you want.
Keyboard layouts tend towards NFC because in the pre-Unicode days they would usually have been outputting a single-byte character in the local code page for each keypress. However, there are exceptions.
For example, using the Windows Vietnamese keyboard layout, some diacritics are typed as a single keypress combined with the letter (e.g. circumflex â) and some are typed as a combining diacritical (e.g. grave à). The grapheme a-with-circumflex-and-grave would be typed as a-circumflex followed by combining-grave, ầ, which would be 0xE2,0xCC in Vietnamese code page 1258, and would come out as U+00E2,U+0300 in Unicode.
This isn't in normal form C (which would be ầ, U+1EA7 Latin small letter A with circumflex and grave) nor D (which would be ầ, U+0061,U+0302,U+0300).
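To see the three spellings side by side, here is a minimal Python sketch. It assumes Python's built-in `cp1258` codec and the standard `unicodedata` module; the helper name `codepoints` is made up for illustration.

```python
import unicodedata

def codepoints(s: str) -> str:
    """Render a string as a comma-separated list of U+XXXX code points."""
    return ",".join(f"U+{ord(c):04X}" for c in s)

# The two bytes the Vietnamese layout would produce in code page 1258:
# a-circumflex followed by combining grave.
typed = b"\xe2\xcc".decode("cp1258")

# Normalise the typed sequence to each form.
nfc = unicodedata.normalize("NFC", typed)
nfd = unicodedata.normalize("NFD", typed)

print("typed:", codepoints(typed))  # U+00E2,U+0300
print("NFC:  ", codepoints(nfc))    # U+1EA7
print("NFD:  ", codepoints(nfd))    # U+0061,U+0302,U+0300

# The typed sequence matches neither normal form.
print(typed == nfc, typed == nfd)   # False False
```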
There is generally a cultural preference for NFC in the Windows world and on the web, and for NFD in the Apple world. But it's not rigorously enforced and you should expect to cope with any mixture of combined and decomposed characters.
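One common way to cope with mixed input is to normalise both sides before comparing. The following is a rough sketch rather than a recommendation for any particular API; `same_text` is a hypothetical helper, and the choice of NFC as the default is an assumption.

```python
import unicodedata

def same_text(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two strings after normalising both to the same form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

mixed = "\u00e2\u0300"   # a-circumflex + combining grave
nfc = "\u1ea7"           # precomposed a with circumflex and grave
nfd = "a\u0302\u0300"    # a + combining circumflex + combining grave

print(mixed == nfc)            # False: raw comparison sees different code points
print(same_text(mixed, nfc))   # True
print(same_text(nfc, nfd))     # True
```

Normalising at the boundary (when input arrives) and comparing raw strings afterwards would work too; which is better depends on whether you need to preserve the original spelling.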
are the kernel and file system normalization-agnostic?
Yes, the kernel and filesystem don't know anything about normalisation and will quite happily allow you to have files with the names ầ.txt, ầ.txt and ầ.txt (the mixed, NFC and NFD spellings above) in the same folder.
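If you want to check this on your own machine, a small experiment along these lines should do. It is only a sketch: `count_distinct_files` is a made-up helper, and the result depends on the filesystem under the temporary directory. A normalisation-agnostic filesystem would be expected to report 3, while one that normalises or folds equivalent names would report fewer.

```python
import os
import tempfile

# Three canonically equivalent spellings of the same visible file name.
NAMES = [
    "\u00e2\u0300.txt",    # mixed: a-circumflex + combining grave
    "\u1ea7.txt",          # NFC
    "a\u0302\u0300.txt",   # NFD
]

def count_distinct_files(directory: str) -> int:
    """Create one file per spelling and count the resulting directory entries."""
    for name in NAMES:
        with open(os.path.join(directory, name), "w", encoding="utf-8") as f:
            f.write("test")
    return len(os.listdir(directory))

with tempfile.TemporaryDirectory() as tmp:
    print(count_distinct_files(tmp))
```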