Unicode filenames on FAT-32?

南笙 2021-02-19 05:28

As far as I understand, NTFS supports Unicode filenames (UTF-16, as Microsoft claims?).

But official MSDN documentation is very vague regarding what codepage(s) is used

2 answers
  • 2021-02-19 05:45

    The basic FAT or FAT32 directory entries support only short names (the old DOS 8.3 format) in the current OEM codepage. However, VFAT (FAT with long filename support), which Windows uses, can store an additional, so-called long filename for each file, in UTF-16.
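
    To make the two encodings concrete, here is a minimal sketch in Python. It assumes CP437 as the OEM codepage (Python's "cp437" codec) and uses a hypothetical helper name; it only shows the byte representations, not the actual on-disk directory entry layout.

```python
def vfat_name_bytes(short_name: str, long_name: str, oem_codec: str = "cp437"):
    """Return (short-name bytes, long-name bytes) for illustration only.

    oem_codec is an assumption: the real OEM codepage depends on the system.
    """
    short_bytes = short_name.encode(oem_codec)   # one byte per character in CP437
    long_bytes = long_name.encode("utf-16-le")   # two bytes per BMP character
    return short_bytes, long_bytes


short_bytes, long_bytes = vfat_name_bytes("ASDFΣ.TXT", "asdfΣ.txt")
print(short_bytes)        # Σ becomes the single OEM byte 0xE4
print(long_bytes.hex())   # Σ becomes the UTF-16LE byte pair a3 03
```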

  • 2021-02-19 05:46

    You might have to experiment here. This is a great question, and I'm not 100% confident, but:

    So what is the actual codepage for FAT-32 filenames? It depends on the system codepage at the time when FAT volume was created?

    The "OEM codepage", whatever that is for the system.

    Can FAT support true Double Byte Character Set codepages like UTF-16? Or Multi Byte Character Set codepages like UTF-8 is the limit?

    No, I don't believe FAT is directly capable of either UTF-16 or UTF-8. That said, Microsoft stores the Unicode filename out of band, so a file has two filenames. (This is also how you can have filenames longer than the 8.3 format allows.)

    And more specific question: What happens when I use CreateFileW function (which, as MSDN states, use UTF-16 as filename codepage) to create a file on FAT-32 volume?

    The Unicode filename passed to CreateFileW is stored as-is in the out-of-band long filename. It is also re-encoded into the OEM codepage (whatever that happens to be on the system) and stored as the short 8.3 name. If it cannot be converted into the OEM codepage, or does not fit the 8.3 format, Windows will call the file something like FILENA~1.TXT.
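
    The decision described above can be sketched roughly as follows. This is not Windows' real alias algorithm (which also handles collisions, further illegal characters, and more); the function names are hypothetical and CP437 is assumed as the OEM codepage.

```python
def encodable(text: str, codec: str) -> bool:
    """True if every character of text exists in the given codepage."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


def short_name_sketch(long_name: str, oem_codec: str = "cp437") -> str:
    """Simplified, hypothetical 8.3 alias generation for illustration."""
    base, dot, ext = long_name.rpartition(".")
    if not dot:                      # no extension at all
        base, ext = long_name, ""
    fits_8_3 = (1 <= len(base) <= 8 and len(ext) <= 3
                and " " not in long_name and "." not in base)
    if fits_8_3 and encodable(long_name, oem_codec):
        return long_name.upper()     # the name survives as-is, uppercased

    def clean(part: str) -> str:
        # Drop spaces, dots, and anything the OEM codepage cannot hold.
        return "".join(ch for ch in part
                       if ch not in " ." and encodable(ch, oem_codec)).upper()

    alias = clean(base)[:6] + "~1"   # always "~1": collisions are ignored here
    short_ext = clean(ext)[:3]
    return f"{alias}.{short_ext}" if short_ext else alias


for name in ("asdfΣ.txt", "asdfΛ.txt", "filename with spaces.txt"):
    print(name, "->", short_name_sketch(name))
```

    With CP437 assumed, the Σ name is kept, while the Λ name and the overlong name both fall back to a ~1 alias.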

    Some citations for these answers:

    First, this page tells us that the OEM code page != the Windows code page:

    Non-Unicode applications that create FAT files sometimes have to use the standard C runtime library conversion functions to translate between the Windows code page character set and the OEM code page character set. With Unicode implementations of the file system functions, it is not necessary to perform such translations.

    On a typical American system, the OEM code page is CP437, but the Windows code page is Windows-1252. (The FooA calls, I believe, use the Windows code page, which is typically Windows-1252 on an American machine but depends on the locale.)
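
    A quick way to see that the two code pages really differ is to decode the same byte with both. This sketch uses Python's built-in "cp437" and "cp1252" codecs as stand-ins for the OEM and Windows code pages of a US system; the byte 0xE4 is just a convenient example.

```python
raw = bytes([0xE4])               # a single byte, as it might sit in a short name

oem_text = raw.decode("cp437")    # interpreted as OEM code page: Σ (U+03A3)
ansi_text = raw.decode("cp1252")  # interpreted as Windows code page: ä (U+00E4)

print(f"cp437:  U+{ord(oem_text):04X}")
print(f"cp1252: U+{ord(ansi_text):04X}")
```

    The same byte is a different character depending on which code page is used to read it, which is why non-Unicode applications have to translate between the two.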

    If you have a FAT volume available, you can see this in action. The character "Σ" (U+03A3) is not present in Windows-1252, but it is in CP437. You can see both the short and long filenames with dir /X. With a file named asdfΣ.txt, you'll see:

    ASDFΣ.TXT    asdfΣ.txt
    

    However, with a file named "asdfΛ.txt" (Λ is not present in either CP437 or Windows-1252), you'll see:

    ASDF~1.TXT   asdf?.txt
    

    (You'll likely see ?, because cmd.exe's font cannot display a Λ.)

    For information about long filenames, see this Wikipedia article.

    Also, interestingly, if you name a file "asdf©.txt", you might get:

    ASDFC.TXT    asdfc.txt
    

    … I'm not 100% sure here, but I think Windows cleverly decided to substitute "c" for ©, and did likewise for displaying it. If you change the font to something not raster based, like Consolas, you'll see:

    ASDFC.TXT    asdf©.txt
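
    If you don't have a FAT volume handy, you can at least check which of the three test characters each code page contains. This is a small sketch using Python's "cp437" and "cp1252" codecs (assumed here to match the OEM and Windows code pages of a US system). Python's strict encoding does not perform the "c" for © substitution; it simply reports the character as missing.

```python
CODE_PAGES = {"OEM cp437": "cp437", "Windows cp1252": "cp1252"}


def is_encodable(ch: str, codec: str) -> bool:
    """True if the character exists in the given code page."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


def coverage(chars: str) -> dict:
    """Map each character to {code page label: present?}."""
    return {ch: {label: is_encodable(ch, codec)
                 for label, codec in CODE_PAGES.items()}
            for ch in chars}


table = coverage("ΣΛ©")
for ch, row in table.items():
    print(f"U+{ord(ch):04X}: {row}")
```

    Σ is only in CP437, © is only in Windows-1252, and Λ is in neither, which lines up with the three dir /X results above.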
    

    And this is why you should use the FooW functions.
