Why don't scripting languages output Unicode to the Windows console?

迷失自我 2020-12-08 11:18

The Windows console has been Unicode-aware for at least a decade, and perhaps as far back as Windows NT. However, for some reason the major cross-platform scripting languages

9 answers
  • 2020-12-08 11:27

    For Python, the relevant issue in the tracker is http://bugs.python.org/issue1602 (as noted in the comments). Note that it has been open for 7 years. I tried to publish a working solution (based on information in the issue) as a Python package: https://github.com/Drekin/win-unicode-console, https://pypi.python.org/pypi/win_unicode_console.

  • 2020-12-08 11:30

    Are you sure your script would output Unicode correctly on some other platform? The "wide character in print" warning makes me very suspicious.

    I recommend looking over this overview

  • 2020-12-08 11:32

    A small contribution to the discussion: I am running a Czech-localized Windows XP, which uses the CP1250 code page almost everywhere. The funny thing about the console, though, is that it still uses the legacy DOS code page 852.

    I was able to write a very simple Perl script that prints UTF-8 encoded data to the console using:

    binmode STDOUT, ":utf8:encoding(cp852)";
    

    I tried various options (including utf16le), but only the settings above printed accented Czech characters correctly.
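
The mismatch between the two code pages can be reproduced without a Windows console. The following is a minimal sketch in Python rather than Perl, assuming Python's built-in `cp852` and `cp1250` codecs match the corresponding Windows code pages; the helper name `shown_by_cp852_console` is made up for illustration.

```python
# Sketch: the same Czech text as bytes in the console code page (cp852)
# and in the system "ANSI" code page (cp1250).
text = "Příliš žluťoučký kůň úpěl ďábelské ódy"

as_cp852 = text.encode("cp852")    # bytes a code page 852 console expects
as_cp1250 = text.encode("cp1250")  # bytes the rest of the system tends to use


def shown_by_cp852_console(data: bytes) -> str:
    """Decode bytes the way a console set to code page 852 would (hypothetical helper)."""
    return data.decode("cp852", errors="replace")


# Only the cp852 bytes come back as the original text; the cp1250 bytes
# decode to different characters wherever an accented letter appeared.
round_trip_ok = shown_by_cp852_console(as_cp852) == text
mismatch = shown_by_cp852_console(as_cp1250) != text
```

Both `round_trip_ok` and `mismatch` end up true, which is consistent with the observation that only an explicit cp852 output layer displayed the accents correctly.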

    Edit: I played with the problem a little more and found Win32::Unicode. The module exports a function, printW, that works properly both when writing to the console and when the output is redirected:

    use utf8;
    use Win32::Unicode;
    
    binmode STDOUT, ":utf8";
    printW "Příliš žluťoučký kůň úpěl ďábelské ódy";
    
  • 2020-12-08 11:39

    "Unicode issues in Perl" covers how the Win32 console works with Perl and the transcoding from ANSI to Unicode that happens behind the scenes. This is not just a Perl issue; it affects other languages as well.

  • 2020-12-08 11:42

    I have to unask many of your questions.

    Did you know that

    • Windows uses UTF-16 for its APIs, but still defaults to the various "fun" legacy encodings (e.g. Windows-1252, Windows-1251) in userspace, including file names, and these differ between the many localisations of Windows?
    • you need to encode output, and picking the appropriate encoding for the system is achieved by the locale pragma, and that there is a POSIX standard called locale on which this is built, and Windows is incompatible with it?
    • Perl already supported the so-called "wide" APIs once?
    • Microsoft managed to adapt UTF-8 into their codepage system of character encoding, and you can switch your terminal by issuing the appropriate chcp 65001 command?
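
To see which of these encodings are actually in effect on a given machine, something like the following sketch may help. It is written in Python for illustration; `report_encodings` is a hypothetical helper name, and the Windows branch assumes `kernel32` is reachable through `ctypes.windll` and uses the Win32 functions `GetConsoleOutputCP` and `GetACP`.

```python
import locale
import sys


def report_encodings() -> dict:
    """Collect the encodings that influence console output (hypothetical helper)."""
    info = {
        # Encoding Python chose for its standard output stream.
        "stdout": getattr(sys.stdout, "encoding", None),
        # Encoding derived from the locale settings.
        "locale": locale.getpreferredencoding(False),
    }
    if sys.platform == "win32":
        import ctypes

        kernel32 = ctypes.windll.kernel32
        # Code page the console uses for output; `chcp` changes this value.
        info["console_output_cp"] = kernel32.GetConsoleOutputCP()
        # System "ANSI" code page used by the non-Unicode APIs.
        info["ansi_cp"] = kernel32.GetACP()
    return info


if __name__ == "__main__":
    for name, value in report_encodings().items():
        print(f"{name}: {value}")
```

On Windows, running it before and after `chcp 65001` should show the console output code page changing while the ANSI code page stays the same.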
  • 2020-12-08 11:45

    Why on earth after all these years do they not just simply call the Win32 -W APIs that output UTF-16 Unicode instead of forcing everything through the ANSI/codepage bottleneck?

    Because Perl and Python aren't Windows programs. They're Unix programs that happen to have been mostly ported to Windows. As such, they don't like to call Win32 functions unless necessary. For byte-based I/O, it's not necessary; this can be done with the Standard C Library. UTF-16-based I/O is a special case.
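
To make the "special case" concrete, here is a rough sketch of what calling the wide API directly looks like from a script, written in Python with `ctypes`. The helper names `utf16_code_units` and `write_text` are invented for illustration, and this is not necessarily how any particular interpreter or package does it. It assumes that `GetConsoleMode` failing means standard output is not a console (for example, it is redirected), in which case the code falls back to the ordinary stream.

```python
import sys

STD_OUTPUT_HANDLE = -11 & 0xFFFFFFFF  # (DWORD)-11, as defined in the Win32 headers


def utf16_code_units(text: str) -> int:
    """WriteConsoleW takes a length in UTF-16 code units, not code points."""
    return len(text.encode("utf-16-le")) // 2


def write_text(text: str) -> bool:
    """Write text; return True only if WriteConsoleW was used (hypothetical helper)."""
    if sys.platform == "win32":
        import ctypes
        from ctypes import wintypes

        k32 = ctypes.WinDLL("kernel32", use_last_error=True)
        k32.GetStdHandle.argtypes = [wintypes.DWORD]
        k32.GetStdHandle.restype = wintypes.HANDLE
        k32.GetConsoleMode.argtypes = [wintypes.HANDLE, wintypes.LPDWORD]
        k32.GetConsoleMode.restype = wintypes.BOOL
        k32.WriteConsoleW.argtypes = [
            wintypes.HANDLE, wintypes.LPCWSTR, wintypes.DWORD,
            wintypes.LPDWORD, wintypes.LPVOID,
        ]
        k32.WriteConsoleW.restype = wintypes.BOOL

        handle = k32.GetStdHandle(STD_OUTPUT_HANDLE)
        mode = wintypes.DWORD()
        # GetConsoleMode succeeds only for a real console handle.
        if k32.GetConsoleMode(handle, ctypes.byref(mode)):
            sys.stdout.flush()  # keep ordering with earlier buffered output
            written = wintypes.DWORD()
            if k32.WriteConsoleW(handle, text, utf16_code_units(text),
                                 ctypes.byref(written), None):
                return True
    # Not Windows, not a console, or the call failed: use the normal stream.
    sys.stdout.write(text)
    return False
```

The sketch needs two separate output paths plus a check to choose between them, which is the extra platform-specific work that byte-based I/O through the C library avoids.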

    Or are the -W APIs inherently broken to such a degree that they can't be used as-is?

    I wouldn't say that the -W APIs are inherently broken as much as I'd say that Microsoft's approach to Unicode in C(++) is inherently broken.

    No matter how much certain Windows developers insist that programs should use wchar_t instead of char, there are just too many barriers to switching:

    • Platform dependence:
      • The use of UTF-16 wchar_t on Windows and UTF-32 wchar_t elsewhere. (The new char16_t and char32_t types may help.)
      • The non-standardness of UTF-16 filename functions like _wfopen, _wstat, etc. limits the ability to use wchar_t in cross-platform code.
    • Education. Everybody learns C with printf("Hello, world!\n");, not wprintf(L"Hello, world!\n");. The C textbook I used in college never even mentioned wide characters until Appendix A.13.
    • The existing zillions of lines of code that use char* strings.
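
The platform dependence of wchar_t can be observed from a script as well. The sketch below uses Python's `ctypes.c_wchar`, which mirrors the C `wchar_t` of the platform Python was built for; `wchar_info` is a made-up helper name, and the sample string is only an illustration.

```python
import ctypes


def wchar_info(text: str) -> dict:
    """Compare how many units text needs as UTF-16 versus UTF-32 (hypothetical helper)."""
    return {
        # Size in bytes of the platform's wchar_t as seen by ctypes.
        "sizeof_wchar_t": ctypes.sizeof(ctypes.c_wchar),
        # Number of 16-bit units when the text is encoded as UTF-16.
        "utf16_units": len(text.encode("utf-16-le")) // 2,
        # Number of 32-bit units when the text is encoded as UTF-32.
        "utf32_units": len(text.encode("utf-32-le")) // 4,
    }


if __name__ == "__main__":
    # The last character lies outside the Basic Multilingual Plane,
    # so UTF-16 needs a surrogate pair for it.
    print(wchar_info("k\u016f\u0148 \U0001F40E"))
```

Code that assumes one wchar_t per character therefore behaves differently depending on where it is compiled, which is part of why portable code bases tend to stay with char.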