Decode PowerShell output possibly containing non-ASCII Unicode characters into a Python string

南方客 2020-11-29 12:54

I need to decode PowerShell stdout called from Python into a Python string.

My ultimate goal is to get in a form of a list of strings the names of network adapters o

2 Answers
  • 2020-11-29 13:06

    It's a Python 2 bug already marked as wontfix: https://bugs.python.org/issue19264

    You must use Python 3 if you want to make it work under Windows.

  • 2020-11-29 13:23

    The output character encoding may depend on the specific command, e.g.:

    #!/usr/bin/env python3
    import subprocess
    import sys
    
    encoding = 'utf-32'
    cmd = r'''$env:PYTHONIOENCODING = "%s"; py -3 -c "print('\u270c')"''' % encoding
    data = subprocess.check_output(["powershell", "-C", cmd])
    print(sys.stdout.encoding)
    print(data)
    print(ascii(data.decode(encoding)))
    

    Output

    cp437
    b"\xff\xfe\x00\x00\x0c'\x00\x00\r\x00\x00\x00\n\x00\x00\x00"
    '\u270c\r\n'
    

    The ✌ (U+270C) character is received successfully.

    The character encoding of the child script is set using PYTHONIOENCODING envvar inside the PowerShell session. I've chosen utf-32 for the output encoding so that it would be different from Windows ANSI and OEM code pages for the demonstration.
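
The decoding step can be checked without PowerShell. The sketch below reuses the exact bytes printed in the output above; the variable names are illustrative only.

```python
# Bytes copied from the output shown above (child stdout encoded as utf-32).
data = b"\xff\xfe\x00\x00\x0c'\x00\x00\r\x00\x00\x00\n\x00\x00\x00"

# The 'utf-32' codec reads the leading BOM (ff fe 00 00) to pick the byte
# order and drops it from the decoded text.
text = data.decode('utf-32')
print(ascii(text))  # '\u270c\r\n'

# Decoding with a mismatched single-byte code page does not raise here;
# it silently yields one wrong character per byte.
wrong = data.decode('cp437')
print(len(text), len(wrong))
```

The point is that the decoder on the Python side has to match whatever the child process actually wrote; a wrong guess may fail silently.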

    Notice that the stdout encoding of the parent Python script is the OEM code page (cp437 in this case) -- the script is run from the Windows console. If you redirect the output of the parent Python script to a file/pipe then the ANSI code page (e.g., cp1252) is used by default in Python 3.
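
To see which encodings apply on a given machine, something like the following could be run once from the console and once with output redirected. The helper name `describe_encodings` is hypothetical, and the reported values depend on the Python version and system configuration, so they may differ from the cp437/cp1252 values mentioned above.

```python
import locale
import sys


def describe_encodings():
    """Return the encodings Python reports for the current process."""
    return {
        # Encoding Python uses when writing text to stdout.
        'stdout': sys.stdout.encoding,
        # Locale-preferred encoding (the ANSI code page on Windows).
        'preferred': locale.getpreferredencoding(False),
    }


for name, value in describe_encodings().items():
    print('{}: {}'.format(name, value))
```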

    To decode PowerShell output that might contain characters undecodable in the current OEM code page, you could set [Console]::OutputEncoding temporarily (inspired by @eryksun's comments):

    #!/usr/bin/env python3
    import io
    import sys
    from subprocess import Popen, PIPE
    
    char = ord('✌')
    filename = 'U+{char:04x}.txt'.format(**vars())
    with Popen(["powershell", "-C", '''
        $old = [Console]::OutputEncoding
        [Console]::OutputEncoding = [Text.Encoding]::UTF8
        echo $([char]0x{char:04x}) | fl
        echo $([char]0x{char:04x}) | tee {filename}
        [Console]::OutputEncoding = $old'''.format(**vars())],
               stdout=PIPE) as process:
        print(sys.stdout.encoding)
        for line in io.TextIOWrapper(process.stdout, encoding='utf-8-sig'):
            print(ascii(line))
    print(ascii(open(filename, encoding='utf-16').read()))
    

    Output

    cp437
    '\u270c\n'
    '\u270c\n'
    '\u270c\n'
    

    Both fl and tee use [Console]::OutputEncoding for stdout (the default behavior is as if | Write-Output is appended to the pipelines). tee uses UTF-16 to save text to a file. The output shows that ✌ (U+270C) is decoded successfully.
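
The same save/set/restore pattern could be packaged into a small helper that returns a list of strings. This is a minimal sketch, not a tested solution: `wrap_script`, `decode_lines` and `run_powershell` are hypothetical names, and the example pipeline in the final comment is an assumption about what the question needs.

```python
import io
import subprocess

# Mirrors the pattern above: remember the old encoding, switch to UTF-8,
# run the script, then restore the old value.
PS_TEMPLATE = '''
$old = [Console]::OutputEncoding
[Console]::OutputEncoding = [Text.Encoding]::UTF8
{script}
[Console]::OutputEncoding = $old'''


def wrap_script(script):
    """Embed a PowerShell script in the UTF-8 output template."""
    return PS_TEMPLATE.format(script=script)


def decode_lines(raw):
    """Decode UTF-8 bytes (optional BOM) into a list of non-empty lines."""
    stream = io.TextIOWrapper(io.BytesIO(raw), encoding='utf-8-sig')
    return [line.strip() for line in stream if line.strip()]


def run_powershell(script):
    """Run the script via powershell and return its output lines (Windows only)."""
    raw = subprocess.check_output(['powershell', '-C', wrap_script(script)])
    return decode_lines(raw)

# Assumed usage on Windows:
# names = run_powershell('Get-NetAdapter | select -ExpandProperty Name')
```

Keeping the decoding in a separate function makes it possible to check it against captured bytes without starting PowerShell.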

    $OutputEncoding is used to decode bytes in the middle of a pipeline:

    #!/usr/bin/env python3
    import subprocess
    
    cmd = r'''
      $OutputEncoding = [Console]::OutputEncoding = New-Object System.Text.UTF8Encoding
      py -3 -c "import os; os.write(1, '\U0001f60a'.encode('utf-8')+b'\n')" |
      py -3 -c "import os; print(os.read(0, 512))"
    '''
    subprocess.check_call(["powershell", "-C", cmd])
    

    Output

    b'\xf0\x9f\x98\x8a\r\n'
    

    This is correct: b'\xf0\x9f\x98\x8a'.decode('utf-8') == u'\U0001f60a'. With the default $OutputEncoding (ascii) we would get b'????\r\n' instead.
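
A simplified pure-Python model may help explain where the four question marks come from. It assumes PowerShell decodes the first command's bytes with the console encoding (cp437 here, as in the earlier output) and re-encodes the resulting text with $OutputEncoding for the next command; `through_pipe` is a hypothetical name and the model ignores the newline handling.

```python
def through_pipe(data, console_encoding, output_encoding):
    """Model a pipeline hop: decode the bytes, then re-encode the text."""
    text = data.decode(console_encoding, errors='replace')
    return text.encode(output_encoding, errors='replace')


smiley = '\U0001f60a'.encode('utf-8')  # four bytes: f0 9f 98 8a

# Defaults: each of the four bytes becomes a non-ASCII cp437 character,
# and none of those can be encoded as ascii.
print(through_pipe(smiley, 'cp437', 'ascii'))  # b'????'

# Both encodings set to UTF-8: the bytes survive unchanged.
print(through_pipe(smiley, 'utf-8', 'utf-8'))  # b'\xf0\x9f\x98\x8a'
```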

    Note:

    • b'\n' is replaced with b'\r\n' despite using a binary API such as os.read/os.write (msvcrt.setmode(sys.stdout.fileno(), os.O_BINARY) has no effect here)
    • b'\r\n' is appended if there is no newline in the output:

      #!/usr/bin/env python3
      from subprocess import check_output
      
      cmd = '''py -3 -c "print('no newline in the input', end='')"'''
      cat = '''py -3 -c "import os; os.write(1, os.read(0, 512))"'''  # pass as is
      piped = check_output(['powershell', '-C', '{cmd} | {cat}'.format(**vars())])
      no_pipe = check_output(['powershell', '-C', '{cmd}'.format(**vars())])
      print('piped:   {piped}\nno pipe: {no_pipe}'.format(**vars()))
      

      Output:

      piped:   b'no newline in the input\r\n'
      no pipe: b'no newline in the input'
      

      The newline is appended to the piped output.
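
On the receiving side, both effects can be undone after decoding. The helpers below are a sketch with hypothetical names; note that an appended newline cannot be told apart from one the command really printed, so `strip_final_newline` is only appropriate when the trailing newline is known to be unwanted.

```python
def to_text(raw, encoding='utf-8'):
    """Decode piped output and turn CRLF line endings back into LF."""
    return raw.decode(encoding).replace('\r\n', '\n')


def strip_final_newline(text):
    """Drop a single trailing newline, if present."""
    return text[:-1] if text.endswith('\n') else text


piped = b'no newline in the input\r\n'
print(repr(strip_final_newline(to_text(piped))))
```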

    If we ignore lone surrogates then setting UTF8Encoding allows all Unicode characters, including non-BMP characters, to pass through pipes. Text mode could be used in Python if $env:PYTHONIOENCODING = "utf-8:ignore" is configured.
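
The reason for the ":ignore" error handler can be seen in plain Python: a lone surrogate cannot be encoded as strict UTF-8, while a non-BMP character can. The variable names below are illustrative.

```python
lone = 'a\ud800b'  # contains a lone surrogate (not a valid character)

try:
    lone.encode('utf-8')  # the strict handler rejects the surrogate
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# With the 'ignore' handler the surrogate is dropped silently.
ignored = lone.encode('utf-8', 'ignore')

# A non-BMP character is not a lone surrogate and encodes without errors.
non_bmp = '\U0001f60a'.encode('utf-8')

print(strict_ok, ignored, non_bmp)
```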

    "In interactive PowerShell, running Get-NetAdapter | select Name | fl displayed the name correctly, even its non-cp437 character."

    If stdout is not redirected then the Unicode API is used to print characters to the console -- any [BMP] Unicode character can be displayed if the console (TrueType) font supports it.

    "When I called PowerShell from Python, non-ASCII characters were converted to the closest ASCII characters (e.g. ā to a, ž to z) and .decode('ascii') worked nicely."

    It might be due to System.Text.InternalDecoderBestFitFallback set for [Console]::OutputEncoding -- if a Unicode character can't be encoded in a given encoding then it is passed to the fallback (either a best fit char or '?' is used instead of the original character).
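
The effect of such a fallback can be imitated in Python. This is only an approximation built on Unicode decomposition, not the .NET best-fit tables, so the results may differ from what PowerShell produces for some characters; `best_fit_like` is a hypothetical name.

```python
import unicodedata


def best_fit_like(text):
    """Approximate a best-fit fallback: strip accents, else use '?'."""
    result = []
    for ch in text:
        # NFKD splits e.g. 'ā' into 'a' plus a combining macron.
        decomposed = unicodedata.normalize('NFKD', ch)
        ascii_part = decomposed.encode('ascii', 'ignore').decode('ascii')
        # Keep the ASCII base if there is one, otherwise fall back to '?'.
        result.append(ascii_part if ascii_part else '?')
    return ''.join(result)


print(best_fit_like('āž'))  # az
print(best_fit_like('✌'))   # ?
```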

    "Could this behavior (and correspondingly the solution) be Windows version dependent? I am on Windows 10, but users could be on older Windows down to Windows 7."

    If we ignore bugs in cp65001 and a list of new encodings that are supported in later versions then the behavior should be the same.
