Best way to decode command line inputs to Unicode Python 2.7 scripts


All my scripts use Unicode literals throughout, with

from __future__ import unicode_literals

but this creates a problem when there is the p

2 Answers

鱼传尺愫

2021-01-07 14:06
sys.getfilesystemencoding() is the correct (but see the examples below) encoding for OS data such as filenames, environment variables, and command-line arguments.

You can see the logic behind the choice: sys.argv[0] may be the path to the script (a filename), so it is natural to assume that it uses the same encoding as other filenames and that the other items in the argv list use the same character encoding as sys.argv[0]. os.environ['PATH'] contains paths, so it is equally natural for environment variables to use the same encoding:
```
$ echo 'import sys; print(sys.argv)' >print_argv.py
$ python print_argv.py
['print_argv.py']
```
Note: sys.argv[0] is the script filename regardless of any other command-line arguments you pass.
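
As a rough sketch of what this means in a Python 2.7 script, the byte strings in sys.argv can be decoded with sys.getfilesystemencoding(). The helper name decode_argv and the 'utf-8' fallback are assumptions made for illustration and are not part of the original answer; the fallback covers the case where Python 2 reports no filesystem encoding at all.

```python
import sys

def decode_argv(argv, encoding=None):
    """Return argv with byte strings decoded to text (hypothetical helper).

    Items that are already text (always the case on Python 3) pass through.
    """
    # Assumption: fall back to UTF-8 if no filesystem encoding is reported.
    encoding = encoding or sys.getfilesystemencoding() or 'utf-8'
    return [a.decode(encoding) if isinstance(a, bytes) else a for a in argv]

# Explicit sample input, so the result does not depend on the local locale.
sample = decode_argv([b'print_argv.py', b'\xe2\x82\xac'], 'utf-8')
print(repr(sample))
```

In a real script you would call decode_argv(sys.argv) once at startup and use the result everywhere else. The examples further down show when this default guess is wrong.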

The "best way" depends on your specific use case. On Windows, for example, you should probably use the Unicode API directly (CommandLineToArgvW()). On POSIX, if all you need is to pass some argv items back to OS functions (such as os.listdir()), you could leave them as bytes: a command-line argument can be an arbitrary byte sequence, see PEP 0383 -- Non-decodable Bytes in System Character Interfaces:
```
import os, sys

os.execl(sys.executable, sys.executable, '-c', 'import sys; print(sys.argv)',
         bytes(bytearray(range(1, 0x100))))
```
As you can see, POSIX allows passing any bytes (except the zero byte).

Obviously, you can also misconfigure your environment:
```
$ LANG=C PYTHONIOENCODING=latin-1 python -c'import sys;
>   print(sys.argv, sys.stdin.encoding, sys.getfilesystemencoding())' €
(['-c', '\xe2\x82\xac'], 'latin-1', 'ANSI_X3.4-1968') # Linux output
```
The output shows that € is encoded using utf-8 but both locale and PYTHONIOENCODING are configured differently.

The examples demonstrate that sys.argv may be encoded using a character encoding that does not correspond to any of the standard encodings, or it may even contain arbitrary binary data (anything except the zero byte) on POSIX, with no character encoding at all. On Windows, I guess, you could paste a Unicode string that can't be encoded using the ANSI or OEM Windows encodings, but you might still get the correct value using the Unicode API (Python 2 probably drops data here).

Python 3 uses a Unicode sys.argv, so it shouldn't lose data on Windows (the Unicode API is used). It also makes it possible to demonstrate that sys.getfilesystemencoding(), not sys.stdin.encoding, is used to decode sys.argv on Linux (where sys.getfilesystemencoding() is derived from the locale):
```
$ LANG=C.UTF-8 PYTHONIOENCODING=latin-1 python3 -c'import sys; print(*map(ascii, sys.argv))' µ
'-c' '\xb5'
$ LANG=C PYTHONIOENCODING=latin-1 python3 -c'import sys; print(*map(ascii, sys.argv))' µ
'-c' '\udcc2\udcb5'
$ LANG=en_US.ISO-8859-15 PYTHONIOENCODING=latin-1 python3 -c'import sys; print(*map(ascii, sys.argv))' µ
'-c' '\xc2\xb5'
```
The output shows that LANG, which defines the locale in this case and therefore sys.getfilesystemencoding() on Linux, determines how the command-line arguments are decoded:
```
$ python3
>>> print(ascii(b'\xc2\xb5'.decode('utf-8')))
'\xb5'
>>> print(ascii(b'\xc2\xb5'.decode('ascii', 'surrogateescape')))
'\udcc2\udcb5'
>>> print(ascii(b'\xc2\xb5'.decode('iso-8859-15')))
'\xc2\xb5'
```
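
Because the surrogateescape error handler is reversible, an argument that Python 3 decoded with the wrong locale encoding can be turned back into its original bytes and decoded again. Here is a minimal sketch, assuming you know (or can guess) the encoding the terminal really used. The function name redecode is hypothetical, and the code is Python 3 only:

```python
def redecode(arg, assumed_encoding, actual_encoding):
    """Undo a wrong decoding of a command-line argument (Python 3 only)."""
    # surrogateescape maps lone surrogates such as '\udcc2' back to the
    # raw bytes they stand for, so the original byte string is recovered.
    raw = arg.encode(assumed_encoding, 'surrogateescape')
    return raw.decode(actual_encoding)

# The two mis-decoded values from the session above:
fixed_c_locale = redecode('\udcc2\udcb5', 'ascii', 'utf-8')
fixed_latin9 = redecode('\xc2\xb5', 'iso-8859-15', 'utf-8')
print(ascii(fixed_c_locale), ascii(fixed_latin9))
```

Both calls give back '\xb5' (the µ character), the same value the C.UTF-8 run produced directly.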

无人及你

2021-01-07 14:13

I don't think getfilesystemencoding will necessarily give the right encoding for the shell. That depends on the shell (and can be customised by the shell, independently of the filesystem). The file system encoding is only concerned with how non-ASCII filenames are stored.

Instead, you should probably be looking at sys.stdin.encoding which will give you the encoding for standard input.

Additionally, you might consider using the type keyword argument when you add an argument:

```
import sys
import argparse as ap

def foo(str_, encoding=sys.stdin.encoding):
    return str_.decode(encoding)

parser = ap.ArgumentParser()
parser.add_argument('my_int', type=int)
parser.add_argument('my_arg', type=foo)
args = parser.parse_args()

print repr(args)
```

Demo:

```
$ python spam.py abc hello
usage: spam.py [-h] my_int my_arg
spam.py: error: argument my_int: invalid int value: 'abc'
$ python spam.py 123 hello
Namespace(my_arg=u'hello', my_int=123)
$ python spam.py 123 ollǝɥ
Namespace(my_arg=u'oll\u01dd\u0265', my_int=123)
```
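
One caveat with using sys.stdin.encoding as the default: on Python 2 it is None when standard input is not a terminal (for example, when the script is run from a pipe), and the decode call then fails. Here is a hedged sketch of a more defensive type callable. The name to_text and the fallback order (stdin encoding, then locale.getpreferredencoding(), then UTF-8) are assumptions, not something this answer prescribes.

```python
import argparse
import locale
import sys

def to_text(value, encoding=None):
    """Decode a byte-string argument to text (hypothetical helper)."""
    if not isinstance(value, bytes):
        return value  # already text, as on Python 3
    if encoding is None:
        # Assumed fallback chain; sys.stdin.encoding may be None.
        encoding = (getattr(sys.stdin, 'encoding', None)
                    or locale.getpreferredencoding()
                    or 'utf-8')
    return value.decode(encoding)

parser = argparse.ArgumentParser()
parser.add_argument('my_int', type=int)
parser.add_argument('my_arg', type=to_text)

# Parse an explicit list so the sketch runs without real command-line input.
args = parser.parse_args(['123', 'hello'])
print(repr(args))
```

The isinstance check lets the same callable run on Python 3, where argparse already hands over text.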

If you have to work with non-ASCII data a lot, I would highly recommend upgrading to Python 3. Everything is a lot easier there; for example, parsed arguments will already be Unicode on Python 3.

Since there is conflicting information about the command line argument encoding around, I decided to test it by changing my shell encoding to latin-1 whilst leaving the file system encoding as utf-8. For my tests I use the c-cedilla character which has a different encoding in these two:

```
>>> u'Ç'.encode('ISO8859-1')
'\xc7'
>>> u'Ç'.encode('utf-8')
'\xc3\x87'
```

Now I create an example script:

```
#!/usr/bin/python2.7
import argparse as ap
import sys

print 'sys.stdin.encoding is ', sys.stdin.encoding
print 'sys.getfilesystemencoding() is', sys.getfilesystemencoding()

def encoded(s):
    print 'encoded', repr(s)
    return s

def decoded_filesystemencoding(s):
    try:
        s = s.decode(sys.getfilesystemencoding())
    except UnicodeDecodeError:
        s = 'failed!'
    return s

def decoded_stdinputencoding(s):
    try:
        s = s.decode(sys.stdin.encoding)
    except UnicodeDecodeError:
        s = 'failed!'
    return s

parser = ap.ArgumentParser()
parser.add_argument('first', type=encoded)
parser.add_argument('second', type=decoded_filesystemencoding)
parser.add_argument('third', type=decoded_stdinputencoding)
args = parser.parse_args()

print repr(args)
```

Then I change my shell encoding to ISO/IEC 8859-1 and call the script:

```
wim-macbook:tmp wim$ ./spam.py Ç Ç Ç
sys.stdin.encoding is  ISO8859-1
sys.getfilesystemencoding() is utf-8
encoded '\xc7'
Namespace(first='\xc7', second='failed!', third=u'\xc7')
```

As you can see, the command-line arguments were encoded in latin-1, so the second command-line argument (using sys.getfilesystemencoding) fails to decode. The third command-line argument (using sys.stdin.encoding) decodes correctly.
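
Since the right encoding evidently depends on how the terminal and locale are set up, one pragmatic option is to try several candidate encodings in order. The sketch below is an illustration under stated assumptions, not a recommendation from this answer: the function name and the candidate order are hypothetical. An 8-bit encoding such as latin-1 accepts any byte sequence, so it should come last.

```python
def decode_with_candidates(raw, candidates):
    """Try each candidate encoding in order; return (text, encoding_used)."""
    for enc in candidates:
        if not enc:
            continue  # skip None, e.g. a missing sys.stdin.encoding
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # Assumed last resort: keep going with replacement characters.
    return raw.decode('utf-8', 'replace'), None

# The two byte forms of c-cedilla from the experiment above.
from_latin1_shell = decode_with_candidates(b'\xc7', ['utf-8', 'latin-1'])
from_utf8_shell = decode_with_candidates(b'\xc3\x87', ['utf-8', 'latin-1'])
print(repr(from_latin1_shell), repr(from_utf8_shell))
```

In a Python 2 script the candidate list might be [sys.stdin.encoding, sys.getfilesystemencoding(), 'utf-8'], with the caveat that a wrong but permissive encoding earlier in the list silently produces the wrong text.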
