Convert bytes to a string

野性不改 2020-11-21 04:45

I'm using this code to get standard output from an external program:

>>> from subprocess import *
>>> command_stdout = Popen(['ls', '-l

19 Answers
  • 2020-11-21 05:08

    Set universal_newlines to True, which makes Popen return the output as str instead of bytes, i.e.

    command_stdout = Popen(['ls', '-l'], stdout=PIPE, universal_newlines=True).communicate()[0]
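
    A minimal sketch of the difference, assuming Python 3. It runs the current interpreter (sys.executable) as a stand-in for ls so that it works on any platform; the child command is an illustrative assumption, not part of the answer.

```python
import sys
from subprocess import PIPE, Popen

# Stand-in child process: the current interpreter printing one line.
cmd = [sys.executable, "-c", "print('hello')"]

# Default: pipes are binary, so communicate() returns bytes.
raw = Popen(cmd, stdout=PIPE).communicate()[0]

# universal_newlines=True: pipes are opened in text mode, so the
# output is decoded to str and line endings are normalized to '\n'.
text = Popen(cmd, stdout=PIPE, universal_newlines=True).communicate()[0]

print(type(raw).__name__, type(text).__name__)
```

    The decoding uses the locale's preferred encoding, so this suits output that is known to match the system settings.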
    
  • 2020-11-21 05:09

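    If you want to convert arbitrary bytes to text, not just a string that was previously encoded to bytes:

    import base64
    import json

    # Base85 output is ASCII-only bytes, so decoding it as ASCII always succeeds
    with open("bytesfile", "rb") as infile:
        text = base64.b85encode(infile.read()).decode("ascii")

    # Alternatively, serialize the bytes as a JSON list of integers
    with open("bytesfile", "rb") as infile:
        text2 = json.dumps(list(infile.read()))
    

    Neither form is compact, however. Base85 adds about 25%, and the JSON list is far larger: it will turn a 2 MB picture into roughly 9 MB.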
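
    To make the size cost concrete, here is a small sketch that encodes an in-memory sample instead of a file (the sample data is an assumption chosen for illustration) and checks that both text forms can be turned back into the original bytes.

```python
import base64
import json

data = bytes(range(256)) * 4  # 1024 bytes of sample binary data

b85_text = base64.b85encode(data).decode("ascii")
json_text = json.dumps(list(data))

# Both representations are lossless.
assert base64.b85decode(b85_text) == data
assert bytes(json.loads(json_text)) == data

# Compare the size of the input with the size of each text form.
print(len(data), len(b85_text), len(json_text))
```

    Base85 maps every 4 input bytes to 5 characters, while the JSON list spends several characters on each byte.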

  • 2020-11-21 05:12

    Since this question is actually asking about subprocess output, more direct approaches are available. The most modern is to call subprocess.check_output with text=True (Python 3.7+), which automatically decodes stdout using the system default encoding:

    import subprocess
    text = subprocess.check_output(["ls", "-l"], text=True)
    

    For Python 3.6, Popen accepts an encoding keyword:

    >>> from subprocess import Popen, PIPE
    >>> text = Popen(['ls', '-l'], stdout=PIPE, encoding='utf-8').communicate()[0]
    >>> type(text)
    str
    >>> print(text)
    total 0
    -rw-r--r-- 1 wim badger 0 May 31 12:45 some_file.txt
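
    The same encoding keyword is accepted by subprocess.run. A hedged sketch, assuming Python 3.6+ and using the current interpreter as a hypothetical child process in place of ls:

```python
import subprocess
import sys

# Hypothetical child process: writes the UTF-8 bytes for "café" to stdout.
child = "import sys; sys.stdout.buffer.write(b'caf\\xc3\\xa9')"

result = subprocess.run(
    [sys.executable, "-c", child],
    stdout=subprocess.PIPE,
    encoding="utf-8",   # decode stdout as UTF-8 regardless of locale
    errors="replace",   # substitute U+FFFD instead of raising on bad bytes
)

print(type(result.stdout).__name__, len(result.stdout))
```

    Passing errors alongside encoding is optional; it only matters when the child may emit bytes that are invalid in the chosen encoding.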
    

    The general answer to the question in the title, if you're not dealing with subprocess output, is to decode bytes to text:

    >>> b'abcde'.decode()
    'abcde'
    

    With no argument, decode() uses UTF-8, which is what sys.getdefaultencoding() returns on Python 3. If your data is not UTF-8, then you must specify the encoding explicitly in the decode call:

    >>> b'caf\xe9'.decode('cp1250')
    'café'
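
    Why the encoding matters can be shown with the same bytes under different codecs. This is a small sketch, assuming Python 3; the choice of cp437 as the second codec is only for contrast.

```python
data = b"caf\xe9"

as_cp1250 = data.decode("cp1250")  # Central European code page
as_cp437 = data.decode("cp437")    # MS-DOS code page

# Same byte, different characters: U+00E9 versus U+0398.
print(ascii(as_cp1250), ascii(as_cp437))

# As UTF-8 the input is invalid: 0xE9 starts a multi-byte sequence
# that is never completed.
try:
    data.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
print("valid UTF-8:", utf8_ok)
```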
    
  • 2020-11-21 05:13

    While @Aaron Maenpaa's answer just works, a user recently asked:

    Is there any more simply way? 'fhand.read().decode("ASCII")' [...] It's so long!

    You can use:

    command_stdout.decode()
    

    decode() has default arguments, mirroring the signature of codecs.decode:

    codecs.decode(obj, encoding='utf-8', errors='strict')
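
    The errors argument in that signature controls what happens when the bytes are not valid in the chosen encoding. A short sketch of three built-in handlers, assuming Python 3:

```python
data = b"abc\xffdef"

strict_failed = False
try:
    data.decode("utf-8")  # errors='strict' is the default
except UnicodeDecodeError:
    strict_failed = True

replaced = data.decode("utf-8", errors="replace")  # U+FFFD marks the bad byte
ignored = data.decode("utf-8", errors="ignore")    # bad byte silently dropped

print(strict_failed, ascii(replaced), ignored)
```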

  • 2020-11-21 05:18

    If you don't know the encoding, then to read binary input into a string in a way that works on both Python 3 and Python 2, use the ancient MS-DOS CP437 encoding:

    import sys

    PY3K = sys.version_info >= (3, 0)
    
    lines = []
    for line in stream:
        if not PY3K:
            lines.append(line)
        else:
            lines.append(line.decode('cp437'))
    

    Because the encoding is unknown, expect non-English symbols to be translated to cp437 characters (English characters are not affected, because they have the same byte values in most single-byte encodings and in UTF-8).
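
    The reason cp437 never fails is that it assigns a character to every one of the 256 byte values. A quick check of that property, assuming Python 3:

```python
all_bytes = bytes(range(256))

text = all_bytes.decode("cp437")          # never raises: every byte is mapped
assert len(text) == 256
assert text.encode("cp437") == all_bytes  # and the mapping is reversible

# ASCII bytes come through unchanged; high bytes become cp437 symbols.
sample = b"abc\x80".decode("cp437")
print(ascii(sample))
```

    The result is readable only for the ASCII part; the rest is a lossless placeholder rather than the text the author intended.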

    Decoding arbitrary binary input as UTF-8 is unsafe, because you may get this:

    >>> b'\x00\x01\xffsd'.decode('utf-8')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 2: invalid start byte
    

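    The same applies to ASCII, which was the default codec in Python 2: any byte above 0x7F triggers the infamous ordinal not in range error. Python's latin-1 codec, by contrast, maps all 256 byte values and never fails on decoding, even for the points shown as missing in Codepage Layout.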

    UPDATE 20150604: Python 3 has the surrogateescape error strategy for decoding arbitrary binary data into str and encoding it back without data loss or crashes, but it needs conversion tests, [binary] -> [str] -> [binary], to validate both performance and reliability.
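
    The [binary] -> [str] -> [binary] round trip can be checked directly. This is a minimal sketch for Python 3 that covers correctness only, not performance:

```python
raw = b"\x80abc\xff"

text = raw.decode("utf-8", errors="surrogateescape")
# Each undecodable byte 0xNN becomes the lone surrogate U+DCNN.
print(ascii(text))

# Encoding with the same handler restores the original bytes.
restored = text.encode("utf-8", errors="surrogateescape")
print(restored == raw)

# The round trip also holds for every possible byte value.
all_bytes = bytes(range(256))
roundtrip = all_bytes.decode("utf-8", "surrogateescape").encode(
    "utf-8", "surrogateescape"
)
```

    Note that the intermediate str contains lone surrogates, so encoding it with the default strict handler (for example when printing or writing to a file) will raise.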

    UPDATE 20170116: Thanks to a comment by Nearoo, there is also the possibility to backslash-escape all undecodable bytes with the backslashreplace error handler. That works only for Python 3, so even with this workaround you will still get inconsistent output from different Python versions:

    import sys
    
    PY3K = sys.version_info >= (3, 0)
    
    lines = []
    for line in stream:
        if not PY3K:
            lines.append(line)
        else:
            lines.append(line.decode('utf-8', 'backslashreplace'))
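
    For reference, this is what the handler produces on a small sample stream (the sample lines are assumptions for illustration; decoding with backslashreplace needs Python 3.5 or later):

```python
stream = [b"plain\n", b"\x80abc\n", b"caf\xc3\xa9\n"]

# Valid UTF-8 is decoded normally; each invalid byte becomes a
# literal backslash escape such as \x80 in the resulting str.
lines = [line.decode("utf-8", "backslashreplace") for line in stream]

for line in lines:
    print(ascii(line))
```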
    

    See Python’s Unicode Support for details.

    UPDATE 20170119: I decided to implement slash escaping decode that works for both Python 2 and Python 3. It should be slower than the cp437 solution, but it should produce identical results on every Python version.

    # --- preparation
    
    import codecs
    
    def slashescape(err):
        """Codecs error handler. err is a UnicodeDecodeError instance.
        Return a tuple with a replacement for the undecodable part of the
        input and the position where decoding should continue."""
        # The failing span may cover several bytes (e.g. a truncated
        # multi-byte sequence); bytearray yields ints on Python 2 and 3.
        thebytes = bytearray(err.object[err.start:err.end])
        repl = u''.join(u'\\x%02x' % b for b in thebytes)
        return (repl, err.end)
    
    codecs.register_error('slashescape', slashescape)
    
    # --- processing
    
    stream = [b'\x80abc']
    
    lines = []
    for line in stream:
        lines.append(line.decode('utf-8', 'slashescape'))
    
  • 2020-11-21 05:18

    For Python 3, this is a much safer and more Pythonic approach to converting bytes to a string:

    def byte_to_str(bytes_or_str):
        if isinstance(bytes_or_str, bytes): # Check if it's in bytes
            print(bytes_or_str.decode('utf-8'))
        else:
            print("Object not of byte type")
    
    byte_to_str(b'total 0\n-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file1\n-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file2\n')
    

    Output:

    total 0
    -rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file1
    -rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file2
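
    If the decoded text is needed by the caller rather than printed, a variant that returns the value may be more convenient. The helper name to_text and its parameters are hypothetical, chosen only for this sketch:

```python
def to_text(value, encoding="utf-8", errors="strict"):
    """Return value as str, decoding it first if it is bytes-like."""
    if isinstance(value, (bytes, bytearray)):
        return value.decode(encoding, errors)
    if isinstance(value, str):
        return value  # already text, nothing to do
    raise TypeError("expected bytes or str, got %s" % type(value).__name__)


listing = to_text(b"total 0\n-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file1\n")
print(listing.splitlines()[0])
```

    Raising TypeError for other inputs, instead of printing a message, lets the caller decide how to handle the mistake.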
    