How to decode unicode in a Chinese text

情书的邮戳 2021-01-01 05:49
with open('result.txt', 'r') as f:
    data = f.read()

print 'What type is my data:'
print type(data)

for i in data:
    print "what is i:"
    print i
    print i.encode('utf-8')


        
4 Answers
  • 2021-01-01 06:26

    data is a bytestring (str type on Python 2). Your loop looks at one byte at a time (non-ascii characters may be represented using more than one byte in utf-8).
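    A minimal Python 2 sketch of that point (the bytes below are just the UTF-8 encoding of two Chinese characters, chosen for illustration):

    data = '\xe4\xb8\xad\xe6\x96\x87'   # str: the 6 UTF-8 bytes of u'\u4e2d\u6587'
    print len(data)                     # 6 -- counts bytes
    print len(data.decode('utf-8'))     # 2 -- counts characters
    for i in data:                      # iterates one byte at a time, like the loop above
        print repr(i)                   # '\xe4', '\xb8', '\xad', ...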

    Don't call .encode() on bytes:

    $ python2
    >>> '\xe3'.encode('utf-8') #XXX don't do it
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 0: ordinal not in range(128)
    

    "I am trying to read the file and split the words by space and save them into a list."

    To work with Unicode text, use the unicode type in Python 2. You could use io.open() to read Unicode text from a file (here's code that collects all space-separated words into a list):

    #!/usr/bin/env python
    import io
    
    with io.open('result.txt', encoding='utf-8') as file:
        words = [word for line in file for word in line.split()]
    print "\n".join(words)
    
  • 2021-01-01 06:32

    Encoding:

    $ python
    Python 3.7.4 (default, Aug 13 2019, 15:17:50)
    [Clang 4.0.1 (tags/RELEASE_401/final)] :: Anaconda, Inc. on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import base64
    >>> base64.b64encode("我们尊重原创。".encode('utf-8'))
    b'5oiR5Lus5bCK6YeN5Y6f5Yib44CC'
    

    Decoding:

    >>> import base64
    >>> str='5oiR5Lus5bCK6YeN5Y6f5Yib44CC'
    >>> base64.b64decode(str)
    b'\xe6\x88\x91\xe4\xbb\xac\xe5\xb0\x8a\xe9\x87\x8d\xe5\x8e\x9f\xe5\x88\x9b\xe3\x80\x82'
    >>> base64.b64decode(str).decode('utf-8')
    '我们尊重原创。'
    >>>
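    The base64 step is separate from the UTF-8 question itself; a plain encode/decode round trip in the same Python 3 session would look like this (a sketch reusing the string from above):

    >>> data = "我们尊重原创。"             # str (Unicode text)
    >>> raw = data.encode('utf-8')          # bytes, as they would appear in a file
    >>> raw.decode('utf-8') == data         # decoding reverses the encoding
    True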
    
  • 2021-01-01 06:35

    Let me give you some hints:

    • You'll need to decode the bytes you read from UTF-8 into Unicode before you try to iterate over the words.
    • When you read a file, you won't get Unicode back. You'll just get plain bytes. (I think you knew that, since you're already using decode().)
    • There is a standard string method to "split by space" called split().
    • When you say for i in data, you're saying you want to iterate over every byte of the file you just read. Each iteration of your loop will give you a single byte. I'm not sure that's what you want, because it would mean you'd have to do the UTF-8 decoding by hand (rather than using decode(), which must operate on the entire UTF-8 string).

    In other words, here's one line of code that would do it:

    open('file.txt').read().decode('utf-8').split()
    

    If this is homework, please don't turn that in. Your teacher will be onto you. ;-)


    Edit: Here's an example of how to encode and decode Unicode characters in Python:

    >>> data = u"わかりません"
    >>> data
    u'\u308f\u304b\u308a\u307e\u305b\u3093'
    >>> data_you_would_see_in_a_file = data.encode('utf-8')
    >>> data_you_would_see_in_a_file
    '\xe3\x82\x8f\xe3\x81\x8b\xe3\x82\x8a\xe3\x81\xbe\xe3\x81\x9b\xe3\x82\x93'
    >>> for each_unicode_character in data_you_would_see_in_a_file.decode('utf-8'):
    ...     print each_unicode_character
    ... 
    わ
    か
    り
    ま
    せ
    ん
    

    The first thing to note is that Python (well, at least Python 2) uses the u"" notation (note the u prefix) on string constants to show that they are Unicode. In Python 3, strings are Unicode by default, but you can use b"" if you want a byte string.

    As you can see, the Unicode string is made up of \uXXXX code points. When you read the file, you get a string of one-byte characters (which is equivalent to what you get when you call .encode()). So if you have bytes from a file, you must call .decode() to convert them back into Unicode. Then you can iterate over each character.
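    For comparison, the same round trip in Python 3, where strings are Unicode by default and the u prefix is unnecessary (a sketch reusing the string from the example above):

    >>> data = "わかりません"                    # str is Unicode by default in Python 3
    >>> encoded = data.encode('utf-8')           # bytes -- what you would see in a file
    >>> encoded
    b'\xe3\x82\x8f\xe3\x81\x8b\xe3\x82\x8a\xe3\x81\xbe\xe3\x81\x9b\xe3\x82\x93'
    >>> encoded.decode('utf-8') == data          # decoding reverses the encoding
    True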

    Splitting "by space" is something unique to every language, since many languages (for example, Chinese and Japanese) do not uses the ' ' character, like most European languages would. I don't know how to do that in Python off the top of my head, but I'm sure there is a way.

  • 2021-01-01 06:40

    When you call encode on a str with most (all?) codecs (for which encode really makes no sense; str is a byte-oriented type, not a true text type like unicode that would require encoding), Python implicitly decodes it as ASCII first, then encodes with your specified encoding. If you want the str to be interpreted as something other than ASCII, you need to decode the bytes-like str to true text (unicode) yourself.

    When you do i.encode('utf-8') when i is a str, you're implicitly saying i is logically text (represented by bytes in the locale's default encoding), not binary data. So in order to encode it, Python first needs to decode it to determine what the "logical" text is. Your input is probably encoded in some ASCII superset (e.g. latin-1, or even utf-8) and contains non-ASCII bytes; Python tries to decode them using the ascii codec (to figure out the true Unicode ordinals it needs to encode as utf-8), and fails.
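    On Python 2, the failing call is roughly equivalent to this explicit two-step version (a sketch; the byte is the one from the traceback quoted earlier):

    >>> '\xe3'.decode('ascii').encode('utf-8')  # roughly what '\xe3'.encode('utf-8') does
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 0: ordinal not in range(128)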

    You need to do one of:

    1. Explicitly decode the str you read using the correct codec (to get a unicode object), then encode that back to utf-8.
    2. Let Python do the work from #1 for you implicitly. Instead of using open, import io and use io.open (Python 2.7+ only; on Python 3+, io.open and open are the same function), which gets you an open that works like Python 3's open. You can pass this open an encoding argument (e.g. io.open('/path/to/file', 'r', encoding='latin-1')), and reading from the resulting file object will get you already-decoded unicode objects (which can then be encoded to whatever you like). Both options are sketched after the note below.

    Note: #1 will not work if the real encoding is something like utf-8 and you defer the decode until you're iterating over the str byte by byte. For non-ASCII characters, utf-8 is multibyte, so if you only have one byte you can't decode it (the following bytes are needed to compute a single ordinal). This is a reason to prefer io.open, which reads unicode natively, so you're not worrying about stuff like this.
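    A minimal Python 2 sketch of both options (the file name and the utf-8 encoding are taken from the question; everything else is illustrative):

    # Option 1: read bytes, then decode the whole str yourself.
    with open('result.txt', 'r') as f:
        text = f.read().decode('utf-8')          # unicode object

    # Option 2: let io.open decode while reading.
    import io
    with io.open('result.txt', encoding='utf-8') as f:
        text = f.read()                          # already unicode

    for ch in text:                              # iterates characters, not bytes
        print ch.encode('utf-8')                 # encoding unicode is safe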
