UTF-8 problem in Python when reading chars

温柔的废话 2021-02-06 06:58

I'm using Python 2.5. What is going on here? What have I misunderstood? How can I fix it?

in.txt:

Stäckövérfløw

code.py

5 answers
  • 2021-02-06 07:22

    Check this out:

    # -*- coding: utf-8 -*-
    import pprint
    f = open('unicode.txt','r')
    for line in f:
        print line
        pprint.pprint(line)
        for i in line:
            print i,
    f.close()
    

    It returns this:

    Stäckövérfløw
    'St\xc3\xa4ck\xc3\xb6v\xc3\xa9rfl\xc3\xb8w'
    S t ? ? c k ? ? v ? ? r f l ? ? w

    The thing is that the file is just being read as a string of bytes. Iterating over that string yields one byte at a time, which splits each multibyte UTF-8 character into bytes that are meaningless on their own.
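
    To make the byte/character difference concrete, here is a minimal sketch. It assumes Python 2.6 or later (or Python 3) for the b'' literal, which the question's Python 2.5 does not have, and the names raw and text are only illustrative. The byte string is the same one shown in the pprint output above.

```python
# -*- coding: utf-8 -*-
# The bytes a UTF-8 encoded file holds for the sample word
# (identical to the pprint output shown above).
raw = b'St\xc3\xa4ck\xc3\xb6v\xc3\xa9rfl\xc3\xb8w'

# Decoding turns the byte string into a sequence of characters.
text = raw.decode('utf-8')

byte_count = len(raw)    # counts bytes: each non-ASCII letter takes two
char_count = len(text)   # counts characters
chars = list(text)       # iterating the decoded text yields whole characters
```

    Four of the letters occupy two bytes each, so the byte string is four items longer than the decoded text.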

  • 2021-02-06 07:22
    print c,
    

    This adds a blank character (a space) after each item, which breaks valid UTF-8 sequences into invalid ones. So it will not work unless you write each single byte to the output unchanged:

    sys.stdout.write(i)
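
    As a rough illustration of why the inserted space matters, the sketch below joins the two bytes of one character with a space and checks whether the result still decodes. It assumes Python 2.6 or later (or Python 3); is_valid_utf8 is a hypothetical helper written for this example, not a library function.

```python
def is_valid_utf8(data):
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False

raw = b'\xc3\xa4'  # the two UTF-8 bytes of the letter a-umlaut

# Roughly what printing byte by byte with a separator emits:
# the same bytes with a space pushed between them.
spaced = b' '.join(raw[i:i + 1] for i in range(len(raw)))

intact_ok = is_valid_utf8(raw)      # the original sequence is valid
spaced_ok = is_valid_utf8(spaced)   # the lead byte is now followed by a space
```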
    
  • 2021-02-06 07:26

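    Use codecs.open instead; it works for me.

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    import codecs

    # Header line followed by a blank line
    print """Content-Type: text/plain; charset="UTF-8"\n"""
    # codecs.open decodes the file, so each line is a unicode object
    f = codecs.open('in','r','utf8')
    for line in f:
        print line
        for i in line:
            print i,
    f.close()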
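
    The effect of codecs.open can be checked without any printing. The following is a sketch under a few assumptions: it writes its own temporary file instead of reading the 'in' file from the answer, the variable names are illustrative, and it is written to run on Python 2.6+ as well as Python 3.

```python
import codecs
import os
import tempfile

# Create a throwaway UTF-8 file (a stand-in for the question's input file).
fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
out = codecs.open(path, 'w', 'utf8')
out.write(u'St\xe4ck\xf6v\xe9rfl\xf8w\n')
out.close()

# Reading through codecs.open returns decoded text, not raw bytes.
f = codecs.open(path, 'r', 'utf8')
line = f.readline()
f.close()
os.remove(path)

# Iterating the decoded line yields one whole character per step.
chars = list(line.rstrip(u'\n'))
```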
    
  • 2021-02-06 07:36

    You may just want to decode each line as you read it:

    f = open('in.txt','r')
    for line in f:
        print line
        for i in line.decode('utf-8'):
            print i,
    f.close()
    
  • 2021-02-06 07:38
    for i in line:
        print i,
    

    When you read the file, the string you read in is a string of bytes. The for loop iterates over a single byte at a time. This causes problems with a UTF-8 encoded string, where non-ASCII characters are represented by multiple bytes. If you want to work with Unicode objects, where the characters are the basic pieces, you should use

    import codecs
    f = codecs.open('in', 'r', 'utf8')
    

    If sys.stdout doesn't already have the appropriate encoding set, you may have to wrap it:

    import sys
    sys.stdout = codecs.getwriter('utf8')(sys.stdout)
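
    What the wrapper does can be seen by pointing it at an in-memory byte buffer. In this sketch io.BytesIO is only a stand-in for a byte-oriented sys.stdout, the names buf and writer are illustrative, and the io module assumes Python 2.6 or later.

```python
import codecs
import io

buf = io.BytesIO()                      # stand-in for a byte-oriented stdout
writer = codecs.getwriter('utf8')(buf)  # same wrapping as in the line above

# Text written to the wrapper is encoded to UTF-8 before reaching the stream.
writer.write(u'St\xe4ck')
encoded = buf.getvalue()
```

    The wrapped stream accepts unicode text and the underlying stream only ever receives the encoded bytes.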
    