UTF-8 problem in Python when reading chars


I'm using Python 2.5. What is going on here? What have I misunderstood? How can I fix it?

in.txt:

Stäckövérfløw

code.py


5 answers

粉色の甜心

2021-02-06 07:22

Check this out:

# -*- coding: utf-8 -*-
import pprint
f = open('unicode.txt','r')
for line in f:
    print line
    pprint.pprint(line)
    for i in line:
        print i,
f.close()

It returns this:

Stäckövérfløw
'St\xc3\xa4ck\xc3\xb6v\xc3\xa9rfl\xc3\xb8w'
S t ? ? c k ? ? v ? ? r f l ? ? w

The thing is that the file is just being read as a string of bytes. Iterating over them splits the multibyte characters into nonsensical byte values.
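To make the byte/character distinction concrete, here is a minimal sketch. It assumes Python 3 (the question uses Python 2.5, where the corresponding types are str and unicode), and the variable names are illustrative only.

```python
# Assumption: Python 3, where bytes and str are distinct types.
# The escapes spell the word from in.txt (a-umlaut, o-umlaut, e-acute, o-slash).
text = "St\u00e4ck\u00f6v\u00e9rfl\u00f8w"
raw = text.encode("utf-8")        # the bytes that are actually stored in the file

# Each of the four non-ASCII letters takes two bytes in UTF-8,
# so the byte string is longer than the text.
byte_count = len(raw)
char_count = len(raw.decode("utf-8"))

print(byte_count, char_count)
```

Looping over raw visits every byte separately, which is why each accented letter shows up as two unprintable pieces in the output above.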


难免孤独

2021-02-06 07:22

print c,

adds a blank character after each byte and so breaks correct UTF-8 sequences into incorrect ones. This will not work unless you write each single byte to the output directly:

sys.stdout.write(i)
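The effect of the separator can be sketched roughly as follows. This is an illustration in Python 3 (an assumption; the answer itself targets Python 2), using a single character rather than the whole file.

```python
# Assumption: Python 3. "\u00e4" is the a-umlaut from in.txt.
raw = "\u00e4".encode("utf-8")             # two bytes: 0xC3 0xA4

# Split into one-byte pieces, as the byte-wise loop does.
pieces = [raw[i:i + 1] for i in range(len(raw))]

# Writing the pieces back to back keeps the sequence valid.
contiguous = b"".join(pieces)
intact = contiguous.decode("utf-8")

# Putting a space between them (what "print c," does) breaks it.
spaced = b" ".join(pieces)
try:
    spaced.decode("utf-8")
    broken = False
except UnicodeDecodeError:
    broken = True

print(repr(intact), broken)
```

A terminal receiving the spaced bytes cannot reassemble the character, so it shows a placeholder for each stray byte.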


逝去的感伤

2021-02-06 07:26

Use codecs.open instead; it works for me.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import codecs

print """Content-Type: text/plain; charset="UTF-8"\n"""
f = codecs.open('in','r','utf8')
for line in f:
    print line
    for i in line:
        print i,
f.close()
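For readers who want to try the idea in isolation, here is a small self-contained sketch. It assumes Python 3, where codecs.open is still available (the built-in open with an encoding argument is the more common choice there), and it writes a temporary file as a stand-in for the question's input file.

```python
import codecs
import os
import tempfile

# Assumption: Python 3; the temporary file stands in for the input file.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as out:
    out.write("St\u00e4ck\u00f6v\u00e9rfl\u00f8w\n".encode("utf-8"))

# codecs.open decodes while reading, so each line is already text
# and iterating over it yields whole characters.
f = codecs.open(path, "r", "utf8")
try:
    chars = [c for line in f for c in line.rstrip("\n")]
finally:
    f.close()
    os.remove(path)

print(len(chars))
```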


忘了有多久

2021-02-06 07:36

One may want to just use

f = open('in.txt','r')
for line in f:
    print line
    for i in line.decode('utf-8'):
        print i,
f.close()
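The same decode-then-iterate step, sketched without a file. This assumes Python 3, and the line variable is a hypothetical stand-in for one raw line read from in.txt.

```python
# Assumption: Python 3; `line` stands in for one raw line read from the file.
line = b"St\xc3\xa4ck\xc3\xb6v\xc3\xa9rfl\xc3\xb8w\n"

# Decode the whole line once, then iterate over characters instead of bytes.
decoded = line.decode("utf-8")
chars = list(decoded.rstrip("\n"))

print(len(line), len(chars))
```

The decode has to be applied to the whole line (or at least to complete byte sequences); decoding one byte at a time would fail on the first half of a two-byte character.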


感动是毒

2021-02-06 07:38

for i in line: print i,

When you read the file, the string you read in is a string of bytes. The for loop iterates over a single byte at a time. This causes problems with a UTF-8 encoded string, where non-ASCII characters are represented by multiple bytes. If you want to work with Unicode objects, where the characters are the basic pieces, you should use

import codecs
f = codecs.open('in', 'r', 'utf8')

If sys.stdout doesn't already have the appropriate encoding set, you may have to wrap it:

sys.stdout = codecs.getwriter('utf8')(sys.stdout)
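What the wrapper does can be sketched on its own. This assumes Python 3 and uses an in-memory io.BytesIO buffer as a stand-in for a byte-oriented stdout like the one Python 2 provides; the names are illustrative.

```python
import codecs
import io

# Assumption: Python 3. The BytesIO buffer plays the role of a
# byte-oriented output stream.
buffer = io.BytesIO()
writer = codecs.getwriter("utf8")(buffer)

# The writer accepts text and encodes it before it reaches the stream.
writer.write("fl\u00f8w")
encoded = buffer.getvalue()

print(encoded)
```

The wrapped stream therefore receives complete UTF-8 sequences, however the text was produced.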
