Python3 - ascii/utf-8/iso-8859-1 can't decode byte 0xe5 (Swedish characters)

Submitted by 一个人想着一个人 on 2019-12-11 21:23:15

Question


I've tried io, repr(), etc.; none of them work!

Problem inputting å (\xe5):

(None of these work)

import sys
print(sys.stdin.read(1))


import io
sys.stdin = io.TextIOWrapper(sys.stdin.detach(), errors='replace', encoding='iso-8859-1', newline='\n')
print(sys.stdin.read(1))


x = sys.stdin.buffer.read(1)
print(x.decode('utf-8'))

They all give me roughly UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe5 in position 0: unexpected end of data

I also tried running export PYTHONIOENCODING=utf-8 before starting Python; that doesn't work either.


Now, here's where I'm at:

import sys, codecs
sys.stdout = codecs.getwriter("utf-8")(sys.stdout.detach())
sys.stdin = codecs.getwriter("utf-8")(sys.stdin.detach())

x = sys.stdin.read(1)

print(x.decode('utf-8', 'replace'))

This gives me: �
It's close...

How can I take a \xe5 and turn it into å in my console without breaking input() as well? The solution above breaks it.

Note: I know this has been asked before, but none of those answers solve it, especially not the ones using io.


Some info about my system:

os.environ['LANG'] == 'C'
sys.getdefaultencoding() == 'utf-8'
sys.stdout.encoding == 'ANSI_X3.4-1968'
sys.stdin.encoding == 'ANSI_X3.4-1968'

My OS: Arch Linux running xterm
Running locale -a gives me: C | POSIX | sv_SE.utf8

I've followed these:

  • Python 3: How to specify stdin encoding
  • http://python-notes.curiousefficiency.org/en/latest/python3/binary_protocols.html
  • http://wolfprojects.altervista.org/talks/unicode-and-python-3/
  • http://getpython3.com/diveintopython3/strings.html
  • Python 3 - Encode/Decode vs Bytes/Str
  • How to set sys.stdout encoding in Python 3?
  • http://docs.python.org/3.0/howto/unicode.html

(and about 50 more)

Solution (sort of, still breaks input())

sys.stdout = codecs.getwriter("latin-1")(sys.stdout.detach())
sys.stdin = codecs.getwriter("latin-1")(sys.stdin.detach())

x = sys.stdin.read(1)

print(x.decode('latin-1', 'replace'))

Answer 1:


You are running this in xterm, which does not support UTF-8 by default. Run it as xterm -u8 or use uxterm to fix that.

The other way to work around that is to use a different locale; set your locale to Latin-1, for example:

export LANG=sv_SE.ISO-8859-1

but then you are limited to 256 code points, versus the full range of the Unicode standard (1,114,112 code points).
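
To make the 256-code-point limit concrete, here is a minimal sketch. The helper name fits_latin1 is made up for this illustration and is not part of any library; the characters are written as escapes so the snippet does not depend on the terminal's encoding.

```python
def fits_latin1(text):
    """Return True if every character in text can be encoded as Latin-1."""
    try:
        text.encode('latin-1')
        return True
    except UnicodeEncodeError:
        return False

# U+00E5 (a with ring above) lies inside the Latin-1 range 0x00-0xFF.
print(fits_latin1('\xe5'))
# U+20AC (euro sign) lies outside it, so a Latin-1 locale cannot represent it.
print(fits_latin1('\u20ac'))
```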

Note that Python 2 never decoded the input; writing out what you read from the terminal will look fine because the raw bytes you read are interpreted by the terminal in the same locale; reading and writing Latin-1 bytes works just fine. That's not quite the same as processing Unicode data, however.
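
As a rough illustration of why the same byte behaves differently under the two encodings, the sketch below decodes the single byte from the question both ways. The variable names are only for this example.

```python
raw = b'\xe5'  # the single byte a Latin-1 terminal sends for the character in the question

# Latin-1 maps each byte 0x00-0xFF directly to the code point with the same number.
as_latin1 = raw.decode('latin-1')

# In UTF-8, 0xE5 is the lead byte of a three-byte sequence, so on its own it is
# incomplete and decoding fails ("unexpected end of data").
try:
    raw.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# A terminal running in UTF-8 mode would have sent two bytes for the same character.
as_utf8_bytes = as_latin1.encode('utf-8')

print(ascii(as_latin1), utf8_ok, as_utf8_bytes)
```

This matches the error quoted in the question: the bytes arriving on stdin are Latin-1, while the decoder expects UTF-8.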




Answer 2:


(Sorry Martijn, you're awesome, but) I just hate it when you have to work around an issue and blame it on something else instead of fixing it in code.

And here's the solution to the poison that is Python3:

import sys, codecs
sys.stdout = codecs.getwriter("latin-1")(sys.stdout.detach())
sys.stdin = codecs.getwriter("latin-1")(sys.stdin.detach())
sys.stdout.write(sys.stdin.read(1).decode('latin-1', 'replace'))

Not only does this let you choose an encoding that matches your terminal's, it also requires no outside configuration (such as export LANG=sv_SE.ISO-8859-1).

The only downside:

input('something: ')

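will break. A fix for that is:

# Since it's bad practice to give a function the same name as
# a builtin, we'll call it something we're used to from
# Python 2 that isn't in use any more.
def raw_input(txt):
    sys.stdout.write(txt)
    sys.stdout.flush()
    sys.stdin.flush()
    # The wrapped stdin hands back raw bytes, so decode them before
    # returning, as in the read(1) example above.
    return sys.stdin.readline().decode('latin-1', 'replace').strip()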

And all is well in paradise; I LOVE to stick it to the man (Python 3).
A big thanks to Martijn for telling why and that in fact the data is latin-1!
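
For comparison, a more conventional way to get decoded text from a byte stream is io.TextIOWrapper. The following is only a sketch under the assumption that the incoming bytes really are Latin-1; the helper name wrap_as_text and the in-memory fake_stdin are made up for this example so that it runs without a terminal.

```python
import io

def wrap_as_text(binary_stream, encoding='latin-1'):
    """Wrap a binary stream in a text layer that decodes with the given encoding."""
    return io.TextIOWrapper(binary_stream, encoding=encoding,
                            errors='replace', newline='\n')

# Simulated terminal input: the Latin-1 byte 0xE5 followed by a newline.
fake_stdin = io.BytesIO(b'\xe5\n')

# readline() now returns str, so no manual .decode() call is needed.
line = wrap_as_text(fake_stdin).readline().rstrip('\n')
print(ascii(line))
```

On a real terminal the same idea would be sys.stdin = wrap_as_text(sys.stdin.detach()), or sys.stdin.reconfigure(encoding='latin-1') on Python 3.7 and later. Because readline() on such a wrapper returns str, input() should be able to keep working, though that is worth verifying on the system in question. With LANG=C, sys.stdout is still ASCII, so it would need the same treatment before the character can be printed.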



Source: https://stackoverflow.com/questions/18260859/python3-ascii-utf-8-iso-8859-1-cant-decode-byte-0xe5-swedish-characters
