I have a socket server that is supposed to receive valid UTF-8 characters from clients.
The problem is some clients (mainly hackers) are sending all the wrong kind of
http://docs.python.org/howto/unicode.html#the-unicode-type
str = unicode(str, errors='replace')
or
str = unicode(str, errors='ignore')
Note: errors='ignore' will strip out the characters in question, returning the string without them.
For me this is the ideal case, since I'm using it as protection against non-ASCII input, which is not allowed by my application.
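The unicode() calls above are Python 2. As a minimal sketch of the same two error handlers in Python 3, assuming the socket data arrives as a bytes object (the sample bytes below are made up for illustration):

```python
# Hypothetical payload: valid UTF-8 text with one stray byte (0xff),
# which can never appear in well-formed UTF-8.
raw = b"hello \xff world"

# errors="replace" substitutes U+FFFD (the replacement character) for the bad byte.
replaced = raw.decode("utf-8", errors="replace")

# errors="ignore" silently drops the bad byte.
ignored = raw.decode("utf-8", errors="ignore")

print(ascii(replaced))
print(ascii(ignored))
```

Which handler is preferable depends on whether you want evidence left behind that something was removed: 'replace' leaves a visible marker, 'ignore' does not.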
Alternatively, use the open function from the codecs module to read in the file:
import codecs
with codecs.open(file_name, 'r', encoding='utf-8', errors='ignore') as fdata:
    data = fdata.read()  # undecodable bytes are dropped silently
Changing the engine from C to Python did the trick for me.
Engine is C:
pd.read_csv(gdp_path, sep='\t', engine='c')
'utf-8' codec can't decode byte 0x92 in position 18: invalid start byte
Engine is Python:
pd.read_csv(gdp_path, sep='\t', engine='python')
No errors for me.
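Byte 0x92 is not a valid UTF-8 start byte, but in Windows-1252 it is the right single quotation mark, so one possible cause (an assumption, since the file's real encoding is not stated here) is a file saved as cp1252. A minimal sketch of trying candidate encodings in order, using only the standard library; decode_with_fallback and the sample bytes are hypothetical names for illustration:

```python
def decode_with_fallback(raw, encodings=("utf-8", "cp1252")):
    """Return (text, encoding) for the first candidate that decodes raw."""
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a code point, so this last resort cannot fail.
    return raw.decode("latin-1"), "latin-1"


# Hypothetical tab-separated sample containing the 0x92 byte from the error.
text, used = decode_with_fallback(b"GDP\tdon\x92t know")
print(used)
print(ascii(text))
```

If the guess turns out to be right for your file, pd.read_csv also accepts an encoding argument (for example encoding='cp1252'), which may be a more direct fix than switching engines.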
I had the same problem with UnicodeDecodeError and I solved it with this line.
I don't know if it is the best way, but it worked for me.
str = str.decode('unicode_escape').encode('utf-8')
What can you do if you need to make a change to a file, but don’t know the file’s encoding? If you know the encoding is ASCII-compatible and only want to examine or modify the ASCII parts, you can open the file with the surrogateescape error handler:
with open(fname, 'r', encoding="ascii", errors="surrogateescape") as f:
    data = f.read()
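A short sketch of why surrogateescape is useful for this kind of edit: undecodable bytes become lone surrogates that turn back into the original bytes when you encode with the same handler. It works on an in-memory bytes value rather than a file, and the sample data is invented:

```python
# Hypothetical file contents: ASCII text plus one non-ASCII byte (0xe9).
raw = b"name=caf\xe9\n"

# The undecodable byte 0xe9 is smuggled through as the lone surrogate U+DCE9.
text = raw.decode("ascii", errors="surrogateescape")

# Modify only the ASCII part.
edited = text.replace("name=", "title=")

# Encoding with the same handler restores the original byte unchanged.
out = edited.encode("ascii", errors="surrogateescape")

print(ascii(text))
print(out)
```

When writing such data back with open(), the output file needs errors="surrogateescape" as well, otherwise the lone surrogates cannot be encoded.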
>>> '\x9c'.decode('cp1252')
u'\u0153'
>>> print '\x9c'.decode('cp1252')
œ
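The session above is Python 2, where a str literal holds bytes. A rough Python 3 equivalent, as a sketch, uses a bytes literal instead:

```python
raw = b"\x9c"

# As UTF-8 this byte is invalid on its own, so decoding raises.
try:
    raw.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# As Windows-1252 the same byte is U+0153 (LATIN SMALL LIGATURE OE).
text = raw.decode("cp1252")

print(utf8_failed)
print(ascii(text))
```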
This type of issue crops up for me now that I've moved to Python 3. I had no idea Python 2 was simply steamrolling over any issues with file encoding.
I found this nice explanation of the differences and how to find a solution after none of the above worked for me.
http://python-notes.curiousefficiency.org/en/latest/python3/text_file_processing.html
In short, to make Python 3 behave as similarly as possible to Python 2, use:
with open(filename, encoding="latin-1") as datafile:
    # work on datafile here
However, read the article; there is no one-size-fits-all solution.
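A small sketch of why latin-1 never raises, and what it costs: every byte value maps to the code point with the same number, so decoding always succeeds and round-trips, but text that was really UTF-8 comes out as mojibake. The sample values are chosen for illustration:

```python
# All 256 possible byte values decode without error and round-trip exactly.
every_byte = bytes(range(256))
text = every_byte.decode("latin-1")
round_trip = text.encode("latin-1")

# The price: UTF-8 data read as latin-1 is silently misinterpreted.
# U+00E9 (e with acute) is two bytes in UTF-8, so it becomes two characters.
mojibake = "\u00e9".encode("utf-8").decode("latin-1")

print(round_trip == every_byte)
print(ascii(mojibake))
```

So latin-1 is a safe choice when the goal is to pass bytes through untouched, and a risky one when the decoded text is shown to users or compared with properly decoded strings.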