I'm trying to read a text file into Python, but it seems to use some very strange encoding. I tried the usual:
file = open('data.txt','r')
lines = file.
Looks like UTF-16 to me.
>>> test_utf16 = '0\x00.\x000\x002\x000\x000\x001\x009\x007\x00'
>>> test_utf16.decode('utf-16')
u'0.0200197'
You can work directly with the decoded Unicode strings (the raw bytes will not parse):
>>> float(test_utf16)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: null byte in argument for float()
>>> float(test_utf16.decode('utf-16'))
0.020019700000000001
Or encode them to something different, if you prefer:
>>> float(test_utf16.decode('utf-16').encode('ascii'))
0.020019700000000001
Note that you need to do this as early as possible in your processing. As your comment noted, split will behave incorrectly on the UTF-16 encoded form. The UTF-16 representation of the space character ' ' is ' \x00', so split removes the whitespace but leaves the null byte.
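To make the split problem concrete, here is a small sketch. The sample string '0.5 1.5' and the variable names are made up for illustration, and the bytes are encoded as little-endian UTF-16 without a BOM to keep the output short:

```python
# Hypothetical sample: two numbers separated by a space, as raw UTF-16-LE bytes.
raw = u'0.5 1.5'.encode('utf-16-le')

# Splitting the raw bytes removes the space byte but keeps its trailing null,
# which ends up glued to the front of the next token.
split_raw = raw.split()

# Decoding first gives clean tokens that float() can parse.
split_text = raw.decode('utf-16-le').split()
numbers = [float(token) for token in split_text]
```

Here the second element of split_raw starts with a stray '\x00', which is why float() rejects it, while split_text contains only the two clean tokens.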
The io library (Python 2.6 and later) can handle this for you, as can the older codecs library. io handles linefeeds better, so it's preferable if available.
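A minimal sketch of the io approach, assuming the file really is UTF-16 with a byte order mark and holds whitespace-separated numbers. The helper name read_utf16_floats and the sample file contents are made up for illustration:

```python
import io
import os
import tempfile

def read_utf16_floats(path):
    """Return all whitespace-separated floats found in a UTF-16 text file."""
    values = []
    # io.open decodes while reading, so split() sees text rather than raw bytes.
    with io.open(path, 'r', encoding='utf-16') as handle:
        for line in handle:
            values.extend(float(token) for token in line.split())
    return values

# Write a small sample file so the sketch can be run on its own.
sample_path = os.path.join(tempfile.mkdtemp(), 'data.txt')
with io.open(sample_path, 'w', encoding='utf-16') as handle:
    handle.write(u'0.0200197 1.5\n2.25\n')

values = read_utf16_floats(sample_path)
```

The older codecs.open(path, 'r', encoding='utf-16') can be swapped in for io.open when io is not available.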