Read a file in python having rogue byte 0xc0 that causes utf-8 and ascii to error out

白昼怎懂夜的黑 submitted on 2019-12-12 14:25:39

Question


Trying to read a tab-separated file into pandas dataframe:

>>> df = pd.read_table(fn, na_filter=False, error_bad_lines=False)

It errors out like so:

b'Skipping line 58: expected 11 fields, saw 12\n'
Traceback (most recent call last):
...(many lines)...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc0 in position 115: invalid start byte

It seems the byte 0xc0 causes an error under both the utf-8 and ascii encodings:

>>> df = pd.read_table(fn, na_filter=False, error_bad_lines=False, encoding='ascii')
...(many lines)...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc0 in position 115: ordinal not in range(128)

I ran into the same issue with the csv module's reader too.

If I import the file into OpenOffice Calc, it is imported properly and the columns are recognized correctly; presumably the offending 0xc0 byte is ignored there. It is not some vital piece of the data, probably just a fluke write error by the system that generated the file. I would be happy to zap the line where this occurs if it comes to that; I just want to read the file into the Python program. The error_bad_lines=False option of pandas ought to have taken care of this problem, but no dice. Also, the file does NOT have any content in non-English scripts that would make Unicode necessary; it is all standard English letters and numbers. I tried utf-16, utf-32, etc. too, but they only caused more errors of their own.

How can I make Python (a pandas DataFrame in particular) read a file containing one or more rogue 0xc0 bytes?
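For what it's worth, the failure can be reproduced without pandas. The byte 0xc0 is an invalid UTF-8 start byte and falls outside the 7-bit ASCII range, so both codecs reject it; here is a minimal sketch with a made-up line of data:

```python
# Hypothetical miniature of a line from the problem file,
# with the rogue 0xc0 byte embedded in one field.
raw = b"col1\tcol2\tbad\xc0value\n"

for codec in ("utf-8", "ascii"):
    try:
        raw.decode(codec)
    except UnicodeDecodeError as exc:
        # utf-8 rejects 0xc0 as an invalid start byte;
        # ascii rejects it because it is outside the 0-127 range.
        print(f"{codec}: {exc.reason} at position {exc.start}")
```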


Answer 1:


Moving this answer here from another place where it got a hostile reception.

Found one encoding that actually accepts (meaning, doesn't error out on) byte 0xc0:

encoding="ISO-8859-1"  

Note: this entails making sure the rest of the file doesn't contain characters outside the Latin-1 range either. It may be helpful for folks like me who had no such characters in their file anyway and just wanted Python to load the damn thing while both the utf-8 and ascii encodings were erroring out.
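The reason this works: ISO-8859-1 (Latin-1) assigns a character to every one of the 256 possible byte values, so decoding arbitrary bytes can never raise UnicodeDecodeError. A quick sketch:

```python
# ISO-8859-1 maps every byte value 0-255 to a character,
# so decoding can never fail, whatever bytes the file contains.
every_byte = bytes(range(256))
text = every_byte.decode("iso-8859-1")
print(len(text))  # 256 -- one character per byte, nothing rejected

# The rogue byte is not dropped; it decodes to the Latin-1 character 'À'.
print(b"\xc0".decode("iso-8859-1"))
```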

More on ISO-8859-1 : What is the difference between UTF-8 and ISO-8859-1?

New command that works:

>>> df = pd.read_table(fn, na_filter=False, error_bad_lines=False, encoding='ISO-8859-1')

After reading it in, the dataframe is fine; the columns and data all work as they did in OpenOffice Calc. I still have no idea where the offending 0xc0 byte went, but it doesn't matter, as I've got the data I needed.
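As for where the byte went: it most likely didn't go anywhere. Under ISO-8859-1 the byte 0xc0 decodes to the character 'À', so it should still be sitting in one of the cells. A sketch of the round trip (using the csv module instead of pandas, on a made-up miniature of the file):

```python
import csv
import os
import tempfile

# Hypothetical miniature of the problem file: tab-separated,
# with a rogue 0xc0 byte embedded in one field.
payload = b"a\tb\tc\n1\t2\tok\xc0ish\n"

with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(payload)
    path = f.name

# encoding='iso-8859-1' accepts every byte, so the read succeeds.
with open(path, encoding="iso-8859-1", newline="") as f:
    rows = list(csv.reader(f, delimiter="\t"))
os.remove(path)

print(rows[1][2])  # the 0xc0 byte survives as the character 'À'
```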



Source: https://stackoverflow.com/questions/49845554/read-a-file-in-python-having-rogue-byte-0xc0-that-causes-utf-8-and-ascii-to-erro
