Pandas read_csv and UTF-16

隐瞒了意图╮ 2020-12-19 00:06

I have a CSV text file encoded in UTF-16 (to preserve Unicode characters when others open it in Excel), but when I call read_csv with Pandas 0.9.0, I get this cryptic error:

3 Answers
  • 2020-12-19 00:50

    This is a bug. I think it happens because the csv reader passes back an extra empty line at the beginning. It worked for me on Python 2.7.3 and pandas 0.9.1 when I did the following (here fh is the UTF-16 file opened in binary mode):

    In [36]: pd.read_csv(BytesIO(fh.read().decode('UTF-16').encode('UTF-8')), sep='\t', header=0)
    Out[36]: 
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 50 entries, 0 to 49
    Data columns:
    Country                             43  non-null values
    State/City                          43  non-null values
    Title                               43  non-null values
    Date                                43  non-null values
    Catalogue                           43  non-null values
    Wikipedia Election Page             43  non-null values
    Wikipedia Individual Page           43  non-null values
    Electoral Institution in Country    43  non-null values
    Twitter                             43  non-null values
    CANDIDATE NAME 1                    43  non-null values
    CANDIDATE NAME 2                    16  non-null values
    dtypes: object(11)
    

    I reported the bug here: https://github.com/pydata/pandas/issues/2418. On GitHub master it unfortunately causes a segfault in the C parser. We'll fix it.

    Now, interestingly: https://softwareengineering.stackexchange.com/questions/102205/should-utf-16-be-considered-harmful ;)
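The decode-then-re-encode workaround above can be sketched for Python 3 using only the standard library. This is a minimal sketch under stated assumptions: the sample table, the temporary file, and the variable names are hypothetical stand-ins for the asker's data, and the `csv` module is used only to show that the transcoded bytes parse cleanly.

```python
import csv
import io
import os
import tempfile

# Hypothetical sample: a small tab-separated table with a non-ASCII name.
sample = "Country\tTitle\tCANDIDATE NAME 1\nVenezuela\tPresident\tHugo Chávez\n"

# Write the sample as UTF-16; Python prepends a byte order mark (BOM).
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with open(path, "w", encoding="utf-16", newline="") as out:
    out.write(sample)

# Read the raw bytes, decode them as UTF-16 (which consumes the BOM),
# then re-encode the text as UTF-8.
with open(path, "rb") as fh:
    utf8_bytes = fh.read().decode("utf-16").encode("utf-8")
os.remove(path)

# Parse the UTF-8 bytes as tab-delimited text to check the round trip.
buffer = io.BytesIO(utf8_bytes)
text_stream = io.TextIOWrapper(buffer, encoding="utf-8", newline="")
rows = list(csv.reader(text_stream, delimiter="\t"))
```

A fresh `io.BytesIO(utf8_bytes)` plays the same role as the `BytesIO(...)` argument passed to `pd.read_csv` in the session above.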

  • 2020-12-19 00:52
    # Python 2: the str literals below are byte strings holding UTF-8 encoded text.
    from StringIO import StringIO
    import pandas as pd
    
    a = ['Venezuela', 'N/A', 'President', '10/7/12', 'Hugo Rafael Chavez Frias', 'Hugo Ch\xc3\xa1vez', 'Hugo Ch\xc3\xa1vez', 'Hugo Chavez', 'Hugo Ch\xc3\xa1vez Fr\xc3\xadas', 'Hugo Chavez', 'Hugo Ch\xc3\xa1vez']
    
    # Join the fields with tabs and parse the result as tab-delimited input.
    pd.read_csv(StringIO('\t'.join(a)), delimiter='\t')
    

    This works here. Can you upload the head of your data so I can test?

  • 2020-12-19 01:11

    Python3:

    import pandas as pd

    with open('data.txt', encoding='UTF-16') as f:
        df = pd.read_csv(f)
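As a variation on the Python 3 approach, `read_csv` also accepts an `encoding` argument, so the file does not have to be opened by hand. The sketch below assumes a reasonably recent pandas; the sample table and temporary file are hypothetical, and `sep="\t"` is used because the data in this thread is tab-separated.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample: a small tab-separated table with a non-ASCII name.
sample = "Country\tTitle\tCANDIDATE NAME 1\nVenezuela\tPresident\tHugo Chávez\n"

# Write the sample as UTF-16 so there is a file to read back.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with open(path, "w", encoding="utf-16", newline="") as out:
    out.write(sample)

# Let pandas decode the file itself by naming the encoding.
df = pd.read_csv(path, sep="\t", encoding="utf-16")
os.remove(path)
```

Passing the path with `encoding` keeps the decoding inside pandas, while the `open(...)` form above hands pandas an already decoded text stream. Either should yield the same frame for well-formed input.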
    