Pandas dataframe and character encoding when reading excel file

时光说笑 2021-02-05 15:24

I am reading an Excel file that contains several numerical and categorical columns. The column name_string contains characters in a foreign language. When I try to see the content of

1 answer
  •  庸人自扰
    2021-02-05 15:53

    Actually, the data is being parsed correctly into unicode objects, not strs. The u prefix indicates that the objects are unicode. When a list, tuple, or NumPy array is printed, Python shows the repr of each item in the sequence. So instead of the printed version of the unicode, you see its repr:

    In [160]: repr(u'Cristina Fern\xe1ndez de Kirchner')
    Out[160]: "u'Cristina Fern\\xe1ndez de Kirchner'"
    
    In [156]: print(u'Cristina Fern\xe1ndez de Kirchner')
    Cristina Fernández de Kirchner
    

    The purpose of the repr is to provide an unambiguous string representation of each object. The printed version of a unicode can be ambiguous because of invisible or unprintable characters.
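To make the ambiguity concrete, here is a minimal sketch, assuming Python 3 (where every str is Unicode and the u prefix is optional). The variable names `plain` and `padded` are made up for illustration; the second value carries a zero-width space that print cannot show but repr escapes.

```python
plain = 'Abab'
padded = 'Abab\u200b'  # same letters plus a zero-width space (U+200B)

# print() uses str(): the two values look identical on screen.
print(plain)
print(padded)

# repr() escapes the invisible character, so the difference becomes visible.
print(repr(plain))
print(repr(padded))

# A list shows the repr of each item, which is why sequences look "escaped".
print([plain, padded])
```

Note that Python 3's repr leaves printable accented letters such as á unescaped, so the \xe1 seen in the session above is specific to Python 2.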

    If you print the DataFrame or Series, however, you'll get the printed version of the unicodes:

    In [157]: df = pd.DataFrame({'foo': np.array([
       .....:     u'4th of July', u'911', u'Abab', u'Abass', u'Abcar', u'Abced',
       .....:     u'Ceded', u'Cedes', u'Cedfus', u'Ceding', u'Cedtim', u'Cedtol',
       .....:     u'Cedxer', u'Chevrolet Corvette', u'Chuck Norris',
       .....:     u'Cristina Fern\xe1ndez de Kirchner'], dtype=object)})
    In [158]: df
    Out[158]: 
                                   foo
    0                      4th of July
    1                              911
    2                             Abab
    3                            Abass
    4                            Abcar
    5                            Abced
    6                            Ceded
    7                            Cedes
    8                           Cedfus
    9                           Ceding
    10                          Cedtim
    11                          Cedtol
    12                          Cedxer
    13              Chevrolet Corvette
    14                    Chuck Norris
    15  Cristina Fernández de Kirchner
    
    [16 rows x 1 columns]
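
For comparison, a short sketch of the same contrast, assuming Python 3 with pandas installed. It reuses two of the names from the session above; `table` and `as_list` are illustrative variable names only. Printing the DataFrame shows the text itself, a list of the values shows the quoted repr of each item, and looping over the column prints each value as plain text.

```python
import pandas as pd

names = ['Chuck Norris', 'Cristina Fern\xe1ndez de Kirchner']
df = pd.DataFrame({'foo': names})

# The DataFrame display renders each cell as text, without quotes or escapes.
table = df.to_string()
print(table)

# A list of the same values is displayed using the repr of each item (quoted).
as_list = repr(df['foo'].tolist())
print(as_list)

# Printing the items one at a time gives the plain text of each value.
for name in df['foo']:
    print(name)
```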
    
