Using read_excel with converters for reading Excel file into Pandas DataFrame results in a numeric column of object type

囚心锁ツ 2020-12-21 16:16

I am reading the United Nations Energy Indicators Excel file using the code snippet below:

def convert_energy(energy):
    if isinstance(energy, float):
             


        
3 Answers
  •  生来不讨喜
    2020-12-21 17:02

    Let's remove the converters argument for a moment:

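    import pandas as pd

    c = ['Energy Supply', 'Energy Supply per Capita', '% Renewable']
    # Note: the keyword is `skipfooter`; the older `skip_footer` spelling
    # is no longer accepted by recent pandas versions.
    df = pd.read_excel("Energy Indicators.xls", 
                       skiprows=17, 
                       skipfooter=38, 
                       usecols=[2,3,4,5], 
                       na_values=['...'], 
                       names=c,
                       index_col=[0])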
    
    df.index.name = 'Country'
    
    df.head()    
                    Energy Supply  Energy Supply per Capita  % Renewable
    Country                                                             
    Afghanistan             321.0                      10.0    78.669280
    Albania                 102.0                      35.0   100.000000
    Algeria                1959.0                      51.0     0.551010
    American Samoa            NaN                       NaN     0.641026
    Andorra                   9.0                     121.0    88.695650
    
    df.dtypes
    
    Energy Supply               float64
    Energy Supply per Capita    float64
    % Renewable                 float64
    dtype: object
    

    Your data loads just fine without a converter. The reason lies in how a converter interacts with pandas' type inference.

    By default, pandas reads in the column and tries to infer a suitable type for your data. When you specify your own converter, you override that inference for the column, so it no longer happens.

    pandas passes integer and string values to convert_energy, so isinstance(energy, float) never evaluates to True. Instead, the else branch runs and the values are returned as is, so the resulting column is a mixture of strings and integers. Putting a print(type(energy)) inside your function makes this obvious.

    Since the column holds a mixture of types, its dtype is object. If you do not use a converter, however, pandas infers the types itself and successfully parses the column as numeric.
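
    The effect can be reproduced without the Excel file. The sketch below is only an illustration: the body of convert_energy is an assumption reconstructed from the description above (a float check with a pass-through else), and raw_cells is a hypothetical stand-in for the integer and string values a converter might receive.

```python
import pandas as pd


def convert_energy(energy):
    # Assumed shape of the converter: scale floats, pass everything else through.
    if isinstance(energy, float):
        return energy * 1_000_000
    else:
        return energy


# Hypothetical raw cell values: integers plus the '...' missing-data marker.
raw_cells = [321, 102, "...", 1959]

# Apply the converter cell by cell, as a per-column converter would be applied.
converted = pd.Series([convert_energy(v) for v in raw_cells])

# None of the inputs is a float, so nothing is scaled and the types stay mixed.
type_names = sorted({type(v).__name__ for v in converted})
print(type_names)
print(converted.dtype)  # object
```

    Because every value falls through to the else branch, the Series keeps both int and str values, and pandas can only describe it with the object dtype.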

    So, after loading the data without a converter, simply doing

    df['Energy Supply'] *= 1000000
    

    would be more than enough.
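
    If a column has already ended up as object, one possible repair (a sketch, not necessarily what the asker needs) is to coerce it with pd.to_numeric before scaling. The sample values below are hypothetical; errors="coerce" turns anything unparseable, such as '...', into NaN.

```python
import pandas as pd

# Hypothetical object-dtype column, as produced by a pass-through converter.
energy = pd.Series([321, 102, "...", 1959], name="Energy Supply")

# Coerce to numeric; values that cannot be parsed become NaN.
numeric = pd.to_numeric(energy, errors="coerce")

# Vectorised scaling now works on the whole column at once.
scaled = numeric * 1_000_000

print(scaled.dtype)
print(scaled)
```

    The multiplication is the same operation as the one-liner above, applied after the column has been made numeric.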
