SciKit-Learn Label Encoder resulting in error 'argument must be a string or number'

不知归路 2021-01-12 13:47

I'm a bit confused - creating an ML model here.

I'm at the step where I'm trying to take categorical features from a "large" dataframe (180 columns) and one-hot

1 answer
  • 2021-01-12 14:44

    You get this error because the column does indeed contain a mix of floats and strings. Take a look at this example:

    # Preliminaries
    import pandas as pd
    from sklearn.preprocessing import LabelEncoder
    
    # Create DataFrames
    
    # df1 has all floats
    d1 = {'LockTenor':[60.0, 45.0, 15.0, 90.0, 75.0, 30.0]}
    df1 = pd.DataFrame(data=d1)
    print("DataFrame 1")
    print(df1)
    
    # df2 has a string in the mix
    d2 = {'LockTenor':[60.0, 45.0, 'z', 90.0, 75.0, 30.0]}
    df2 = pd.DataFrame(data=d2)
    print("DataFrame 2")
    print(df2)
    
    # Create encoder
    le = LabelEncoder()
    
    # Encode DataFrame 1 (where all values are floats)
    df1 = df1.apply(lambda col: le.fit_transform(col), axis=0, result_type='expand')
    print("DataFrame 1 encoded")
    print(df1)
    
    # Encode DataFrame 2 (where there is a combination of floats and strings)
    df2 = df2.apply(lambda col: le.fit_transform(col), axis=0, result_type='expand')
    print("DataFrame 2 encoded")
    print(df2)
    

    If you run this code, you will see that df1 is encoded with no problem, since all its values are floats. However, you will get the error that you are reporting for df2.

    An easy fix is to cast the column to strings. You can do this inside the corresponding lambda function:

    df2 = df2.apply(lambda col: le.fit_transform(col.astype(str)), axis=0, result_type='expand')
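
    One side effect of the one-liner above is that a single shared encoder only remembers the last column it was fitted on. A minimal sketch of a variant that keeps one encoder per column, so the labels can be mapped back later, might look like this (the names `encoders` and `encoded` are made up for illustration):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Same mixed float/string column as df2 above
df2 = pd.DataFrame({'LockTenor': [60.0, 45.0, 'z', 90.0, 75.0, 30.0]})

# Keep one fitted encoder per column so each mapping can be inverted later
encoders = {}
encoded = pd.DataFrame(index=df2.index)
for name in df2.columns:
    enc = LabelEncoder()
    # Cast to str first so the encoder sees a single type
    encoded[name] = enc.fit_transform(df2[name].astype(str))
    encoders[name] = enc

print(encoded)
print(list(encoders['LockTenor'].classes_))

# Recover the original (string) values from the integer labels
restored = encoders['LockTenor'].inverse_transform(encoded['LockTenor'])
print(list(restored))
```

    Keep in mind that after the cast the values are ordered as strings, not as numbers, so a value such as '100.0' would sort before '30.0'.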
    

    As an additional suggestion, I would recommend taking a look at your data to check that it is correct. To me, it is a bit odd to have a mix of floats and strings in the same column.
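
    One possible way to do that check, sketched here with a hypothetical helper `mixed_type_columns` (not part of pandas or scikit-learn), is to list the Python types found in each column and then pull out the rows that cannot be parsed as numbers:

```python
import pandas as pd

# Same mixed float/string column as df2 above
df2 = pd.DataFrame({'LockTenor': [60.0, 45.0, 'z', 90.0, 75.0, 30.0]})

def mixed_type_columns(df):
    """Return {column: set of type names} for columns holding more than one Python type."""
    report = {}
    for name in df.columns:
        kinds = set(df[name].dropna().map(lambda v: type(v).__name__))
        if len(kinds) > 1:
            report[name] = kinds
    return report

mixed = mixed_type_columns(df2)
print(mixed)

# Rows whose value is present but cannot be converted to a number
as_number = pd.to_numeric(df2['LockTenor'], errors='coerce')
bad_rows = df2[as_number.isna() & df2['LockTenor'].notna()]
print(bad_rows)
```

    With 180 columns, a report like this should narrow things down to the handful of columns that need cleaning before any encoding is applied.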

    Finally, I would just like to point out that scikit-learn's LabelEncoder performs a simple integer encoding of the values; it does not perform one-hot encoding. If you want one-hot encoding, I recommend taking a look at OneHotEncoder.
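
    As a rough illustration of the difference, here is a small sketch using OneHotEncoder on a made-up `color` column (the column and its values are assumptions for the example, not taken from the question):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical column, for illustration only
df = pd.DataFrame({'color': ['red', 'green', 'blue', 'green']})

# handle_unknown='ignore' encodes categories unseen during fit as all zeros
ohe = OneHotEncoder(handle_unknown='ignore')

# fit_transform expects 2D input and returns a sparse matrix by default
onehot = ohe.fit_transform(df[['color']]).toarray()

print(list(ohe.categories_[0]))
print(onehot)
```

    Each category gets its own 0/1 column instead of a single integer label, which avoids implying an order between categories.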
