What is a good heuristic to detect if a column in a pandas.DataFrame is categorical?

旧时难觅i 2021-02-01 18:12

I've been developing a tool that automatically preprocesses data in pandas.DataFrame format. During this preprocessing step, I want to treat continuous and categorical data dif

7 answers
  •  无人及你
    2021-02-01 18:39

    I've been looking at this and thought it may be useful to share what I have. This builds on @Rishabh Srivastava's answer.

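    import pandas as pd

    def remove_cat_features(X, method='fraction_unique', cat_cols=None, min_fraction_unique=0.05):
        """Removes categorical features using a given method.

        X: pd.DataFrame, dataframe to remove categorical features from.
        method: 'fraction_unique' or 'named_columns'.
        cat_cols: names of categorical columns (used by 'named_columns').
        min_fraction_unique: columns at or below this fraction of unique values are dropped.
        """
        if method == 'fraction_unique':
            # Fraction of distinct values per column (NaN counts as a value).
            unique_fraction = X.apply(lambda col: len(pd.unique(col)) / len(col))
            reduced_X = X.loc[:, unique_fraction > min_fraction_unique]
        elif method == 'named_columns':
            # Treat a missing cat_cols as "no categorical columns named".
            cat_cols = [] if cat_cols is None else cat_cols
            non_cat_cols = [col not in cat_cols for col in X.columns]
            reduced_X = X.loc[:, non_cat_cols]
        else:
            raise ValueError(f"Unknown method: {method!r}")

        return reduced_X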
    

    You can then call this function, passing a pandas DataFrame as X. It either removes the columns named in cat_cols (method='named_columns') or removes the columns whose fraction of unique values is at or below min_fraction_unique (method='fraction_unique').
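
    If you want the names of the likely categorical columns rather than a reduced DataFrame, the same fraction-of-unique-values heuristic can be sketched as below. This is only a rough illustration: the helper name likely_categorical_columns, its max_fraction_unique parameter, and the toy data are all made up for this example, and the 0.05 threshold is an assumption you would need to tune for your data.

```python
import pandas as pd

def likely_categorical_columns(df, max_fraction_unique=0.05):
    """Return names of columns whose share of distinct values is at or below the threshold.

    Hypothetical helper for illustration; the threshold is an assumption.
    """
    if len(df) == 0:
        # No rows means no fraction can be computed; flag nothing.
        return []
    # nunique(dropna=False) counts NaN as its own value.
    fraction_unique = df.nunique(dropna=False) / len(df)
    return fraction_unique[fraction_unique <= max_fraction_unique].index.tolist()

# Toy data: 100 rows, one low-cardinality column and one all-distinct column.
df = pd.DataFrame({
    "colour": ["red", "green", "blue", "red"] * 25,  # 3 distinct values out of 100
    "reading": [i * 0.5 for i in range(100)],        # 100 distinct values out of 100
})

categorical = likely_categorical_columns(df)
continuous = df.drop(columns=categorical)
```

    Keep in mind that a ratio like this depends on the number of rows: on a small DataFrame even a genuinely categorical column can exceed the threshold, so it is worth sanity-checking the result against the column dtypes.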
