What is a good heuristic to detect if a column in a pandas.DataFrame is categorical?

后端未结

关注

 7  672

旧时难觅i 2021-02-01 18:12

I\'ve been developing a tool that automatically preprocesses data in pandas.DataFrame format. During this preprocessing step, I want to treat continuous and categorical data dif

7条回答

无人及你 (楼主)

2021-02-01 18:39

I've been looking at this, thought it maybe useful to share what I have. This builds on @Rishabh Srivastava answer.

import pandas as pd

def remove_cat_features(X, method='fraction_unique', cat_cols=None, min_fraction_unique=0.05):
    """Removes categorical features using a given method.
       X: pd.DataFrame, dataframe to remove categorical features from."""

    if method=='fraction_unique':
        unique_fraction = X.apply(lambda col: len(pd.unique(col))/len(col)) 
        reduced_X = X.loc[:, unique_fraction>min_fraction_unique]

    if method=='named_columns':
        non_cat_cols = [col not in cat_cols for col in X.columns]
        reduced_X = X.loc[:, non_cat_cols]

    return reduced_X

You can then call this function, giving a pandas df as X and you can either remove named categorical columns or you can choose to remove columns with a low number of unique values (specified by min_fraction_unique).

0 讨论(0)

查看其它7个回答