I\'ve been developing a tool that automatically preprocesses data in pandas.DataFrame format. During this preprocessing step, I want to treat continuous and categorical data dif
I've been looking at this, thought it maybe useful to share what I have. This builds on @Rishabh Srivastava answer.
import pandas as pd
def remove_cat_features(X, method='fraction_unique', cat_cols=None, min_fraction_unique=0.05):
"""Removes categorical features using a given method.
X: pd.DataFrame, dataframe to remove categorical features from."""
if method=='fraction_unique':
unique_fraction = X.apply(lambda col: len(pd.unique(col))/len(col))
reduced_X = X.loc[:, unique_fraction>min_fraction_unique]
if method=='named_columns':
non_cat_cols = [col not in cat_cols for col in X.columns]
reduced_X = X.loc[:, non_cat_cols]
return reduced_X
You can then call this function, giving a pandas df as X
and you can either remove named categorical columns or you can choose to remove columns with a low number of unique values (specified by min_fraction_unique
).