I've been developing a tool that automatically preprocesses data in pandas.DataFrame format. During this preprocessing step, I want to treat continuous and categorical data differently.
Here are a couple of approaches:
1) Find the ratio of the number of unique values to the total number of values. Something like the following:
    likely_cat = {}
    for var in df.columns:
        likely_cat[var] = 1. * df[var].nunique() / df[var].count() < 0.05  # or some other threshold
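For illustration, here is a minimal sketch of this heuristic on a made-up DataFrame (the column names and data are purely hypothetical):

    import numpy as np
    import pandas as pd

    # Hypothetical data: one continuous-looking and one categorical-looking column
    rng = np.random.default_rng(0)
    df = pd.DataFrame({
        'price': rng.random(1000),                                  # ~1000 unique values
        'color': rng.choice(['red', 'green', 'blue'], size=1000),   # 3 unique values
    })

    likely_cat = {}
    for var in df.columns:
        likely_cat[var] = 1. * df[var].nunique() / df[var].count() < 0.05
    print(likely_cat)  # {'price': False, 'color': True}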
2) Check whether the top n unique values account for more than a certain proportion of all values. Something like the following:
    top_n = 10
    likely_cat = {}
    for var in df.columns:
        likely_cat[var] = 1. * df[var].value_counts(normalize=True).head(top_n).sum() > 0.8  # or some other threshold
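Either way, the resulting dict can drive the differential treatment mentioned above; a sketch of how I imagine using it (astype('category') is just a stand-in for whatever categorical handling the tool applies):

    cat_cols = [c for c in df.columns if likely_cat[c]]
    num_cols = [c for c in df.columns if not likely_cat[c]]
    df[cat_cols] = df[cat_cols].astype('category')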
Approach 1) has generally worked better for me than Approach 2). However, Approach 2) is better when a variable has a long-tailed distribution, i.e. a small number of its values occur very frequently while a large number of its values occur rarely.
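To make the long-tail case concrete, here is a sketch with a made-up column where one value dominates and 100 rare values each appear once; the ratio test (Approach 1) misses it while the top-n test (Approach 2) flags it:

    import pandas as pd

    # Hypothetical long-tailed column: one dominant value plus 100 one-off values
    values = ['common'] * 900 + ['rare_%d' % i for i in range(100)]
    s = pd.Series(values)

    ratio_test = 1. * s.nunique() / s.count() < 0.05                  # 101/1000 = 0.101 -> False
    top_n_test = s.value_counts(normalize=True).head(10).sum() > 0.8  # ~0.909 -> True
    print(ratio_test, top_n_test)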