I've been developing a tool that automatically preprocesses data in pandas.DataFrame format. During this preprocessing step, I want to treat continuous and categorical data differently.
Here are a couple of approaches:
1) Find the ratio of the number of unique values to the total number of values. Something like the following:
    likely_cat = {}
    for var in df.columns:
        likely_cat[var] = 1. * df[var].nunique() / df[var].count() < 0.05  # or some other threshold
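For illustration, here is a minimal sketch of this heuristic on a made-up DataFrame (the column names and data are purely hypothetical):

    import numpy as np
    import pandas as pd

    # Hypothetical data: one continuous-looking and one categorical-looking column
    rng = np.random.default_rng(0)
    df = pd.DataFrame({
        'price': rng.random(1000),                                  # ~1000 unique values
        'color': rng.choice(['red', 'green', 'blue'], size=1000),   # 3 unique values
    })

    likely_cat = {}
    for var in df.columns:
        likely_cat[var] = 1. * df[var].nunique() / df[var].count() < 0.05
    print(likely_cat)  # {'price': False, 'color': True}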
2) Check whether the top n unique values account for more than a certain proportion of all values. Something like the following:
    top_n = 10
    likely_cat = {}
    for var in df.columns:
        likely_cat[var] = 1. * df[var].value_counts(normalize=True).head(top_n).sum() > 0.8  # or some other threshold
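Either way, the resulting dict can drive the differential treatment mentioned above; a sketch of how I imagine using it (astype('category') is just a stand-in for whatever categorical handling the tool applies):

    cat_cols = [c for c in df.columns if likely_cat[c]]
    num_cols = [c for c in df.columns if not likely_cat[c]]
    df[cat_cols] = df[cat_cols].astype('category')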
Approach 1) has generally worked better for me than Approach 2). However, Approach 2) is better when a variable has a long-tailed distribution, i.e. a small number of its values occur very frequently while a large number of its values occur rarely.
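To make the long-tail case concrete, here is a sketch with a made-up column where one value dominates and 100 rare values each appear once; the ratio test (Approach 1) misses it while the top-n test (Approach 2) flags it:

    import pandas as pd

    # Hypothetical long-tailed column: one dominant value plus 100 one-off values
    values = ['common'] * 900 + ['rare_%d' % i for i in range(100)]
    s = pd.Series(values)

    ratio_test = 1. * s.nunique() / s.count() < 0.05                  # 101/1000 = 0.101 -> False
    top_n_test = s.value_counts(normalize=True).head(10).sum() > 0.8  # ~0.909 -> True
    print(ratio_test, top_n_test)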