What is a good heuristic to detect if a column in a pandas.DataFrame is categorical?

旧时难觅i 2021-02-01 18:12

I've been developing a tool that automatically preprocesses data in pandas.DataFrame format. During this preprocessing step, I want to treat continuous and categorical data dif

7 answers
  •  别那么骄傲
    2021-02-01 18:55

    Here are a couple of approaches:

    1. Find the ratio of the number of unique values to the total number of non-null values. Something like the following:

      # Flag a column when distinct values are a small share of its non-null values.
      likely_cat = {}
      for var in df.columns:
          likely_cat[var] = df[var].nunique() / df[var].count() < 0.05  # or some other threshold
      
    2. Check whether the top n most frequent values account for more than a certain proportion of all values:

      # Flag a column when its top_n most frequent values cover most of the rows.
      top_n = 10
      likely_cat = {}
      for var in df.columns:
          likely_cat[var] = df[var].value_counts(normalize=True).head(top_n).sum() > 0.8  # or some other threshold
      

    Approach 1 has generally worked better for me than Approach 2. But Approach 2 is better when a column has a long-tailed distribution, where a small number of categories occur with high frequency while a large number of categories occur with low frequency.
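
To make the difference between the two approaches concrete, here is a minimal sketch that wraps each heuristic in a function and runs both on a small made-up DataFrame. The function names (`unique_ratio_is_low`, `top_n_dominates`), the column names, and the toy data are all assumptions for illustration. The thresholds are the same arbitrary ones used above and would need tuning on real data.

```python
import pandas as pd


def unique_ratio_is_low(col: pd.Series, threshold: float = 0.05) -> bool:
    """Approach 1: distinct values are a small share of the non-null values."""
    n = col.count()  # number of non-null values
    if n == 0:
        return False  # an empty column gives no evidence either way
    return bool(col.nunique() / n < threshold)


def top_n_dominates(col: pd.Series, top_n: int = 10, threshold: float = 0.8) -> bool:
    """Approach 2: the top_n most frequent values cover most of the column."""
    if col.count() == 0:
        return False
    coverage = col.value_counts(normalize=True).head(top_n).sum()
    return bool(coverage > threshold)


# Hypothetical toy data: 300 rows per column.
df = pd.DataFrame({
    # 3 distinct labels repeated evenly
    "color": ["red", "green", "blue"] * 100,
    # 300 distinct numeric values
    "measurement": [i * 0.5 for i in range(300)],
    # long tail: two frequent labels plus 30 labels that each appear once
    "long_tailed": ["a"] * 180 + ["b"] * 90 + [f"rare_{i}" for i in range(30)],
})

# Map each column name to (approach 1 result, approach 2 result).
summary = {
    name: (unique_ratio_is_low(df[name]), top_n_dominates(df[name]))
    for name in df.columns
}
print(summary)
```

On this toy data both heuristics flag `color` and reject `measurement`, but only Approach 2 flags `long_tailed`: its 32 distinct values out of 300 exceed the 0.05 ratio, while its ten most frequent values still cover more than 80% of the rows.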
