What is a good heuristic to detect if a column in a pandas.DataFrame is categorical?

旧时难觅i 2021-02-01 18:12

I've been developing a tool that automatically preprocesses data in pandas.DataFrame format. During this preprocessing step, I want to treat continuous and categorical data differently.

7 answers
  • 2021-02-01 18:33

    I've been thinking about a similar problem, and the more I consider it, the more it seems that this is itself a classification problem that could benefit from training a model.

    I bet if you examined a bunch of datasets and extracted these features for each column / pandas.Series:

    • % floats: percentage of values that are float
    • % int: percentage of values that are whole numbers
    • % string: percentage of values that are strings
    • % unique string: number of unique string values / total number
    • % unique integers: number of unique integer values / total number
    • mean numerical value (non-numerical values considered 0 for this)
    • std deviation of numerical values

    and trained a model, it could get pretty good at inferring column types, where the possible output values are: categorical, ordinal, quantitative.
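
    A rough sketch of extracting those features for a single pandas.Series (the feature names and exact definitions below are my reading of the list above, not a tested recipe):

    import pandas as pd

    def column_features(s: pd.Series) -> dict:
        """Compute simple per-column features that a type-inference model could learn from."""
        n = len(s)
        numeric = pd.to_numeric(s, errors='coerce')        # non-numeric values become NaN
        is_float = numeric.notna() & (numeric % 1 != 0)
        is_int = numeric.notna() & (numeric % 1 == 0)
        is_str = s.map(lambda v: isinstance(v, str))
        return {
            'pct_float': is_float.mean(),
            'pct_int': is_int.mean(),
            'pct_string': is_str.mean(),
            'pct_unique_string': s[is_str].nunique() / n,
            'pct_unique_int': numeric[is_int].nunique() / n,
            'mean_numeric': numeric.fillna(0).mean(),      # non-numeric values treated as 0
            'std_numeric': numeric.fillna(0).std(),
        }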

    Side note: as far as a Series with a limited number of numerical values goes, the interesting problem seems to be distinguishing categorical from ordinal; it doesn't hurt to treat a variable as ordinal if it turns out to be quantitative, right? The preprocessing step would encode the ordinal values numerically anyway, without one-hot encoding.

    A related problem that is interesting: given a group of columns, can you tell if they are already one-hot encoded? E.g. in the forest-cover-type-prediction Kaggle contest, you would automatically know that soil type is a single categorical variable.
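
    For that one-hot question, a quick sketch of a check you could run on a candidate group of columns (the "all values are 0 or 1 and each row sums to exactly 1" criterion is my own simple assumption):

    import pandas as pd

    def looks_one_hot(df: pd.DataFrame, cols: list) -> bool:
        """Heuristic: a group of columns looks one-hot encoded if every value
        is 0 or 1 and each row has exactly one 1 across the group."""
        block = df[cols]
        only_binary = block.isin([0, 1]).all().all()
        one_per_row = (block.sum(axis=1) == 1).all()
        return bool(only_binary and one_per_row)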

  • 2021-02-01 18:39

    I've been looking at this and thought it may be useful to share what I have. This builds on @Rishabh Srivastava's answer.

    import pandas as pd

    def remove_cat_features(X, method='fraction_unique', cat_cols=None, min_fraction_unique=0.05):
        """Remove categorical features from a DataFrame using the given method.

        X: pd.DataFrame, dataframe to remove categorical features from.
        method: 'fraction_unique' drops columns whose fraction of unique values is at or
                below min_fraction_unique; 'named_columns' drops the columns listed in cat_cols.
        """
        if method == 'fraction_unique':
            # A low fraction of distinct values suggests a categorical column.
            unique_fraction = X.apply(lambda col: len(pd.unique(col)) / len(col))
            reduced_X = X.loc[:, unique_fraction > min_fraction_unique]
        elif method == 'named_columns':
            # Keep only the columns that were not named as categorical.
            non_cat_cols = [col not in cat_cols for col in X.columns]
            reduced_X = X.loc[:, non_cat_cols]
        else:
            raise ValueError("method must be 'fraction_unique' or 'named_columns'")

        return reduced_X


    You can then call this function, passing a pandas DataFrame as X; you can either remove named categorical columns or remove columns with a low fraction of unique values (controlled by min_fraction_unique).
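
    A minimal usage sketch, assuming a small toy DataFrame (the column names here are made up for illustration):

    df = pd.DataFrame({
        'color': ['red', 'blue', 'red', 'green'] * 25,  # 4 unique values in 100 rows -> likely categorical
        'height': pd.Series(range(100)) / 3.0,          # 100 unique values in 100 rows -> likely continuous
    })

    # Drop columns whose unique-value fraction is at or below 5%.
    numeric_only = remove_cat_features(df, method='fraction_unique', min_fraction_unique=0.05)

    # Or drop explicitly named categorical columns instead.
    numeric_only = remove_cat_features(df, method='named_columns', cat_cols=['color'])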

  • 2021-02-01 18:39

    IMO the opposite strategy, identifying categoricals, is better, because what counts as categorical depends on what the data is about. Technically, address data can be thought of as unordered categorical data, but usually I wouldn't use it that way.

    For survey data, an idea would be to look for Likert scales, e.g. 5-8 values, either strings (which would probably need hardcoded (and translated) levels to look for, such as "good", "bad", ".agree.", "very .*", ...) or int values in the 0-8 range plus NA.

    Countries and such things might also be identifiable...

    Age groups (".-.") might also work.
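
    A rough sketch of the Likert idea; the patterns and thresholds here are illustrative guesses, not a vetted rule set:

    import re
    import pandas as pd

    # Hypothetical regexes for English Likert-style answer levels.
    LIKERT_PATTERNS = [r'agree', r'good', r'bad', r'^very\b', r'neutral']

    def looks_like_likert(series: pd.Series, max_levels: int = 8) -> bool:
        """Guess whether a Series looks like a Likert-scale item."""
        values = series.dropna().unique()
        if len(values) == 0 or len(values) > max_levels:
            return False
        if pd.api.types.is_integer_dtype(series.dropna()):
            # Few distinct small integer codes (0-8) look like a scale.
            return series.dropna().between(0, 8).all()
        if series.dtype == object:
            # Most string levels should match one of the known patterns.
            matches = sum(
                any(re.search(p, str(v), re.IGNORECASE) for p in LIKERT_PATTERNS)
                for v in values
            )
            return matches / len(values) > 0.5
        return False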

  • 2021-02-01 18:40

    I think the real question here is whether you'd like to bother the user once in a while or silently fail once in a while.

    If you don't mind bothering the user, maybe detecting ambiguity and raising an error is the way to go.

    If you don't mind failing silently, then your heuristics are ok. I don't think you'll find anything that's significantly better. I guess you could make this into a learning problem if you really want to. Download a bunch of datasets, assume they are collectively a decent representation of all data sets in the world, and train based on features over each data set / column to predict categorical vs. continuous.

    But of course in the end nothing can be perfect. E.g. is the column [1, 8, 22, 8, 9, 8] referring to hours of the day or to dog breeds?
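
    If you go the "bother the user" route, here is a minimal sketch of what detecting ambiguity and raising might look like (the unique-fraction thresholds and the exception type are arbitrary choices for illustration):

    import pandas as pd

    def column_kind_or_raise(series: pd.Series, low: float = 0.05, high: float = 0.2) -> str:
        """Return 'categorical' or 'continuous', raising when the evidence is ambiguous."""
        unique_fraction = series.nunique() / len(series)
        if unique_fraction <= low:
            return 'categorical'
        if unique_fraction >= high:
            return 'continuous'
        raise ValueError(
            f"Column {series.name!r} is ambiguous "
            f"(unique fraction {unique_fraction:.2f}); please specify its type."
        )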

  • 2021-02-01 18:53

    You could define which dtypes count as numeric and then exclude the corresponding columns.

    If the initial DataFrame is df:

    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    dataframe = df.select_dtypes(exclude=numerics)
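
    Note that select_dtypes also accepts the 'number' shorthand, so an equivalent, shorter version is:

    dataframe = df.select_dtypes(exclude='number')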
    
  • 2021-02-01 18:55

    Here are a couple of approaches:

    1. Find the ratio of the number of unique values to the total number of values. Something like the following:

      likely_cat = {}
      for var in df.columns:
          likely_cat[var] = df[var].nunique() / df[var].count() < 0.05  # or some other threshold
      
    2. Check if the top n unique values account for more than a certain proportion of all values

      top_n = 10
      likely_cat = {}
      for var in df.columns:
          likely_cat[var] = df[var].value_counts(normalize=True).head(top_n).sum() > 0.8  # or some other threshold
      

    Approach 1 has generally worked better for me than approach 2. But approach 2 is better when a column has a long-tailed distribution, i.e. a small number of its values occur very frequently while a large number of values occur rarely.
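
    A possible follow-up step, assuming you want to act on the likely_cat flags (astype('category') is standard pandas; the threshold is whichever one you used above):

    # Convert the columns flagged as likely categorical to pandas' category dtype.
    for var, is_cat in likely_cat.items():
        if is_cat:
            df[var] = df[var].astype('category')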
