What is a good heuristic to detect if a column in a pandas.DataFrame is categorical?


I've been developing a tool that automatically preprocesses data in pandas.DataFrame format. During this preprocessing step, I want to treat continuous and categorical data dif


7 Answers

南笙

2021-02-01 18:33
I've been thinking about a similar problem, and the more I consider it, the more it seems that this is itself a classification problem that could benefit from training a model.

I bet if you examined a bunch of datasets and extracted these features for each column / pandas.Series:
- % floats: percentage of values that are float
- % int: percentage of values that are whole numbers
- % string: percentage of values that are strings
- % unique string: number of unique string values / total number
- % unique integers: number of unique integer values / total number
- mean numerical value (non numerical values considered 0 for this)
- std deviation of numerical values
and trained a model, it could get pretty good at inferring column types, where the possible output values are: categorical, ordinal, quantitative.
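As a rough sketch of what extracting those features could look like for one column: the function name `column_features`, the dictionary keys, and the choice to ignore NaN are my own assumptions, and "float" is interpreted here as "numeric but not a whole number".

```python
import numbers
import pandas as pd

def column_features(s: pd.Series) -> dict:
    """Hypothetical feature extractor for one column (NaN values are ignored)."""
    values = s.dropna()
    n = len(values)
    if n == 0:
        return {}
    is_str = values.map(lambda v: isinstance(v, str)).astype(bool)
    # Booleans are excluded from "numeric" on purpose (an assumption).
    is_num = values.map(
        lambda v: isinstance(v, numbers.Real) and not isinstance(v, bool)
    ).astype(bool)
    nums = values[is_num].astype(float)
    is_whole = nums.map(float.is_integer).astype(bool)
    return {
        "pct_float": float((~is_whole).sum()) / n,    # numeric, not whole
        "pct_int": float(is_whole.sum()) / n,          # whole numbers
        "pct_string": float(is_str.sum()) / n,
        "pct_unique_string": values[is_str].nunique() / n,
        "pct_unique_int": nums[is_whole].nunique() / n,
        "mean_numeric": float(nums.sum()) / n,         # non-numeric counted as 0
        "std_numeric": float(nums.std(ddof=0)) if len(nums) else 0.0,
    }

features = column_features(pd.Series([1, 2, 2, "a", "b", 3.5]))
```

One row of such features per column, plus a hand-assigned label, would form the training set for the model.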

Side note: for a Series with a limited number of numerical values, the interesting problem seems to be telling categorical from ordinal; it doesn't hurt to treat a variable as ordinal if it turns out to be quantitative, right? The preprocessing steps would encode the ordinal values numerically anyway, without one-hot encoding.

A related problem that is interesting: given a group of columns, can you tell if they are already one-hot encoded? E.g., in the forest-cover-type-prediction Kaggle contest, you would automatically know that soil type is a single categorical variable.
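A minimal check for the one-hot question might look like this. It only tests a group of columns you already suspect; finding the candidate groups is a separate problem. The function name and the column names in the example are made up for illustration.

```python
import pandas as pd

def looks_one_hot(df: pd.DataFrame, cols) -> bool:
    """True if every value in `cols` is 0/1 and each row has exactly one 1."""
    block = df[list(cols)]
    # Anything other than 0 or 1 (including NaN or strings) rules it out.
    if not block.isin([0, 1]).all().all():
        return False
    return bool((block.sum(axis=1) == 1).all())

# Hypothetical data: three indicator columns plus one quantitative column.
df = pd.DataFrame({
    "soil_1": [1, 0, 0, 1],
    "soil_2": [0, 1, 0, 0],
    "soil_3": [0, 0, 1, 0],
    "elevation": [2596, 2590, 2804, 2785],
})
soil_is_one_hot = looks_one_hot(df, ["soil_1", "soil_2", "soil_3"])
```

If the check passes, the group can be collapsed back into a single categorical column, e.g. with `df[cols].idxmax(axis=1)`.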

无人及你

2021-02-01 18:39

I've been looking at this and thought it may be useful to share what I have. This builds on @Rishabh Srivastava's answer.

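```
import pandas as pd

def remove_cat_features(X, method='fraction_unique', cat_cols=None, min_fraction_unique=0.05):
    """Removes categorical features using a given method.
       X: pd.DataFrame, dataframe to remove categorical features from."""

    if method == 'fraction_unique':
        # Keep columns whose fraction of distinct values exceeds the threshold.
        unique_fraction = X.apply(lambda col: len(pd.unique(col)) / len(col))
        return X.loc[:, unique_fraction > min_fraction_unique]

    if method == 'named_columns':
        # Keep every column that is not explicitly listed as categorical.
        cat_cols = cat_cols or []
        non_cat_cols = [col not in cat_cols for col in X.columns]
        return X.loc[:, non_cat_cols]

    # Previously an unknown method failed with an unbound-variable error.
    raise ValueError(f"Unknown method: {method!r}")
```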

You can then call this function with a pandas DataFrame as X. You can either remove named categorical columns (method='named_columns', listing them in cat_cols) or remove columns with a low fraction of unique values (at or below min_fraction_unique).


忘掉有多难

2021-02-01 18:39

IMO the opposite strategy, identifying categoricals, is better, because it depends on what the data is about. Technically, address data can be thought of as unordered categorical data, but usually I wouldn't use it that way.

For survey data, one idea would be to look for Likert scales, e.g. 5-8 values, either strings (which would probably need hardcoded (and translated) levels to look for, such as "good", "bad", ".agree.", "very .*",...) or int values in the 0-8 range + NA.

Countries and such things might also be identifiable...

Age groups (".-.") might also work.
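A rough sketch of the Likert and age-group ideas is below. The regular expressions, the word list, and the function name `survey_hint` are my own guesses rather than a vetted vocabulary, and real survey data would need translated levels as mentioned above.

```python
import re
import pandas as pd

# Hypothetical patterns; extend per language and survey.
LIKERT_WORDS = re.compile(r"\b(good|bad|agree|disagree|neutral)\b|^very\b", re.IGNORECASE)
AGE_GROUP = re.compile(r"^\s*\d+\s*-\s*\d+\s*$")

def survey_hint(s: pd.Series, min_levels=5, max_levels=8):
    """Return 'likert', 'age_group', or None for a column (NA is ignored)."""
    levels = s.dropna().unique()
    if not (min_levels <= len(levels) <= max_levels):
        return None
    if all(isinstance(v, str) for v in levels):
        if all(AGE_GROUP.match(v) for v in levels):
            return "age_group"
        if all(LIKERT_WORDS.search(v) for v in levels):
            return "likert"
        return None
    try:
        nums = [float(v) for v in levels]
    except (TypeError, ValueError):
        return None
    # Whole numbers in the 0-8 range look like a coded Likert scale.
    if all(x.is_integer() and 0 <= x <= 8 for x in nums):
        return "likert"
    return None

answers = pd.Series(["agree", "disagree", "strongly agree", "neutral", "very bad"])
hint = survey_hint(answers)
```

A None result just means "no hint"; the column would then fall through to whatever generic heuristic you use.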

庸人自扰

2021-02-01 18:40

I think the real question here is whether you'd like to bother the user once in a while or silently fail once in a while.

If you don't mind bothering the user, maybe detecting ambiguity and raising an error is the way to go.
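One way to sketch the "bother the user" option is to use two thresholds and raise when a column falls between them. The thresholds, the exception class, and the rule that non-numeric columns are always categorical are all assumptions for illustration.

```python
import pandas as pd

class AmbiguousColumnError(ValueError):
    """Raised when the heuristic cannot decide (hypothetical helper class)."""

def classify_column(s: pd.Series, cat_below=0.05, cont_above=0.5) -> str:
    values = s.dropna()
    if values.empty:
        raise AmbiguousColumnError(f"column {s.name!r} has no values")
    # Assumption: anything non-numeric is treated as categorical.
    if not pd.api.types.is_numeric_dtype(values):
        return "categorical"
    ratio = values.nunique() / len(values)
    if ratio < cat_below:
        return "categorical"
    if ratio > cont_above:
        return "continuous"
    # Between the two thresholds: ask the user instead of guessing.
    raise AmbiguousColumnError(
        f"column {s.name!r} has unique ratio {ratio:.2f}; please specify its type"
    )

binary_kind = classify_column(pd.Series([0, 1] * 100))
dense_kind = classify_column(pd.Series(range(100)))
```

Widening or narrowing the gap between the two thresholds trades user interruptions against silent mistakes.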

If you don't mind failing silently, then your heuristics are ok. I don't think you'll find anything that's significantly better. I guess you could make this into a learning problem if you really want to. Download a bunch of datasets, assume they are collectively a decent representation of all data sets in the world, and train based on features over each data set / column to predict categorical vs. continuous.

But of course in the end nothing can be perfect. E.g. is the column [1, 8, 22, 8, 9, 8] referring to hours of the day or to dog breeds?

难免孤独

2021-02-01 18:53
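You could define which datatypes count as numerics and then exclude the corresponding variables.

If the initial dataframe is df:
```
# dtypes treated as numeric; extend the list (e.g. 'int8', unsigned types) if your data uses them
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
# keep only the columns whose dtype is not in the list
dataframe = df.select_dtypes(exclude=numerics)
```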
别那么骄傲

2021-02-01 18:55
Here are a couple of approaches:
1. Find the ratio of the number of unique values to the total number of values. Something like the following:
```
likely_cat = {}
for var in df.columns:
    # nunique() and count() both ignore NaN
    likely_cat[var] = df[var].nunique() / df[var].count() < 0.05  # or some other threshold
```
2. Check if the top n unique values account for more than a certain proportion of all values:
```
top_n = 10
likely_cat = {}
for var in df.columns:
    # share of rows covered by the top_n most frequent values
    likely_cat[var] = df[var].value_counts(normalize=True).head(top_n).sum() > 0.8  # or some other threshold
```
Approach 1 has generally worked better for me than approach 2. But approach 2 is better if there is a long-tailed distribution, where a small number of categories have high frequency while a large number of categories have low frequency.
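To make the long-tailed case concrete, here is a small sketch on synthetic data. The helper names and the data are invented for illustration, and the thresholds are the same arbitrary ones used above.

```python
import pandas as pd

def unique_ratio_is_cat(s: pd.Series, threshold=0.05) -> bool:
    # Approach 1: few distinct values relative to the number of rows.
    return bool(s.nunique() / s.count() < threshold)

def top_n_is_cat(s: pd.Series, top_n=10, threshold=0.8) -> bool:
    # Approach 2: the top_n most frequent values cover most of the rows.
    return bool(s.value_counts(normalize=True).head(top_n).sum() > threshold)

# Long-tailed column: 3 frequent values cover 90 of 100 rows,
# and 10 rare values appear once each.
long_tailed = pd.Series(
    ["a"] * 40 + ["b"] * 30 + ["c"] * 20 + [f"rare_{i}" for i in range(10)]
)
# Continuous-looking column: 100 distinct values.
continuous = pd.Series(range(100)) / 7.0

by_ratio = unique_ratio_is_cat(long_tailed)
by_top_n = top_n_is_cat(long_tailed)
```

Here the unique ratio is 13/100, so approach 1 misses the column, while the ten most frequent values cover 97% of the rows, so approach 2 flags it. Neither approach flags the continuous column.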