What is a good heuristic to detect if a column in a pandas.DataFrame is categorical?


旧时难觅i 2021-02-01 18:12

I've been developing a tool that automatically preprocesses data in pandas.DataFrame format. During this preprocessing step, I want to treat continuous and categorical data dif

7 answers

南笙 (OP)

2021-02-01 18:33
I've been thinking about a similar problem, and the more I consider it, the more it seems that this is itself a classification problem that could benefit from training a model.

I bet if you examined a bunch of datasets and extracted these features for each column / pandas.Series:
- % floats: percentage of values that are float
- % int: percentage of values that are whole numbers
- % string: percentage of values that are strings
- % unique strings: number of unique string values / total number of values
- % unique integers: number of unique integer values / total number of values
- mean numerical value (non numerical values considered 0 for this)
- std deviation of numerical values
and trained a model, it could get pretty good at inferring column types, where the possible output values are: categorical, ordinal, quantitative.
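
To make the feature list concrete, here is a rough sketch of how those per-column features could be computed. The function name `column_features` and the dictionary keys are my own, not part of pandas. I'm also assuming that "float" means a number with a fractional part, that missing values are dropped before counting, and that the std deviation is taken over the numeric values only.

```python
import numbers

import pandas as pd


def column_features(series: pd.Series) -> dict:
    """Compute the per-column features from the list above (hypothetical helper)."""
    values = series.dropna().tolist()  # assumption: ignore missing values
    total = len(values)
    keys = ["pct_float", "pct_int", "pct_string", "pct_unique_string",
            "pct_unique_int", "mean_numeric", "std_numeric"]
    if total == 0:
        return dict.fromkeys(keys, 0.0)

    def is_number(v):
        # bool is a subclass of int in Python, so exclude it explicitly
        return isinstance(v, numbers.Real) and not isinstance(v, bool)

    numeric = [float(v) for v in values if is_number(v)]
    whole = [v for v in numeric if v.is_integer()]        # 1, 2.0, ...
    floats = [v for v in numeric if not v.is_integer()]   # 3.5, ...
    strings = [v for v in values if isinstance(v, str)]
    # non-numerical values count as 0 for the mean, as in the list above
    zero_filled = [float(v) if is_number(v) else 0.0 for v in values]

    return {
        "pct_float": len(floats) / total,
        "pct_int": len(whole) / total,
        "pct_string": len(strings) / total,
        "pct_unique_string": len(set(strings)) / total,
        "pct_unique_int": len(set(whole)) / total,
        "mean_numeric": sum(zero_filled) / total,
        "std_numeric": pd.Series(numeric).std(ddof=0) if numeric else 0.0,
    }


# One feature row per column; these rows would be the model's training input.
df = pd.DataFrame({"rating": [1, 2, 2, 3], "city": ["a", "b", "a", "a"]})
features = pd.DataFrame({name: column_features(df[name]) for name in df}).T
```

The labels (categorical, ordinal, quantitative) would still have to come from hand-annotated datasets; this only covers the feature side.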

Side note: for a Series with a limited number of distinct numerical values, the interesting problem seems to be telling categorical from ordinal; it doesn't hurt to treat a variable as ordinal if it turns out to be quantitative, right? The preprocessing steps would encode ordinal values numerically anyway, without one-hot encoding.
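
As a small illustration of why the categorical vs ordinal decision matters for encoding, here is a sketch with made-up data. The `sizes` values and their ordering are assumptions for the example only.

```python
import pandas as pd

sizes = pd.Series(["small", "large", "medium", "small"])

# Ordinal: map each level to its rank, so the order is preserved in one column.
ordered = pd.Categorical(sizes, categories=["small", "medium", "large"], ordered=True)
ordinal_codes = pd.Series(ordered.codes, index=sizes.index)

# Categorical (nominal): one indicator column per level, no order implied.
one_hot = pd.get_dummies(sizes)
```

A quantitative column mistaken for ordinal still ends up as a single numeric column, which is why that mistake is the cheaper one.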

A related problem that is interesting: given a group of columns, can you tell whether they are already one-hot encoded? E.g., in the forest-cover-type-prediction Kaggle contest, you would want to detect automatically that soil type is a single categorical variable.
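
One simple check for the one-hot case, as a sketch: a group of columns looks one-hot encoded if every value is 0 or 1 and every row has exactly one 1. The function name `looks_one_hot` and the column names below are made up for illustration, not taken from the actual contest data.

```python
import pandas as pd


def looks_one_hot(df: pd.DataFrame, columns: list) -> bool:
    """Return True if `columns` behave like one one-hot encoded variable."""
    if len(columns) < 2:
        return False
    block = df[columns]
    if block.isna().any().any():
        return False
    # every cell must be 0 or 1
    if not block.isin([0, 1]).all().all():
        return False
    # exactly one column is "on" in each row
    return bool((block.sum(axis=1) == 1).all())


df = pd.DataFrame({
    "soil_a": [1, 0, 0],
    "soil_b": [0, 1, 0],
    "soil_c": [0, 0, 1],
    "elevation": [2500, 2600, 2700],
})
soil_cols = ["soil_a", "soil_b", "soil_c"]

if looks_one_hot(df, soil_cols):
    # collapse the indicators back into a single categorical column
    soil_type = df[soil_cols].idxmax(axis=1).astype("category")
```

This doesn't say which groups of columns to test; a shared name prefix is one plausible way to pick candidates, but that part is guesswork.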