What is a good heuristic to detect if a column in a pandas.DataFrame is categorical?

旧时难觅i 2021-02-01 18:12

I've been developing a tool that automatically preprocesses data in pandas.DataFrame format. During this preprocessing step, I want to treat continuous and categorical data dif

7 answers
  •  南笙 (OP)
    2021-02-01 18:33

    I've been thinking about a similar problem, and the more I consider it, the more it seems that this is itself a classification problem that could benefit from training a model.

    I bet if you examined a bunch of datasets and extracted these features for each column / pandas.Series:

    • % floats: percentage of values that are float
    • % int: percentage of values that are whole numbers
    • % string: percentage of values that are strings
    • % unique string: number of unique string values / total number
    • % unique integers: number of unique integer values / total number
    • mean numerical value (non-numerical values counted as 0 for this)
    • standard deviation of numerical values

    and trained a model, it could get pretty good at inferring column types, where the possible output values are: categorical, ordinal, quantitative.
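As a rough sketch of the feature extraction described above (not a tested recipe), something like the following could compute those values for one pandas.Series. The function name `column_features` and the dictionary keys are made up for illustration. It also makes a few assumptions the list leaves open: missing values are dropped first, a float such as 2.0 counts as a whole number, and the "%" features are returned as fractions between 0 and 1.

```python
import numbers

import pandas as pd


def column_features(series: pd.Series) -> dict:
    """Compute the per-column features from the list above (hypothetical helper)."""
    values = series.dropna().tolist()  # assumption: missing values are ignored
    total = len(values)
    if total == 0:
        return {}

    def is_number(v):
        # bool is a subclass of int in Python, so exclude it explicitly
        return isinstance(v, numbers.Real) and not isinstance(v, bool)

    def is_whole(v):
        # assumption: 2 and 2.0 both count as whole numbers
        return is_number(v) and float(v).is_integer()

    strings = [v for v in values if isinstance(v, str)]
    whole = [v for v in values if is_whole(v)]
    floats = [v for v in values if is_number(v) and not is_whole(v)]

    # Non-numerical values count as 0 for the mean, as the list specifies.
    with_zeros = pd.Series([float(v) if is_number(v) else 0.0 for v in values])
    # The standard deviation uses the numerical values only (NaN if fewer than two).
    numeric_only = pd.Series([float(v) for v in values if is_number(v)], dtype=float)

    return {
        "frac_float": len(floats) / total,
        "frac_int": len(whole) / total,
        "frac_string": len(strings) / total,
        "frac_unique_string": len(set(strings)) / total,
        "frac_unique_int": len({float(v) for v in whole}) / total,
        "mean_numeric": with_zeros.mean(),
        "std_numeric": numeric_only.std(),
    }
```

Calling this on every column of many labeled datasets would give the training table; the labels (categorical, ordinal, quantitative) would still have to come from somewhere, e.g. hand annotation.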

    Side note: for a Series with a limited number of distinct numerical values, the interesting problem seems to be telling categorical from ordinal. It doesn't hurt to treat a variable as ordinal if it turns out to be quantitative, right? The preprocessing step would encode ordinal values numerically anyway, without one-hot encoding.
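To make the distinction in the side note concrete, here is a small sketch of the two encodings. The `sizes` column and its level order are invented for the example.

```python
import pandas as pd

# Hypothetical column with three levels that have a natural order.
sizes = pd.Series(["small", "large", "medium", "small"])

# Ordinal treatment: each level becomes its rank in an explicit order,
# so the result is a single numeric column.
order = ["small", "medium", "large"]
ordinal_codes = pd.Categorical(sizes, categories=order, ordered=True).codes

# Categorical (nominal) treatment: one indicator column per level.
one_hot = pd.get_dummies(sizes)
```

A quantitative column mistaken for ordinal would still end up as one numeric column, which is why that mistake is fairly harmless; mistaking it for categorical would instead create one indicator column per distinct value.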

    A related problem that is interesting: given a group of columns, can you tell whether they are already one-hot encoded? E.g., in the forest-cover-type-prediction Kaggle contest, you would want to detect automatically that soil type is a single categorical variable.
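One simple way to test a candidate group of columns, offered as a sketch rather than a full answer: every value must be 0 or 1, and every row must have exactly one 1. The function names and the column names in the usage example are hypothetical, and the check assumes the indicators are stored as 0/1 numbers. It does not solve the harder part, which is finding which columns to group in the first place.

```python
import pandas as pd


def looks_one_hot(df: pd.DataFrame, columns: list) -> bool:
    """Return True if `columns` behave like one one-hot encoded variable."""
    if len(columns) < 2:
        return False
    block = df[columns]
    # Every cell must be 0 or 1 (NaN fails this check as well).
    if not block.isin([0, 1]).all().all():
        return False
    # Exactly one indicator must be set in each row.
    return bool((block.sum(axis=1) == 1).all())


def collapse_one_hot(df: pd.DataFrame, columns: list) -> pd.Series:
    """Recover the single categorical variable as the name of the active column."""
    return df[columns].idxmax(axis=1)
```

Usage, with made-up data:

```python
df = pd.DataFrame({
    "soil_a": [1, 0, 0],
    "soil_b": [0, 1, 0],
    "soil_c": [0, 0, 1],
    "elevation": [2500, 2600, 2700],
})
soil_cols = ["soil_a", "soil_b", "soil_c"]
if looks_one_hot(df, soil_cols):
    df["soil_type"] = collapse_one_hot(df, soil_cols)
```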
