I\'ve been developing a tool that automatically preprocesses data in pandas.DataFrame format. During this preprocessing step, I want to treat continuous and categorical data dif
I've been thinking about a similar problem and the more that I consider it, it seems that this itself is a classification problem that could benefit from training a model.
I bet if you examined a bunch of datasets and extracted these features for each column / pandas.Series:
and trained a model, it could get pretty good at inferring column types, where the possible output values are: categorical, ordinal, quantitative.
Side note: as far as a Series with a limited number of numerical values goes, it seems like the interesting problem would be determining categorical vs ordinal; it doesn't hurt to think a variable is ordinal if it turns out to be quantitative right? The preprocessing steps would encode the ordinal values numerically anyways without one-hot encoding.
A related problem that is interesting: given a group of columns, can you tell if they are already one-hot encoded? E.g in the forest-cover-type-prediction kaggle contest, you would automatically know that soil type is a single categorical variable.