I\'ve been developing a tool that automatically preprocesses data in pandas.DataFrame format. During this preprocessing step, I want to treat continuous and categorical data dif
I think the real question here is whether you'd like to bother the user once in a while or silently fail once in a while.
If you don't mind bothering the user, maybe detecting ambiguity and raising an error is the way to go.
If you don't mind failing silently, then your heuristics are ok. I don't think you'll find anything that's significantly better. I guess you could make this into a learning problem if you really want to. Download a bunch of datasets, assume they are collectively a decent representation of all data sets in the world, and train based on features over each data set / column to predict categorical vs. continuous.
But of course in the end nothing can be perfect. E.g. is the column [1, 8, 22, 8, 9, 8] referring to hours of the day or to dog breeds?