What is a good heuristic to detect if a column in a pandas.DataFrame is categorical?

旧时难觅i 2021-02-01 18:12

I've been developing a tool that automatically preprocesses data in pandas.DataFrame format. During this preprocessing step, I want to treat continuous and categorical data dif

7 answers
  •  庸人自扰
    2021-02-01 18:40

    I think the real question here is whether you'd like to bother the user once in a while or silently fail once in a while.

    If you don't mind bothering the user, maybe detecting ambiguity and raising an error is the way to go.
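
A minimal sketch of the "detect ambiguity and raise" option, under stated assumptions: the unique-value-ratio rule, the `low` and `high` thresholds, and the names `classify_column` and `AmbiguousColumnError` are all hypothetical and are not the heuristics from the question. The point is only that anything falling between the two thresholds is reported to the user instead of being guessed.

```python
import pandas as pd
from pandas.api import types as ptypes


class AmbiguousColumnError(ValueError):
    """Hypothetical error type: the column could not be classified confidently."""


def classify_column(series, low=0.05, high=0.5):
    """Return 'categorical' or 'continuous', or raise if the column is ambiguous.

    `low` and `high` are assumed thresholds on the ratio of unique values
    to non-null values; they would need tuning on real data.
    """
    values = series.dropna()
    if values.empty:
        raise AmbiguousColumnError(f"column {series.name!r} has no non-null values")

    # Booleans and non-numeric columns (strings, pandas categoricals, ...)
    # are treated as categorical in this sketch.
    if ptypes.is_bool_dtype(values) or not ptypes.is_numeric_dtype(values):
        return "categorical"

    # Floats with a fractional part are treated as continuous.
    if ptypes.is_float_dtype(values) and not (values % 1 == 0).all():
        return "continuous"

    # Integer-valued numeric columns: decide by how often values repeat.
    unique_ratio = values.nunique() / len(values)
    if unique_ratio <= low:
        return "categorical"
    if unique_ratio >= high:
        return "continuous"
    raise AmbiguousColumnError(
        f"column {series.name!r} is ambiguous (unique ratio {unique_ratio:.2f}); "
        "please specify its type explicitly"
    )
```

A caller could catch `AmbiguousColumnError` and ask the user to label just those columns, which keeps the interruptions limited to the cases the heuristic cannot settle.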

    If you don't mind failing silently, then your heuristics are OK. I don't think you'll find anything significantly better. If you really want to, you could turn this into a learning problem: download a bunch of datasets, assume they are collectively a decent representation of all datasets in the world, and train a model on features computed over each dataset and column to predict categorical vs. continuous.
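
As a rough illustration of what "features over each column" could look like, here is a sketch that computes a few per-column statistics. The feature set and the function names `column_features` and `dataset_features` are assumptions chosen for illustration; the resulting table would still need hand-labeled columns and a classifier of your choice on top.

```python
import pandas as pd
from pandas.api import types as ptypes


def column_features(series):
    """Compute a small, hypothetical feature set for one column."""
    values = series.dropna()
    n = len(values)
    n_unique = values.nunique()
    # Booleans count as numeric in pandas, so exclude them explicitly.
    numeric = ptypes.is_numeric_dtype(values) and not ptypes.is_bool_dtype(values)
    return {
        "n_values": n,
        "n_unique": n_unique,
        "unique_ratio": n_unique / n if n else 0.0,
        "is_numeric": bool(numeric),
        "is_integer_valued": bool(numeric and n > 0 and (values % 1 == 0).all()),
    }


def dataset_features(df):
    """Return one row of features per column, indexed by column name."""
    rows = {name: column_features(df[name]) for name in df.columns}
    return pd.DataFrame.from_dict(rows, orient="index")
```

Stacking the output of `dataset_features` across many labeled datasets would give a training table with one row per column.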

    But of course in the end nothing can be perfect. E.g. is the column [1, 8, 22, 8, 9, 8] referring to hours of the day or to dog breeds?
