What is a good heuristic to detect if a column in a pandas.DataFrame is categorical?

旧时难觅i 2021-02-01 18:12

I've been developing a tool that automatically preprocesses data in pandas.DataFrame format. During this preprocessing step, I want to treat continuous and categorical data dif

7 answers

庸人自扰 (OP)

2021-02-01 18:40

I think the real question here is whether you'd like to bother the user once in a while or silently fail once in a while.

If you don't mind bothering the user, maybe detecting ambiguity and raising an error is the way to go.
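A minimal sketch of that "raise on ambiguity" option, in case it helps. The function name, the `AmbiguousColumnError` class, and the `low`/`high` thresholds are all made up for illustration and are not tuned on real data. It also treats every non-numeric dtype as categorical, which is a simplification (datetimes, for instance, would need their own handling).

```python
import pandas as pd
from pandas.api.types import is_bool_dtype, is_numeric_dtype


class AmbiguousColumnError(ValueError):
    """Hypothetical error: the column could not be classified confidently."""


def classify_column(series, low=0.05, high=0.5):
    """Return 'categorical' or 'continuous', or raise if the call is unclear.

    `low` and `high` are illustrative cut-offs on the ratio of distinct
    values to non-null values; they are assumptions, not recommendations.
    """
    values = series.dropna()
    if values.empty:
        raise AmbiguousColumnError(f"{series.name!r}: no non-null values")

    # Assumption: strings, booleans, and pandas categoricals are categorical.
    if is_bool_dtype(values) or not is_numeric_dtype(values):
        return "categorical"

    # Any value with a fractional part is taken as evidence of continuity.
    if ((values % 1) != 0).any():
        return "continuous"

    # Integer-valued column: decide from the share of distinct values.
    ratio = values.nunique() / len(values)
    if ratio <= low:
        return "categorical"
    if ratio >= high:
        return "continuous"

    # In between, ask the user instead of guessing.
    raise AmbiguousColumnError(
        f"{series.name!r}: unique ratio {ratio:.2f} is between {low} and {high}; "
        "please specify the column type explicitly"
    )
```

The caller can catch the error and ask the user for the column type, so the tool only bothers them about the unclear columns.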

If you don't mind failing silently, then your heuristics are OK. I don't think you'll find anything significantly better. You could turn this into a learning problem if you really want to: download a bunch of datasets, assume they are collectively a decent representation of all datasets in the world, and train a model on features computed for each dataset and column to predict categorical vs. continuous.
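For the learning-problem route, the first step would be turning each column into a row of features. Here is a rough sketch of that step only. The feature set and the names `column_features` and `feature_table` are my own guesses at what might be informative, not an established recipe. You would still need hand-labelled columns and a classifier of your choice on top of this.

```python
import pandas as pd
from pandas.api.types import is_bool_dtype, is_numeric_dtype


def column_features(series):
    """Summarise one column as a dict of simple, illustrative features."""
    values = series.dropna()
    n = len(values)
    # Booleans are excluded so the integer check below only sees real numbers.
    numeric = is_numeric_dtype(values) and not is_bool_dtype(values)
    return {
        "n_values": n,
        # Share of distinct values among the non-null values.
        "unique_ratio": values.nunique() / n if n else 0.0,
        "is_numeric": bool(numeric),
        # True when every numeric value has no fractional part.
        "all_integers": bool(numeric and n and ((values % 1) == 0).all()),
        # Relative frequency of the most common value.
        "top_value_share": (
            float(values.value_counts(normalize=True).iloc[0]) if n else 0.0
        ),
    }


def feature_table(df):
    """Build one feature row per column of `df`."""
    rows = {col: column_features(df[col]) for col in df.columns}
    return pd.DataFrame.from_dict(rows, orient="index")
```

Calling `feature_table(df)` on each downloaded dataset and stacking the results gives a training table with one row per column, to which the categorical/continuous labels can be attached.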

But of course, in the end nothing can be perfect. For example, does the column [1, 8, 22, 8, 9, 8] refer to hours of the day or to dog breeds?
