How to cluster data with discrete binary attributes?

Submitted by 左心房为你撑大大i on 2020-01-03 04:40:15

Question


My data has ten million binary attributes, but only some of them are informative; most of the values are zero.

The format is as follows:

data  attribute1 attribute2 attribute3 attribute4   .........
A          0          1           0         1       .........
B          1          0           1         0       .........
C          1          1           0         1       .........
D          1          1           0         0       .........

What is a smart way to cluster this? I know k-means clustering, but I don't think it is suitable here, because binary values make distances less meaningful and the method will suffer from the curse of dimensionality. Even if I cluster on only the few informative attributes, that is still too many attributes.

I think a decision tree would be a nice way to group this data, but it is a classification algorithm!

What can I do?


Answer 1:


Have you considered frequent itemset mining instead?

K-means is definitely a bad idea, but hierarchical clustering may work when used with an appropriate distance function such as Jaccard, Hamming, Dice, ...

Anyway, what is a cluster? The choice of algorithm needs to fit the kind of cluster you want to find. On binary data, centroid-based methods such as k-means don't make sense, because the centroids are not very meaningful.
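As a rough illustration of the hierarchical option, here is a minimal sketch using SciPy on the four example rows from the question. The choice of average linkage and of two clusters is an assumption made only for the demo, and a dense pairwise distance matrix like this will not scale to millions of attributes or rows without sparse or sampled variants.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

# Rows A-D and the four visible attributes from the question's example.
X = np.array([
    [0, 1, 0, 1],  # A
    [1, 0, 1, 0],  # B
    [1, 1, 0, 1],  # C
    [1, 1, 0, 0],  # D
], dtype=bool)

# Condensed pairwise Jaccard distances: 1 - |intersection| / |union|.
dists = pdist(X, metric="jaccard")

# Agglomerative clustering on the precomputed distances.
# Average linkage is an arbitrary choice for this sketch.
Z = linkage(dists, method="average")

# Cut the tree into at most two flat clusters (also an arbitrary choice).
labels = fcluster(Z, t=2, criterion="maxclust")
print(dict(zip("ABCD", labels)))
```

On this toy input, A, C and D share most of their set attributes and end up together, while B is left on its own. Swapping `metric="jaccard"` for `"hamming"` or `"dice"` tries the other distances mentioned above.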
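To see why a centroid is hard to interpret here, a small sketch with NumPy on the question's four example rows: the mean of binary vectors is a vector of fractions, which is not itself a valid binary record.

```python
import numpy as np

# Rows A-D and the four visible attributes from the question's example.
X = np.array([
    [0, 1, 0, 1],  # A
    [1, 0, 1, 0],  # B
    [1, 1, 0, 1],  # C
    [1, 1, 0, 0],  # D
])

# The k-means "centroid" of a group is the per-attribute mean.
centroid = X.mean(axis=0)
print(centroid)  # fractions of rows having each attribute, not 0/1 values
```

Each entry is just the share of rows in which that attribute is set, so the centroid does not correspond to any object that could actually occur in the data.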

If the data are "shopping cart" type of information, consider using frequent itemset mining, as it allows discovering overlapping subsets.
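A minimal sketch of frequent itemset mining in plain Python, treating each row of the question's example as the set of attributes that are 1. The function name `frequent_itemsets` and the `min_support` threshold are hypothetical names for this illustration; it is a simple level-wise search in the spirit of Apriori, not an optimized implementation, and real data of this size would call for a dedicated library.

```python
from itertools import combinations

# Each row from the question, as the set of attributes whose value is 1.
transactions = {
    "A": {"attribute2", "attribute4"},
    "B": {"attribute1", "attribute3"},
    "C": {"attribute1", "attribute2", "attribute4"},
    "D": {"attribute1", "attribute2"},
}

def frequent_itemsets(transactions, min_support):
    """Return {itemset: count} for itemsets contained in >= min_support rows."""
    rows = [frozenset(t) for t in transactions]
    items = sorted(set().union(*rows))
    current = [frozenset([i]) for i in items]  # candidate 1-itemsets
    result = {}
    while current:
        # Count how many rows contain each candidate itemset.
        counts = {c: sum(1 for r in rows if c <= r) for c in current}
        frequent = {c: n for c, n in counts.items() if n >= min_support}
        result.update(frequent)
        # Join frequent k-itemsets into candidate (k+1)-itemsets.
        keys = list(frequent)
        k = len(keys[0]) + 1 if keys else 0
        current = list({a | b for a, b in combinations(keys, 2) if len(a | b) == k})
    return result

found = frequent_itemsets(transactions.values(), min_support=2)
for itemset, count in sorted(found.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

Note that the resulting itemsets overlap: attribute2 appears both with attribute1 and with attribute4, which is the kind of overlapping structure a partitioning clustering would not report.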



Source: https://stackoverflow.com/questions/20416459/how-to-cluster-data-with-discrete-binary-attributes
