Is K-means suitable for clustering data with many zero values?

落花浮王杯, submitted on 2020-05-15 18:38:49

Question


I need to cluster a matrix that contains mostly zero values. Is K-means appropriate for this kind of data, or do I need to consider a different algorithm?


Answer 1:


No. The reason is that the mean is not a sensible representative of sparse data. The resulting mean vectors will have very different characteristics from your actual data: they are typically far denser, and they will often end up being more similar to each other than to actual documents!
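To make the density point concrete, here is a minimal sketch on hypothetical toy data (the matrix below is invented for illustration, and NumPy is assumed to be available). Each row uses only 2 of 12 features, yet the mean of the rows has no zero entries at all:

```python
import numpy as np

# Hypothetical toy data: 6 "documents" over 12 terms.
# Each document uses exactly 2 terms, and no two documents share a term.
X = np.zeros((6, 12))
for i in range(6):
    X[i, 2 * i] = 1.0
    X[i, 2 * i + 1] = 1.0

# The k-means centroid of a cluster is the coordinate-wise mean of its members.
centroid = X.mean(axis=0)

nonzeros_per_row = np.count_nonzero(X, axis=1)  # sparsity of each data point
nonzeros_in_mean = np.count_nonzero(centroid)   # sparsity of the centroid

print("non-zeros per data point:", nonzeros_per_row)
print("non-zeros in the mean:", nonzeros_in_mean)
```

This only illustrates the density effect. Whether it hurts in practice depends on how much the members of a cluster really overlap.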

There are some modifications that improve k-means for sparse data, such as spherical k-means, which works on unit-length vectors and compares them by cosine similarity.
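As a rough sketch of what spherical k-means does, the function below normalizes the rows to unit length, assigns each row to the centroid with the highest cosine similarity, and re-normalizes each centroid after every update. The function name `spherical_kmeans`, its parameters, and the toy matrix are assumptions made for this illustration, not a reference implementation. It also assumes non-negative rows that are not all zero.

```python
import numpy as np

def spherical_kmeans(X, k, n_iter=20, seed=0):
    """Cluster the rows of X by cosine similarity; returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Project every row onto the unit sphere (assumes no all-zero rows).
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    # Initialize the centroids with k distinct data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # For unit vectors, the dot product equals the cosine similarity.
        labels = np.argmax(X @ centroids.T, axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members) == 0:
                continue  # keep the previous centroid for an empty cluster
            total = members.sum(axis=0)
            # Re-normalize so the centroid stays on the unit sphere.
            centroids[j] = total / np.linalg.norm(total)
    # Final assignment against the last set of centroids.
    labels = np.argmax(X @ centroids.T, axis=1)
    return labels, centroids

# Hypothetical toy data: two groups of rows with disjoint non-zero features.
X = np.array([
    [3, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 2, 1],
    [0, 0, 1, 4],
], dtype=float)
labels, centroids = spherical_kmeans(X, k=2)
print(labels)
```

The fixed iteration count keeps the sketch short. A real implementation would stop when the assignments no longer change and would work on a sparse matrix type instead of a dense array.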

But largely, k-means on such data is just a crude heuristic. The results aren't entirely useless, but they are not the best that you can do either. It works, but by chance, not by design.




Answer 2:


k-means is widely used to cluster sparse data such as document-term vectors, so I'd say go ahead. Whether you get good results depends on the data and what you're looking for, of course.

There are a few things to keep in mind:

  • If you have very sparse data, then a sparse representation of your input can reduce memory usage and runtime by many orders of magnitude, so pick a good k-means implementation.
  • Euclidean distance isn't always the best metric for sparse vectors, but normalizing them to unit length may give better results.
  • The cluster centroids are in all likelihood going to be dense regardless of the input sparsity, so don't use too many features.
  • Doing dimensionality reduction, e.g. SVD, on the samples may improve the running time and cluster quality a lot.
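The points above can be combined into one small pipeline. The sketch below assumes SciPy and scikit-learn are installed, which the answer does not name, and uses a made-up toy matrix. It stores the input in a sparse format, normalizes the rows to unit length, reduces the dimensionality with truncated SVD, and then runs k-means. The second normalization after the SVD is an optional step that is common in latent semantic analysis, not something the answer prescribes.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import normalize

# Hypothetical toy matrix: 6 samples x 8 features, mostly zeros.
dense = np.array([
    [2, 1, 0, 0, 0, 0, 0, 0],
    [1, 3, 1, 0, 0, 0, 0, 0],
    [3, 0, 2, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 2, 0, 0],
    [0, 0, 0, 0, 2, 1, 3, 0],
    [0, 0, 0, 0, 0, 4, 1, 1],
], dtype=float)

# Sparse storage keeps only the non-zero entries.
X = csr_matrix(dense)

# Scale every row to unit length; the result is still sparse.
X_unit = normalize(X, norm="l2")

# Truncated SVD accepts sparse input and returns a small dense matrix.
svd = TruncatedSVD(n_components=2, random_state=0)
X_reduced = svd.fit_transform(X_unit)

# Optional: re-normalize the rows after the projection.
X_reduced = normalize(X_reduced, norm="l2")

# The centroids are dense, but only 2-dimensional after the reduction.
km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(X_reduced)
print(labels)
```

On real data, the number of SVD components and the number of clusters need tuning. The values used here only fit the toy example.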


Source: https://stackoverflow.com/questions/18063087/is-k-means-for-clustering-data-with-many-zero-values
