Question
I need to cluster a matrix that contains mostly zero values... Is k-means appropriate for this kind of data, or do I need to consider a different algorithm?
Answer 1:
No. The reason is that the mean is not sensible on sparse data. The resulting mean vectors will have very different characteristics than your actual data; they will often end up being more similar to each other than to actual documents!
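To make that concrete, here is a small sketch using only the standard library. The data is synthetic and the sizes (200 rows, 1000 features, 10 non-zeros per row) are arbitrary assumptions, not taken from the question. It compares how sparse a single row is with how sparse the mean of all rows is.

```python
import random


def density(vec):
    """Fraction of non-zero entries in a vector."""
    return sum(1 for x in vec if x != 0) / len(vec)


random.seed(0)
n_docs, n_features, nnz_per_doc = 200, 1000, 10  # illustrative sizes only

# Build synthetic "documents": each has exactly nnz_per_doc non-zero entries.
docs = []
for _ in range(n_docs):
    vec = [0.0] * n_features
    for j in random.sample(range(n_features), nnz_per_doc):
        vec[j] = 1.0
    docs.append(vec)

# A k-means centroid is the coordinate-wise mean of the cluster members.
centroid = [sum(col) / n_docs for col in zip(*docs)]

doc_density = density(docs[0])
centroid_density = density(centroid)
print(f"density of one document: {doc_density:.3f}")
print(f"density of the centroid: {centroid_density:.3f}")
```

Each row is 1% non-zero, while the mean is non-zero in every column that any row touches, so it looks nothing like a typical row.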
There are some modifications that improve k-means for sparse data such as spherical k-means.
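For reference, a minimal sketch of the spherical k-means idea with NumPy: rows and centroids are kept at unit length, and points are assigned by cosine similarity instead of Euclidean distance. The function name `spherical_kmeans`, its parameters, and the toy data are my own for illustration; this uses dense arrays for brevity and is not a tuned implementation.

```python
import numpy as np


def spherical_kmeans(X, k, n_iter=20, seed=0):
    """Cluster the rows of X by cosine similarity (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # Project every row onto the unit sphere; leave all-zero rows untouched.
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    X = X / np.where(norms == 0, 1.0, norms)
    # Initialise centroids from k distinct rows (fancy indexing copies them).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: for unit vectors the dot product is the cosine.
        labels = np.argmax(X @ centroids.T, axis=1)
        # Update step: sum the members, then re-project onto the unit sphere.
        for j in range(k):
            members = X[labels == j]
            if len(members) == 0:
                continue  # keep the old centroid for an empty cluster
            total = members.sum(axis=0)
            norm = np.linalg.norm(total)
            if norm > 0:
                centroids[j] = total / norm
    # Final assignment against the last set of centroids.
    labels = np.argmax(X @ centroids.T, axis=1)
    return labels, centroids


# Toy data: roughly 5% non-zero, plus one forced non-zero per row
# so that no row is entirely zero.
rng = np.random.default_rng(1)
data = rng.random((100, 50)) * (rng.random((100, 50)) < 0.05)
data[np.arange(100), rng.integers(0, 50, size=100)] = 1.0

labels, centroids = spherical_kmeans(data, k=3)
print(np.bincount(labels, minlength=3))
```

The only differences from plain k-means are the normalization of the inputs and the re-projection of each centroid after the update step.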
But largely, k-means on such data is just a crude heuristic. The results aren't entirely useless, but they are not the best that you can do either. It works, but by chance, not by design.
Answer 2:
k-means is widely used to cluster sparse data such as document-term vectors, so I'd say go ahead. Whether you get good results depends on the data and what you're looking for, of course.
There are a few things to keep in mind:
- If you have very sparse data, then a sparse representation of your input can reduce memory usage and runtime by many orders of magnitude, so pick a good k-means implementation.
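- Euclidean distance isn't always the best metric for sparse vectors, but normalizing them to unit length may give better results: for unit vectors, squared Euclidean distance equals 2 - 2·cos(a, b), so k-means then effectively groups by cosine similarity.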
- The cluster centroids are in all likelihood going to be dense regardless of the input sparsity, so don't use too many features.
- Doing dimensionality reduction, e.g. SVD, on the samples may improve both running time and cluster quality a lot.
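The points above can be put together roughly as follows. This is a sketch assuming scikit-learn and SciPy are installed; the matrix is random stand-in data, and the sizes, `n_components=50` and `n_clusters=5` are placeholder values I picked, not recommendations.

```python
from scipy import sparse
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# Stand-in for the real data: 500 x 2000 CSR matrix, about 1% non-zero.
X = sparse.random(500, 2000, density=0.01, format="csr", random_state=0)

pipeline = make_pipeline(
    Normalizer(norm="l2"),                          # scale rows to unit length
    TruncatedSVD(n_components=50, random_state=0),  # accepts sparse input directly
    Normalizer(norm="l2"),                          # re-normalize after projection
    KMeans(n_clusters=5, n_init=10, random_state=0),
)

# Fit all steps and return one cluster label per row of X.
labels = pipeline.fit_predict(X)
print(labels[:10])
```

`TruncatedSVD` is used here rather than `PCA` because it does not center the data, so the sparse matrix never has to be converted to a dense one. After the reduction the centroids live in 50 dimensions, which keeps them small even though they are dense.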
Source: https://stackoverflow.com/questions/18063087/is-k-means-for-clustering-data-with-many-zero-values