I have 1 million 5-dimensional points that I need to group into k clusters with k << 1 million. In each cluster, no two points should be too far apart (e.g. they could be
Consider an approximate nearest neighbor (ANN) algorithm or locality sensitive hashing (LSH). They don't directly solve the clustering problem, but they will be able to tell you which points are "close" to one another. By altering the parameters, you can define close to be as close as you want. And it's fast.
More precisely, LSH can provide a hash function, h
, such that, for two points x
and y
, and distance metric d
,
d(x,y) <= R1 => P(h(x) = h(y)) >= P1
d(x,y) >= R2 => P(h(x) = h(y)) <= P2
where R1 < R2
and P1 > P2
. So yes, it is probabilistic. You can postprocess the retrieved data to arrive at true clusters.
Here is information on LSH including the E2LSH manual. ANN is similar in spirit; David Mount has information here, or try FLANN (has Matlab and Python bindings).