I have 1 million 5-dimensional points that I need to group into k clusters with k << 1 million. In each cluster, no two points should be too far apart (e.g. they could be
I have a Perl module that does exactly what you want: Algorithm::ClusterPoints.
First, it uses the algorithm you described in your post to divide the points into multidimensional sectors; then it uses brute force to find clusters between points in adjacent sectors.
The complexity varies from O(N) to O(N**2) in very degenerate cases.
update:
@Denis: no, it is much worse:
For d dimensions, the sector (or little hypercube) size s is determined so that its diagonal l is the minimum distance c allowed between two points in different clusters.
l = c
l = sqrt(d * s * s)
s = sqrt(c * c / d) = c / sqrt(d)
Then you have to consider all the sectors that touch the hypersphere with diameter r = 2c + l centered in the pivot sector.
Roughly, we have to consider ceil(r/s) rows of sectors in every direction, which means n = pow(2 * ceil(r/s) + 1, d).
For instance, for d=5 and c=1 we get l=2.236, s=0.447, r=3.236 and n=pow(9, 5)=59049.
Actually we have to check fewer neighbor sectors, since here we are counting those that touch the hypercube of size (2r+1)/s, and we only need to check those touching the circumscribed hypersphere.
Considering the symmetric nature of the "are in the same cluster" relation, we can also halve the number of sectors that have to be checked.
Specifically, Algorithm::ClusterPoints for the case where d=5 checks 3903 sectors.
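The sector-size formula s = c / sqrt(d) above can be illustrated with a small Python sketch. This is not the code of Algorithm::ClusterPoints (which is Perl); the function names `sector_size` and `bin_into_sectors` are made up for illustration, and only the first step (binning points into sectors) is shown, not the search over adjacent sectors.

```python
from collections import defaultdict
import math

import numpy as np


def sector_size(c, d):
    """Edge length s of a d-dimensional hypercube whose diagonal equals c."""
    return c / math.sqrt(d)


def bin_into_sectors(points, c):
    """Map each sector index (a tuple of ints) to the row indices of its points."""
    points = np.asarray(points, dtype=float)
    s = sector_size(c, points.shape[1])
    keys = np.floor(points / s).astype(int)  # integer sector coordinates
    sectors = defaultdict(list)
    for i, key in enumerate(map(tuple, keys)):
        sectors[key].append(i)
    return sectors
```

Because the diagonal of each sector is c, any two points that land in the same sector are at most c apart, so they can be merged without computing a distance; only points in different sectors need the brute-force comparison.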
Below is a little test bench to see how fast scipy.spatial.cKDTree is on your data, and to get a rough idea of how the distances between nearby points scatter.
A nice way to run K-cluster for various K is to build an MST of nearest pairs, and remove the K-1 longest edges; see Wayne, Greedy Algorithms.
Visualizing the clusters would be fun -- project to 2d with PCA?
(Just curious, is your K 10, 100, 1000?)
Added 17 Dec, real runtimes: 100000 x 5 in 10 sec, 500000 x 5 in 60 sec.
#!/usr/bin/env python
# time scipy.spatial.cKDTree build, query
# runs under Python 2.6+ and Python 3
from __future__ import division, print_function
import sys
import time
import numpy as np
from scipy.spatial import cKDTree as KDTree
    # http://docs.scipy.org/doc/scipy/reference/spatial.html
    # $scipy/spatial/kdtree.py is slow but clean, 0.9 has cython
__date__ = "2010-12-17 dec denis"

def clumpiness( X, nbin=10 ):
    """ how clumpy is X ? histogramdd av, max """
        # effect on kdtree time ? not much
    N, dim = X.shape
    histo = np.histogramdd( X, nbin )[0] .astype(int)  # 10^dim
    n0 = histo.size - histo.astype(bool).sum()  # uniform: 1/e^lambda
    print("clumpiness: %d of %d^%d data bins are empty av %.2g max %d" % (
        n0, nbin, dim, histo.mean(), histo.max()))

#...............................................................................
N = 100000
nask = 0  # 0: ask all N
dim = 5
rnormal = .9
    # KDtree params --
nnear = 2  # k=nnear+1, self
leafsize = 10
eps = 1  # approximate nearest, dist <= (1 + eps) * true nearest
seed = 1

exec( "\n".join( sys.argv[1:] ))  # run this.py N= ...
np.random.seed(seed)
np.set_printoptions( precision=2, threshold=200, suppress=True )  # .2f
nask = nask or N
print("\nkdtree: dim=%d N=%d nask=%d nnear=%d rnormal=%.2g leafsize=%d eps=%.2g" % (
    dim, N, nask, nnear, rnormal, leafsize, eps))

if rnormal > 0:  # normal point cloud, .9 => many near 1 1 1 axis
    cov = rnormal * np.ones((dim,dim)) + (1 - rnormal) * np.eye(dim)
    data = np.abs( np.random.multivariate_normal( np.zeros(dim), cov, N )) % 1
        # % 1: wrap to unit cube
else:
    data = np.random.uniform( size=(N,dim) )
clumpiness(data)
    # query all points, or a random subset of nask rows without replacement
ask = data if nask == N else data[ np.random.choice( N, nask, replace=False )]
t = time.time()

#...............................................................................
datatree = KDTree( data, leafsize=leafsize )  # build the tree
print("%.1f sec to build KDtree of %d points" % (time.time() - t, N))

t = time.time()
distances, ix = datatree.query( ask, k=nnear+1, eps=eps )
print("%.1f sec to query %d points" % (time.time() - t, nask))

distances = distances[:,1:]  # [:,0] is all 0, point to itself
avdist = distances.mean( axis=0 )
maxdist = distances.max( axis=0 )
print("distances to %d nearest: av" % nnear, avdist, "max", maxdist)

# kdtree: dim=5 N=100000 nask=100000 nnear=2 rnormal=0.9 leafsize=10 eps=1
# clumpiness: 42847 of 10^5 data bins are empty av 1 max 21
# 0.4 sec to build KDtree of 100000 points
# 10.1 sec to query 100000 points
# distances to 2 nearest: av [ 0.07 0.08] max [ 0.15 0.18]
# kdtree: dim=5 N=500000 nask=500000 nnear=2 rnormal=0.9 leafsize=10 eps=1
# clumpiness: 2562 of 10^5 data bins are empty av 5 max 80
# 2.5 sec to build KDtree of 500000 points
# 60.1 sec to query 500000 points
# distances to 2 nearest: av [ 0.05 0.06] max [ 0.13 0.13]
# run: 17 Dec 2010 15:23 mac 10.4.11 ppc
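The MST idea mentioned in this answer (build a minimum spanning tree, then remove the K-1 longest edges) might look roughly like the sketch below. It is only an illustration: the function name `mst_clusters` is made up, and it builds a dense distance matrix, which is fine for a small sample but not for 1 million points, where one would feed a sparse graph of nearest pairs (e.g. from the cKDTree query above) to the same SciPy routines. It also assumes no duplicate points, since zero-length edges are treated as missing by the sparse graph routines.

```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components, minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform


def mst_clusters(points, k):
    """Label points by cutting the k-1 longest edges of their Euclidean MST."""
    dist = squareform(pdist(points))           # dense n x n matrix: small n only
    mst = minimum_spanning_tree(dist).tocoo()  # n-1 edges if points are distinct
    n_keep = len(mst.data) - (k - 1)
    keep = np.argsort(mst.data)[:n_keep]       # indices of the shortest edges
    forest = coo_matrix(
        (mst.data[keep], (mst.row[keep], mst.col[keep])), shape=mst.shape
    )
    # each connected component of the remaining forest is one cluster
    _, labels = connected_components(forest, directed=False)
    return labels
```

Since the MST is built once, trying several values of K only repeats the cheap cut-and-label step.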
You might like to try my research project called K-tree. It scales well with large inputs compared with k-means and forms a hierarchy of clusters. The trade-off is that it produces clusters with higher distortion. It has an average-case runtime of O(n log n) and a worst case of O(n**2), which only happens if you have some weird topology. More details of the complexity analysis are in my Master's thesis. I have used it with very high-dimensional text data and had no problems.
Sometimes bad splits can happen in the tree where all data goes to one side (cluster). The trunk in SVN deals with this differently than the current release. It randomly splits the data if there is a bad split. The previous method can force the tree to become too deep if there are bad splits.
Put the data into an index such as an R*-tree (Wikipedia), then you can run many density-based clustering algorithms (such as DBSCAN (Wikipedia) or OPTICS (Wikipedia)) in O(n log n).
Density-based clustering (Wikipedia) seems to be precisely what you want ("not too far apart").
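To make the index-plus-DBSCAN suggestion concrete, here is a minimal DBSCAN sketch in Python. Note the assumptions: it uses `scipy.spatial.cKDTree` as the spatial index in place of an R*-tree, the function name `dbscan` and its parameters `eps` and `min_pts` are chosen for illustration, and it is a teaching sketch rather than a tuned implementation.

```python
import numpy as np
from scipy.spatial import cKDTree


def dbscan(points, eps, min_pts):
    """Return one cluster label per point; -1 marks noise."""
    tree = cKDTree(points)
    # neighbors[i] lists all points within eps of point i (including i itself)
    neighbors = tree.query_ball_point(points, r=eps)
    is_core = np.array([len(nb) >= min_pts for nb in neighbors])
    labels = np.full(len(points), -1)
    cluster = 0
    for i in range(len(points)):
        if labels[i] != -1 or not is_core[i]:
            continue
        labels[i] = cluster
        stack = [i]
        while stack:
            j = stack.pop()
            if not is_core[j]:
                continue  # border point: joins the cluster but does not grow it
            for m in neighbors[j]:
                if labels[m] == -1:
                    labels[m] = cluster
                    stack.append(m)
        cluster += 1
    return labels
```

Here `eps` plays the role of "not too far apart": two points end up in the same cluster only if they are linked by a chain of core points whose consecutive members are within `eps` of each other.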
Consider an approximate nearest neighbor (ANN) algorithm or locality sensitive hashing (LSH). They don't directly solve the clustering problem, but they will be able to tell you which points are "close" to one another. By altering the parameters, you can define close to be as close as you want. And it's fast.
More precisely, LSH can provide a hash function, h, such that, for two points x and y, and distance metric d,
d(x,y) <= R1 => P(h(x) = h(y)) >= P1
d(x,y) >= R2 => P(h(x) = h(y)) <= P2
where R1 < R2 and P1 > P2. So yes, it is probabilistic. You can postprocess the retrieved data to arrive at true clusters.
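As a rough illustration of such a hash family for Euclidean distance, the sketch below uses random Gaussian projections cut into buckets of width w, h(x) = floor((a.x + b) / w). This is a common textbook construction, not code taken from E2LSH or FLANN; the names `make_lsh`, `lsh_hash` and `collision_rate` and the bucket width `w` are assumptions made for the example.

```python
import numpy as np


def make_lsh(dim, n_hashes, w, rng):
    """Draw parameters for n_hashes functions h(x) = floor((a.x + b) / w)."""
    a = rng.normal(size=(n_hashes, dim))    # Gaussian projection directions
    b = rng.uniform(0.0, w, size=n_hashes)  # random offsets in [0, w)
    return a, b


def lsh_hash(x, a, b, w):
    """Return the bucket index of x under each hash function."""
    return np.floor((a @ x + b) / w).astype(int)


def collision_rate(x, y, a, b, w):
    """Fraction of hash functions on which x and y share a bucket."""
    return np.mean(lsh_hash(x, a, b, w) == lsh_hash(y, a, b, w))
```

With many hash functions, `collision_rate` gives an empirical estimate of P(h(x) = h(y)): it should be high for points much closer than w and low for points much farther apart, which is the R1/P1 versus R2/P2 gap described above.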
Here is information on LSH including the E2LSH manual. ANN is similar in spirit; David Mount has information here, or try FLANN (has Matlab and Python bindings).
People have the impression that k-means is slow, but slowness is really only an issue for the EM algorithm (Lloyd's). Stochastic gradient methods for k-means are orders of magnitude faster than EM (see www.eecs.tufts.edu/~dsculley/papers/fastkmeans.pdf).
An implementation is here: http://code.google.com/p/sofia-ml/wiki/SofiaKMeans
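For a feel of what a stochastic update for k-means looks like, here is a small sketch in the spirit of mini-batch k-means, where each center keeps a count and moves toward every point assigned to it with step size 1/count. It is an assumption-laden illustration, not the SofiaKMeans code: the function name `minibatch_kmeans`, its parameters, and the requirement to pass initial centers explicitly are all choices made for this example.

```python
import numpy as np


def minibatch_kmeans(X, init_centers, batch_size=50, n_iter=100, seed=0):
    """Refine init_centers with mini-batch stochastic updates; returns centers."""
    rng = np.random.default_rng(seed)
    centers = np.array(init_centers, dtype=float)  # work on a copy
    counts = np.zeros(len(centers))                # points seen per center
    for _ in range(n_iter):
        batch = X[rng.integers(0, len(X), size=batch_size)]
        # assign every batch point to its nearest center before updating
        d2 = ((batch[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        nearest = d2.argmin(axis=1)
        for x, j in zip(batch, nearest):
            counts[j] += 1
            eta = 1.0 / counts[j]                  # per-center learning rate
            centers[j] = (1.0 - eta) * centers[j] + eta * x
    return centers
```

Each iteration touches only `batch_size` points instead of the whole data set, which is where the speed-up over full Lloyd iterations comes from.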