Fast (< n^2) clustering algorithm

Asked by 情话喂你, 2021-01-30 00:27

I have 1 million 5-dimensional points that I need to group into k clusters with k << 1 million. In each cluster, no two points should be too far apart (e.g. they could be …)

6 Answers
  • 2021-01-30 00:52

    I have a Perl module that does exactly what you want: Algorithm::ClusterPoints.

    First, it uses the algorithm you have described in your post to divide the points into multidimensional sectors, and then it uses brute force to find clusters among points in adjacent sectors.

    The complexity varies from O(N) up to O(N**2) in very degenerate cases.

    Update, replying to @Denis: no, it is much worse:

    For d dimensions, the sector (or little hypercube) size s is determined so that its diagonal l is the minimum distance c allowed between two points in different clusters.

    l = c
    l = sqrt(d * s * s)
    s = sqrt(c * c / d) = c / sqrt(d)
    

    Then you have to consider all the sectors that touch the hypersphere of diameter r = 2c + l centered on the pivot sector.

    Roughly, we have to consider ceil(r/s) rows of sectors in every direction, which means n = pow(2 * ceil(r/s) + 1, d).

    For instance, for d=5 and c=1 we get l=2.236, s=0.447, r=3.236 and n=pow(9, 5)=59049

    Actually we have to check fewer neighbor sectors: the count above covers all sectors in the hypercube of 2 * ceil(r/s) + 1 sectors per side, but we only need to check those that actually touch the hypersphere.

    Considering the symmetric nature of the "are in the same cluster" relation, we can also halve the number of sectors that have to be checked.

    Specifically, Algorithm::ClusterPoints for the case where d=5 checks 3903 sectors.
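
    To make this concrete, here is a minimal Python sketch of the sector approach described above (an illustration with made-up names, not the Perl module's actual code):

    import math
    import itertools
    from collections import defaultdict

    def grid_cluster(points, c):
        """Group points so any two points closer than c share a cluster,
        checking distances only between the same or neighboring sectors."""
        d = len(points[0])
        s = c / math.sqrt(d)                    # sector side: diagonal == c

        # Bucket every point into its sector (a d-dimensional integer cell).
        sectors = defaultdict(list)
        for i, p in enumerate(points):
            sectors[tuple(int(math.floor(x / s)) for x in p)].append(i)

        # Union-find over point indices.
        parent = list(range(len(points)))
        def find(i):
            while parent[i] != i:
                parent[i] = parent[parent[i]]
                i = parent[i]
            return i

        # Brute force only between points in the same or nearby sectors.
        reach = int(math.ceil(c / s))           # cells to scan in each direction
        offsets = list(itertools.product(range(-reach, reach + 1), repeat=d))
        for cell, idxs in sectors.items():
            for off in offsets:
                other = tuple(a + b for a, b in zip(cell, off))
                for i in idxs:
                    for j in sectors.get(other, ()):
                        if i < j and sum((points[i][k] - points[j][k]) ** 2
                                         for k in range(d)) <= c * c:
                            parent[find(i)] = find(j)   # union

        clusters = defaultdict(list)            # root index -> member indices
        for i in range(len(points)):
            clusters[find(i)].append(i)
        return list(clusters.values())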

  • 2021-01-30 00:53

    Below is a little test bench to see how fast scipy.spatial.cKDTree is on your data, and to get a rough idea of how the distances between nearby points scatter.

    A nice way to get K clusters for various K is to build an MST of nearest pairs and remove the K-1 longest edges; see Wayne, Greedy Algorithms. (A rough sketch of this appears after the benchmark output below.)

    Visualizing the clusters would be fun -- project to 2d with PCA ?

    (Just curious, is your K 10, 100, 1000 ?)

    Added 17 Dec: real runtimes: 100000 x 5 points, 10 sec; 500000 x 5, 60 sec.

    #!/usr/bin/env python
    # time scipy.spatial.cKDTree build, query  (Python 2 script)
    
    from __future__ import division
    import random
    import sys
    import time
    import numpy as np
    from scipy.spatial import cKDTree as KDTree
        # http://docs.scipy.org/doc/scipy/reference/spatial.html
        # $scipy/spatial/kdtree.py is slow but clean, 0.9 has cython
    __date__ = "2010-12-17 dec denis"
    
    def clumpiness( X, nbin=10 ):
        """ how clumpy is X ? histogramdd av, max """
            # effect on kdtree time ? not much
        N, dim = X.shape
        histo = np.histogramdd( X, nbin )[0] .astype(int)  # 10^dim
        n0 = histo.size - histo.astype(bool).sum()  # uniform: 1/e^lambda
        print "clumpiness: %d of %d^%d data bins are empty  av %.2g  max %d" % (
            n0, nbin, dim, histo.mean(), histo.max())
    
    #...............................................................................
    N = 100000
    nask = 0  # 0: ask all N
    dim = 5
    rnormal = .9
        # KDtree params --
    nnear = 2  # k=nnear+1, self
    leafsize = 10
    eps = 1  # approximate nearest, dist <= (1 + eps) * true nearest
    seed = 1
    
    exec "\n".join( sys.argv[1:] )  # run this.py N= ...
    np.random.seed(seed)
    np.set_printoptions( 2, threshold=200, suppress=True )  # .2f
    nask = nask or N
    print "\nkdtree:  dim=%d  N=%d  nask=%d  nnear=%d  rnormal=%.2g  leafsize=%d  eps=%.2g" % (
        dim, N, nask, nnear, rnormal, leafsize, eps)
    
    if rnormal > 0:  # normal point cloud, .9 => many near 1 1 1 axis
        cov = rnormal * np.ones((dim,dim)) + (1 - rnormal) * np.eye(dim)
        data = np.abs( np.random.multivariate_normal( np.zeros(dim), cov, N )) % 1
            # % 1: wrap to unit cube
    else:
        data = np.random.uniform( size=(N,dim) )
    clumpiness(data)
    ask = data if nask == N  else random.sample( data, nask )
    t = time.time()
    
    #...............................................................................
    datatree = KDTree( data, leafsize=leafsize )  # build the tree
    print "%.1f sec to build KDtree of %d points" % (time.time() - t, N)
    
    t = time.time()
    distances, ix = datatree.query( ask, k=nnear+1, eps=eps )
    print "%.1f sec to query %d points" % (time.time() - t, nask)
    
    distances = distances[:,1:]  # [:,0] is all 0, point to itself
    avdist = distances.mean( axis=0 )
    maxdist = distances.max( axis=0 )
    print "distances to %d nearest: av" % nnear, avdist, "max", maxdist
    
    # kdtree:  dim=5  N=100000  nask=100000  nnear=2  rnormal=0.9  leafsize=10  eps=1
    # clumpiness: 42847 of 10^5 data bins are empty  av 1  max 21
    # 0.4 sec to build KDtree of 100000 points
    # 10.1 sec to query 100000 points
    # distances to 2 nearest: av [ 0.07  0.08] max [ 0.15  0.18]
    
    # kdtree:  dim=5  N=500000  nask=500000  nnear=2  rnormal=0.9  leafsize=10  eps=1
    # clumpiness: 2562 of 10^5 data bins are empty  av 5  max 80
    # 2.5 sec to build KDtree of 500000 points
    # 60.1 sec to query 500000 points
    # distances to 2 nearest: av [ 0.05  0.06] max [ 0.13  0.13]
    # run: 17 Dec 2010 15:23  mac 10.4.11 ppc 
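
    For the MST idea mentioned at the top, here is a rough sketch (illustrative, not tested at 1M points; it assumes the k-nearest-neighbour graph is connected, otherwise you get more than K clusters):

    import numpy as np
    from scipy.spatial import cKDTree
    from scipy.sparse import coo_matrix
    from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

    def mst_clusters(data, K, nnear=10):
        """K clusters by cutting the K-1 longest MST edges of a kNN graph."""
        N = len(data)
        dist, ix = cKDTree(data).query(data, k=nnear + 1)   # column 0 is self
        rows = np.repeat(np.arange(N), nnear)
        graph = coo_matrix((dist[:, 1:].ravel(), (rows, ix[:, 1:].ravel())),
                           shape=(N, N))
        mst = minimum_spanning_tree(graph).tocoo()
        keep = np.argsort(mst.data)[:max(mst.nnz - (K - 1), 0)]  # drop K-1 longest
        cut = coo_matrix((mst.data[keep], (mst.row[keep], mst.col[keep])),
                         shape=(N, N))
        return connected_components(cut, directed=False)[1]      # label per point

    # labels = mst_clusters(data, K=100)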
    
  • 2021-01-30 00:54

    You might like to try my research project called K-tree. It scales well to large inputs compared with k-means and forms a hierarchy of clusters. The trade-off is that it produces clusters with higher distortion. It has an average-case runtime of O(n log n) and a worst case of O(n**2), which only happens if you have some weird topology. More details of the complexity analysis are in my Masters thesis. I have used it with very high-dimensional text data and had no problems.

    Sometimes bad splits can happen in the tree, where all the data goes to one side (cluster). The trunk in SVN deals with this differently than the current release: it randomly splits the data if there is a bad split. The previous method can force the tree to become too deep when there are bad splits.

  • 2021-01-30 00:55

    Put the data into a spatial index such as an R*-tree; then you can run many density-based clustering algorithms (such as DBSCAN or OPTICS) in O(n log n).

    Density-based clustering seems to be precisely what you want ("not too far apart").
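
    For example, with scikit-learn's DBSCAN (which builds a kd-tree/ball-tree index internally rather than an R*-tree; the eps and min_samples values below are placeholders you would tune to your data):

    # Illustrative sketch: density-based clustering on random stand-in data.
    # Points labelled -1 are noise; eps is the "not too far apart" radius.
    import numpy as np
    from sklearn.cluster import DBSCAN

    X = np.random.rand(100000, 5).astype(np.float32)   # stand-in for the real points
    labels = DBSCAN(eps=0.05, min_samples=5, algorithm='ball_tree').fit_predict(X)
    print("clusters found:", labels.max() + 1)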

  • 2021-01-30 00:58

    Consider an approximate nearest neighbor (ANN) algorithm or locality sensitive hashing (LSH). They don't directly solve the clustering problem, but they will be able to tell you which points are "close" to one another. By altering the parameters, you can define close to be as close as you want. And it's fast.

    More precisely, LSH can provide a hash function, h, such that, for two points x and y, and distance metric d,

    d(x,y) <= R1  =>  P(h(x) = h(y)) >= P1
    d(x,y) >= R2  =>  P(h(x) = h(y)) <= P2
    

    where R1 < R2 and P1 > P2. So yes, it is probabilistic. You can postprocess the retrieved data to arrive at true clusters.

    There is plenty of information on LSH online, including the E2LSH manual. ANN is similar in spirit; see David Mount's ANN library, or try FLANN (which has Matlab and Python bindings).
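
    As a tiny illustration of one common LSH family for Euclidean distance (random projections quantized into buckets, in the spirit of E2LSH; the bucket width w and number of projections are made-up values you would tune):

    # h(x) = floor((a . x + b) / w), with Gaussian a and uniform b in [0, w).
    # Points that share a bucket key are candidate "close" pairs to verify.
    import numpy as np
    from collections import defaultdict

    rng = np.random.default_rng(0)
    dim, nproj, w = 5, 4, 0.25              # w roughly matches the "close" radius R1

    A = rng.normal(size=(nproj, dim))       # one Gaussian vector per hash
    b = rng.uniform(0, w, size=nproj)

    def lsh_key(x):
        return tuple(np.floor((A @ x + b) / w).astype(int))

    X = rng.random((100000, dim))
    buckets = defaultdict(list)
    for i, x in enumerate(X):
        buckets[lsh_key(x)].append(i)       # postprocess buckets into true clusters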

  • 2021-01-30 01:01

    People have the impression that k-means is slow, but slowness is really only an issue for the EM algorithm (Lloyd's). Stochastic gradient methods for k-means are orders of magnitude faster than EM (see www.eecs.tufts.edu/~dsculley/papers/fastkmeans.pdf).

    An implementation is here: http://code.google.com/p/sofia-ml/wiki/SofiaKMeans
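
    scikit-learn's MiniBatchKMeans follows the same mini-batch idea from Sculley's paper, if you prefer a Python route; a minimal sketch with placeholder parameters:

    # Illustrative only: mini-batch k-means on random stand-in data.
    import numpy as np
    from sklearn.cluster import MiniBatchKMeans

    X = np.random.rand(1000000, 5).astype(np.float32)
    km = MiniBatchKMeans(n_clusters=1000, batch_size=10000, n_init=3, random_state=0)
    labels = km.fit_predict(X)              # cluster id for each of the 1M points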
