Fast (< n^2) clustering algorithm


I have 1 million 5-dimensional points that I need to group into k clusters with k << 1 million. In each cluster, no two points should be too far apart (e.g. they could be …)

6 Answers

    Below is a little test bench to see how fast scipy.spatial.cKDTree is on your data, and to get a rough idea of how the distances between nearby points scatter.

    A nice way to get K clusters, for various K, is to build an MST of nearest pairs and remove the K-1 longest edges; see Wayne, Greedy Algorithms .
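
    A rough sketch of that MST idea using scipy.sparse.csgraph -- not part of the original answer, so the function name and parameters are only illustrative: build a k-nearest-neighbour graph with cKDTree, take its minimum spanning tree, drop the K-1 longest edges, and read the connected components off as cluster labels. With only a few neighbours per point the graph may already fall apart into more than K pieces, so treat this as a starting point, not a finished clusterer.

    # sketch: K clusters via an MST of a k-nearest-neighbour graph
    import numpy as np
    from scipy.sparse import coo_matrix
    from scipy.sparse.csgraph import minimum_spanning_tree, connected_components
    from scipy.spatial import cKDTree as KDTree

    def mst_kcluster( X, K, nnear=10 ):
        """ X: N x dim points  ->  (ncomp, labels) from a single-linkage-style cut """
        N = len(X)
        dist, ix = KDTree( X ).query( X, k=nnear+1 )  # ix[:,0] is each point itself
        rows = np.repeat( np.arange(N), nnear )
        nngraph = coo_matrix( (dist[:,1:].ravel(), (rows, ix[:,1:].ravel())), shape=(N,N) )
        mst = minimum_spanning_tree( nngraph ).tocoo()  # spanning forest, weights = distances
        keep = np.argsort( mst.data )[:max( len(mst.data) - (K-1), 0 )]  # drop the K-1 longest
        pruned = coo_matrix( (mst.data[keep], (mst.row[keep], mst.col[keep])), shape=(N,N) )
        return connected_components( pruned, directed=False )  # ncomp may exceed K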

    Visualizing the clusters would be fun -- project to 2d with PCA ?
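
    For the 2d projection, plain numpy PCA is enough -- again a sketch, not from the original answer; labels below would come from whatever clustering you run (e.g. the MST sketch above), and matplotlib is assumed for the plot:

    # sketch: project the 5d points to 2d with PCA, colour by cluster label
    import numpy as np

    def pca2d( X ):
        Xc = X - X.mean(axis=0)                        # centre the data
        U, s, Vt = np.linalg.svd( Xc, full_matrices=False )
        return Xc.dot( Vt[:2].T )                      # coordinates along the top 2 components

    # import matplotlib.pyplot as plt
    # XY = pca2d(data)                                 # data as in the test bench below
    # plt.scatter( XY[:,0], XY[:,1], c=labels, s=1 )   # labels: hypothetical cluster ids
    # plt.show()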

    (Just curious, is your K 10, 100, 1000 ?)

    Added 17 Dec: real runtimes: 100000 x 5 in 10 sec, 500000 x 5 in 60 sec.

    #!/usr/bin/env python
    # time scipy.spatial.cKDTree build, query
    
    from __future__ import division
    import random
    import sys
    import time
    import numpy as np
    from scipy.spatial import cKDTree as KDTree
        # http://docs.scipy.org/doc/scipy/reference/spatial.html
        # $scipy/spatial/kdtree.py is slow but clean, 0.9 has cython
    __date__ = "2010-12-17 dec denis"
    
    def clumpiness( X, nbin=10 ):
        """ how clumpy is X ? histogramdd av, max """
            # effect on kdtree time ? not much
        N, dim = X.shape
        histo = np.histogramdd( X, nbin )[0] .astype(int)  # 10^dim
        n0 = histo.size - histo.astype(bool).sum()  # uniform: 1/e^lambda
        print "clumpiness: %d of %d^%d data bins are empty  av %.2g  max %d" % (
            n0, nbin, dim, histo.mean(), histo.max())
    
    #...............................................................................
    N = 100000
    nask = 0  # 0: ask all N
    dim = 5
    rnormal = .9
        # KDtree params --
    nnear = 2  # k=nnear+1, self
    leafsize = 10
    eps = 1  # approximate nearest, dist <= (1 + eps) * true nearest
    seed = 1
    
    exec "\n".join( sys.argv[1:] )  # run this.py N= ...
    np.random.seed(seed)
    np.set_printoptions( 2, threshold=200, suppress=True )  # .2f
    nask = nask or N
    print "\nkdtree:  dim=%d  N=%d  nask=%d  nnear=%d  rnormal=%.2g  leafsize=%d  eps=%.2g" % (
        dim, N, nask, nnear, rnormal, leafsize, eps)
    
    if rnormal > 0:  # normal point cloud, .9 => many near 1 1 1 axis
        cov = rnormal * np.ones((dim,dim)) + (1 - rnormal) * np.eye(dim)
        data = np.abs( np.random.multivariate_normal( np.zeros(dim), cov, N )) % 1
            # % 1: wrap to unit cube
    else:
        data = np.random.uniform( size=(N,dim) )
    clumpiness(data)
    ask = data if nask == N  else random.sample( data, nask )  # random subset of nask points
    t = time.time()
    
    #...............................................................................
    datatree = KDTree( data, leafsize=leafsize )  # build the tree
    print "%.1f sec to build KDtree of %d points" % (time.time() - t, N)
    
    t = time.time()
    distances, ix = datatree.query( ask, k=nnear+1, eps=eps )
    print "%.1f sec to query %d points" % (time.time() - t, nask)
    
    distances = distances[:,1:]  # [:,0] is all 0, point to itself
    avdist = distances.mean( axis=0 )
    maxdist = distances.max( axis=0 )
    print "distances to %d nearest: av" % nnear, avdist, "max", maxdist
    
    # kdtree:  dim=5  N=100000  nask=100000  nnear=2  rnormal=0.9  leafsize=10  eps=1
    # clumpiness: 42847 of 10^5 data bins are empty  av 1  max 21
    # 0.4 sec to build KDtree of 100000 points
    # 10.1 sec to query 100000 points
    # distances to 2 nearest: av [ 0.07  0.08] max [ 0.15  0.18]
    
    # kdtree:  dim=5  N=500000  nask=500000  nnear=2  rnormal=0.9  leafsize=10  eps=1
    # clumpiness: 2562 of 10^5 data bins are empty  av 5  max 80
    # 2.5 sec to build KDtree of 500000 points
    # 60.1 sec to query 500000 points
    # distances to 2 nearest: av [ 0.05  0.06] max [ 0.13  0.13]
    # run: 17 Dec 2010 15:23  mac 10.4.11 ppc 
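
    The exec line near the top of the script lets you override any of the parameters from the command line, one Python assignment per argument; for example (the filename is just a placeholder):

    python this.py N=500000 nnear=2 leafsize=20 eps=0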
    
