K-means algorithm variation with equal cluster size

挽巷 2020-11-27 14:26

I'm looking for the fastest algorithm for grouping points on a map into equally sized groups, by distance. The k-means clustering algorithm looks straightforward and promis

16 answers
  • 2020-11-27 14:53

    This might do the trick: apply Lloyd's algorithm to get k centroids. Sort the centroids in an array by descending size of their associated clusters. For i = 1 through k-1, push the data points in cluster i that have minimal distance to any other centroid j (i < j ≤ k) off to j, and recompute centroid i (but don't recompute the cluster), until the cluster size is n / k.

    The complexity of this postprocessing step is O(k² n lg n).
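
The post-processing step described above can be sketched roughly as follows. This is only an illustration under stated assumptions: `lloyd` and `rebalance` are hypothetical helper names, the target size is taken as ceil(n / k), and distances are recomputed on every move instead of being kept sorted, so it does not reach the complexity quoted above. As in the answer, points are only ever pushed to later clusters, so the last cluster simply receives whatever is left and exact equality is not guaranteed.

```python
import numpy as np

def lloyd(X, k, iters=50, seed=0):
    """Plain Lloyd iterations (hypothetical helper, not a library API)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        for c in range(k):
            members = X[labels == c]
            if len(members):  # keep the old centroid if a cluster is empty
                centroids[c] = members.mean(axis=0)
    dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return centroids, dist.argmin(axis=1)

def rebalance(X, centroids, labels):
    """Push surplus points from larger clusters to later ones."""
    X = np.asarray(X, dtype=float)
    n, k = len(X), len(centroids)
    target = -(-n // k)  # assumption: ceil(n / k) as the size cap
    # Sort clusters by descending size and relabel so cluster 0 is the largest.
    order = np.argsort(-np.bincount(labels, minlength=k), kind="stable")
    rank = np.empty(k, dtype=int)
    rank[order] = np.arange(k)
    labels = rank[labels]
    centroids = centroids[order].copy()
    for i in range(k - 1):
        members = np.flatnonzero(labels == i)
        while len(members) > target:
            # Distances from the members of cluster i to every later centroid.
            d = np.linalg.norm(
                X[members, None, :] - centroids[None, i + 1:, :], axis=2)
            p, j = np.unravel_index(d.argmin(), d.shape)
            labels[members[p]] = i + 1 + j   # move the closest point to j
            members = np.delete(members, p)
            centroids[i] = X[members].mean(axis=0)  # recompute centroid i only
    return labels
```

Calling `rebalance(X, *lloyd(X, k))` returns labels numbered in the sorted order, so label 0 is the cluster that was largest after Lloyd's algorithm.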

  • 2020-11-27 14:53

    Recently I needed this myself for a not very large dataset. My answer, although it has a relatively long running time, is guaranteed to converge to a local optimum.

    import numpy

    def eqsc(X, K=None, G=None):
        "equal-size clustering based on data exchanges between pairs of clusters"
        from scipy.spatial.distance import pdist, squareform
        from matplotlib import pyplot as plt
        from matplotlib import animation as ani    
        from matplotlib.patches import Polygon   
        from matplotlib.collections import PatchCollection
        def error(K, m, D):
            """return average distances between data in one cluster, averaged over all clusters"""
            E = 0
            for k in range(K):
                i = numpy.where(m == k)[0] # indices of datapoints belonging to class k
                E += numpy.mean(D[numpy.ix_(i, i)]) # mean over the k-th cluster's block of D
            return E / K
        numpy.random.seed(0) # repeatability
        N, n = X.shape
        if G is None and K is not None:
            G = N // K # group size
        elif K is None and G is not None:
            K = N // G # number of clusters
        else:
            raise Exception('must specify either K or G')
        D = squareform(pdist(X)) # distance matrix
        m = numpy.random.permutation(N) % K # initial membership
        E = error(K, m, D)
        # visualization
        #FFMpegWriter = ani.writers['ffmpeg']
        #writer = FFMpegWriter(fps=15)
        #fig = plt.figure()
        #with writer.saving(fig, "ec.mp4", 100):
        t = 1
        while True:
            E_p = E
            for a in range(N): # systematically
                for b in range(a):
                    m[a], m[b] = m[b], m[a] # exchange membership
                    E_t = error(K, m, D)
                    if E_t < E:
                        E = E_t
                        print("{}: {}<->{} E={}".format(t, a, b, E))
                        #plt.clf()
                        #for i in range(N):
                            #plt.text(X[i,0], X[i,1], m[i])
                        #writer.grab_frame()
                    else:
                        m[a], m[b] = m[b], m[a] # put them back
            if E_p == E:
                break
            t += 1           
        fig, ax = plt.subplots()
        patches = []
        for k in range(K):
            i = numpy.where(m == k)[0] # indices of datapoints belonging to class k
            x = X[i]
            patches.append(Polygon(x[:,:2], closed=True)) # how to draw this clock-wise?
            u = numpy.mean(x, 0)
            plt.text(u[0], u[1], k)
        p = PatchCollection(patches, alpha=0.5)        
        ax.add_collection(p)
        plt.show()
    
    if __name__ == "__main__":
        N, n = 100, 2    
        X = numpy.random.rand(N, n)
        eqsc(X, G=3)
    
  • 2020-11-27 14:53

    After reading this question and several similar ones, I created a Python implementation of same-size k-means based on the ELKI tutorial (https://elki-project.github.io/tutorial/same-size_k_means). It uses scikit-learn's K-Means implementation for most of the common methods and keeps its familiar API.

    My implementation is found here: https://github.com/ndanielsen/Same-Size-K-Means

    The clustering logic is in the function _labels_inertia_precompute_dense().

  • 2020-11-27 14:54

    The ELKI data mining framework has a tutorial on equal-size k-means.

    This is not a particularly good algorithm, but it's an easy enough k-means variation to write a tutorial for, and it teaches people how to implement their own clustering algorithm variation. Apparently some people really need their clusters to have the same size, although the SSQ quality will be worse than with regular k-means.

    In ELKI 0.7.5, you can select this algorithm as tutorial.clustering.SameSizeKMeansAlgorithm.
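
To make the SSQ remark concrete, here is a minimal sketch of how the within-cluster sum of squares could be computed for any labeling, so that an equal-size result can be compared against a regular k-means result on the same data. The function name `ssq` is an assumption made for illustration and is not part of ELKI.

```python
import numpy as np

def ssq(X, labels):
    """Sum of squared distances from each point to the mean of its cluster."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    total = 0.0
    for c in np.unique(labels):
        members = X[labels == c]
        total += ((members - members.mean(axis=0)) ** 2).sum()
    return total
```

For the four points (0, 0), (2, 0), (10, 0), (12, 0), the natural grouping `[0, 0, 1, 1]` gives an SSQ of 4.0, while the interleaved grouping `[0, 1, 0, 1]` gives 100.0. A size constraint can force assignments of the second kind, which is why the SSQ tends to get worse.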

  • 2020-11-27 14:55

    Consider some form of recursive greedy merge: each point begins as a singleton cluster, and you repeatedly merge the two closest clusters, provided the merge doesn't exceed the maximum size. If you have no choice left but to exceed the maximum size, then locally recluster. This is a form of backtracking hierarchical clustering: http://en.wikipedia.org/wiki/Hierarchical_clustering
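
A rough sketch of the size-capped merging part of this idea is shown below. It makes several assumptions that the answer does not state: "closest" is measured between cluster centroids, the function name `capped_greedy_merge` is made up for illustration, and the local reclustering (backtracking) step is left out entirely, so the loop just stops when no merge fits under the cap. It is a naive version that recomputes all centroid distances on every pass, so it only suits small inputs.

```python
import numpy as np

def capped_greedy_merge(X, max_size):
    """Greedy agglomerative merging that never exceeds max_size per cluster."""
    X = np.asarray(X, dtype=float)
    clusters = [[i] for i in range(len(X))]  # every point starts alone
    while True:
        centroids = np.array([X[c].mean(axis=0) for c in clusters])
        sizes = np.array([len(c) for c in clusters])
        dist = np.linalg.norm(
            centroids[:, None, :] - centroids[None, :, :], axis=2)
        # A pair may merge only if the combined size stays within the cap.
        allowed = (sizes[:, None] + sizes[None, :]) <= max_size
        np.fill_diagonal(allowed, False)
        if not allowed.any():
            return clusters  # no permitted merge is left
        dist[~allowed] = np.inf
        a, b = np.unravel_index(dist.argmin(), dist.shape)
        a, b = int(min(a, b)), int(max(a, b))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
```

The result is a list of index lists. The clusters respect the cap but are not guaranteed to be equal in size, which is the gap the reclustering step in the answer is meant to close.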

  • 2020-11-27 14:59

    May I humbly suggest that you try the ekmeans project.

    It is a Java K-means clustering implementation with an optional special "equal" option that applies an equal-cardinality constraint to the clusters while keeping them as spatially cohesive as possible.

    It is still experimental, so be aware of the known bugs.
