I'm looking for the fastest algorithm for grouping points on a map into equally sized groups, by distance. The k-means clustering algorithm looks straightforward and promising, but it does not produce equally sized groups.
This might do the trick: apply Lloyd's algorithm to get k centroids. Sort the centroids in an array by descending size of their associated clusters. For i = 1 through k-1, push the data points in cluster i with minimal distance to any other centroid j (i < j ≤ k) off to j, and recompute the centroid i (but don't recompute the cluster membership), until the cluster size is n / k.
The complexity of this postprocessing step is O(k² n lg n).
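For concreteness, here is a rough Python sketch of that postprocessing step. The function name rebalance and its interface are my own invention for illustration, not a library API, and this is a sketch of the idea rather than a tuned implementation:

import numpy as np

def rebalance(X, labels, centroids):
    """Hypothetical sketch of the postprocessing step described above:
    trim oversized clusters down to n/k points by handing their most
    'movable' points off to later centroids.
    X: (n, d) data, labels: (n,) cluster indices, centroids: (k, d)."""
    n, k = len(X), len(centroids)
    target = n // k
    order = np.argsort([-np.sum(labels == j) for j in range(k)])  # largest clusters first
    for idx, i in enumerate(order):
        later = order[idx + 1:]                      # clusters j with i < j <= k
        while np.sum(labels == i) > target and len(later) > 0:
            members = np.where(labels == i)[0]
            # distance from each member of cluster i to each later centroid
            d = np.array([[np.linalg.norm(X[p] - centroids[j]) for j in later]
                          for p in members])
            p_idx, j_idx = np.unravel_index(np.argmin(d), d.shape)
            labels[members[p_idx]] = later[j_idx]    # push the closest point off to j
            # recompute centroid i, but don't recompute the cluster memberships
            centroids[i] = X[labels == i].mean(axis=0)
    return labels, centroids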
I recently needed this myself for a not-very-large dataset. My answer, although it has a relatively long running time, is guaranteed to converge to a local optimum.
import numpy

def eqsc(X, K=None, G=None):
    "equal-size clustering based on data exchanges between pairs of clusters"
    from scipy.spatial.distance import pdist, squareform
    from matplotlib import pyplot as plt
    from matplotlib import animation as ani
    from matplotlib.patches import Polygon
    from matplotlib.collections import PatchCollection

    def error(K, m, D):
        """return average distance between data points in one cluster, averaged over all clusters"""
        E = 0
        for k in range(K):
            i = numpy.where(m == k)[0]  # indices of data points belonging to class k
            E += numpy.mean(D[numpy.ix_(i, i)])  # pairwise distances within cluster k
        return E / K

    numpy.random.seed(0)  # repeatability
    N, n = X.shape
    if G is None and K is not None:
        G = N // K  # group size
    elif K is None and G is not None:
        K = N // G  # number of clusters
    else:
        raise Exception('must specify either K or G')
    D = squareform(pdist(X))  # distance matrix
    m = numpy.random.permutation(N) % K  # initial membership
    E = error(K, m, D)
    # visualization
    #FFMpegWriter = ani.writers['ffmpeg']
    #writer = FFMpegWriter(fps=15)
    #fig = plt.figure()
    #with writer.saving(fig, "ec.mp4", 100):
    t = 1
    while True:
        E_p = E
        for a in range(N):  # systematically
            for b in range(a):
                m[a], m[b] = m[b], m[a]  # exchange membership
                E_t = error(K, m, D)
                if E_t < E:
                    E = E_t
                    print("{}: {}<->{} E={}".format(t, a, b, E))
                    #plt.clf()
                    #for i in range(N):
                    #    plt.text(X[i,0], X[i,1], m[i])
                    #writer.grab_frame()
                else:
                    m[a], m[b] = m[b], m[a]  # put them back
        if E_p == E:  # no improving exchange found in a full sweep
            break
        t += 1
    fig, ax = plt.subplots()
    patches = []
    for k in range(K):
        i = numpy.where(m == k)[0]  # indices of data points belonging to class k
        x = X[i]
        patches.append(Polygon(x[:, :2], closed=True))  # how to draw this clockwise?
        u = numpy.mean(x, 0)
        plt.text(u[0], u[1], str(k))
    p = PatchCollection(patches, alpha=0.5)
    ax.add_collection(p)
    plt.show()

if __name__ == "__main__":
    N, n = 100, 2
    X = numpy.random.rand(N, n)
    eqsc(X, G=3)
After reading this question and several similar ones, I created a Python implementation of same-size k-means based on the ELKI tutorial at https://elki-project.github.io/tutorial/same-size_k_means. It uses scikit-learn's K-Means implementation for most of the common methods and exposes the familiar API.
My implementation is found here: https://github.com/ndanielsen/Same-Size-K-Means
The clustering logic is found in the function _labels_inertia_precompute_dense().
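Since it follows the scikit-learn API, usage presumably looks something like the sketch below. The class name and import path are taken from the repository's README; treat them as assumptions rather than a guaranteed interface:

import numpy as np
from clustering.equal_groups import EqualGroupsKMeans  # import path assumed from the repo README

X = np.random.rand(100, 2)
clf = EqualGroupsKMeans(n_clusters=4)  # scikit-learn-style constructor
clf.fit(X)
print(clf.labels_)                     # equal-size cluster assignments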
The ELKI data mining framework has a tutorial on equal-size k-means.
This is not a particularly good algorithm, but it's an easy enough k-means variation to write a tutorial about and to teach people how to implement their own clustering algorithm variations; and apparently some people really need their clusters to have the same size, although the SSQ quality will be worse than with regular k-means.
In ELKI 0.7.5, you can select this algorithm as tutorial.clustering.SameSizeKMeansAlgorithm.
Consider some form of recursive greedy merging: each point begins as a singleton cluster, and you repeatedly merge the two closest clusters whenever the merge does not exceed the maximum size. If you have no choice left but to exceed the maximum size, then locally recluster. This is a form of backtracking hierarchical clustering: http://en.wikipedia.org/wiki/Hierarchical_clustering
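A minimal Python sketch of that idea, using centroid distance between clusters and omitting the local-reclustering backtracking step; all names here are hypothetical, and the quadratic pair search is kept simple rather than fast:

import numpy as np

def greedy_merge(X, max_size):
    """Each point starts as a singleton cluster; repeatedly merge the
    closest pair of clusters whose combined size stays within max_size."""
    clusters = [[i] for i in range(len(X))]
    while True:
        best, best_d = None, np.inf
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                if len(clusters[a]) + len(clusters[b]) > max_size:
                    continue  # this merge would exceed the size cap
                # centroid linkage: distance between the two cluster means
                d = np.linalg.norm(X[clusters[a]].mean(0) - X[clusters[b]].mean(0))
                if d < best_d:
                    best, best_d = (a, b), d
        if best is None:  # no admissible merge left
            return clusters
        a, b = best
        clusters[a] += clusters[b]
        del clusters[b]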
May I humbly suggest that you try this project: ekmeans.
It is a Java K-means clustering implementation with an optional special "equal" option that applies an equal-cardinality constraint to the clusters while keeping them as spatially cohesive as possible.
It is still experimental, so just be aware of the known bugs.