I'm looking for the fastest algorithm for grouping points on a map into equally sized groups, by distance. The k-means clustering algorithm looks straightforward and promising, but it does not produce equally sized groups.
This might do the trick: apply Lloyd's algorithm to get k centroids. Sort the centroids in an array by descending size of their associated clusters. For i = 1 through k-1, push the data points in cluster i with minimal distance to any other centroid j (i < j ≤ k) off to j, and recompute the centroid i (but don't recompute the cluster membership), until the cluster size is n / k.
The complexity of this postprocessing step is O(k² n lg n).
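For concreteness, here is a rough Python sketch of that postprocessing step. The function name rebalance and its interface are my own invention for illustration, not a library API, and this is a sketch of the idea rather than a tuned implementation:

import numpy as np

def rebalance(X, labels, centroids):
    """Hypothetical sketch of the postprocessing step described above:
    trim oversized clusters down to n/k points by handing their most
    'movable' points off to later centroids.
    X: (n, d) data, labels: (n,) cluster indices, centroids: (k, d)."""
    n, k = len(X), len(centroids)
    target = n // k
    order = np.argsort([-np.sum(labels == j) for j in range(k)])  # largest clusters first
    for idx, i in enumerate(order):
        later = order[idx + 1:]                      # clusters j with i < j <= k
        while np.sum(labels == i) > target and len(later) > 0:
            members = np.where(labels == i)[0]
            # distance from each member of cluster i to each later centroid
            d = np.array([[np.linalg.norm(X[p] - centroids[j]) for j in later]
                          for p in members])
            p_idx, j_idx = np.unravel_index(np.argmin(d), d.shape)
            labels[members[p_idx]] = later[j_idx]    # push the closest point off to j
            # recompute centroid i, but don't recompute the cluster memberships
            centroids[i] = X[labels == i].mean(axis=0)
    return labels, centroids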
I recently needed this myself for a not-very-large dataset. My answer, although it has a relatively long running time, is guaranteed to converge to a local optimum.
import numpy

def eqsc(X, K=None, G=None):
    "equal-size clustering based on data exchanges between pairs of clusters"
    from scipy.spatial.distance import pdist, squareform
    from matplotlib import pyplot as plt
    from matplotlib import animation as ani
    from matplotlib.patches import Polygon
    from matplotlib.collections import PatchCollection

    def error(K, m, D):
        """return average distance between data points in one cluster, averaged over all clusters"""
        E = 0
        for k in range(K):
            i = numpy.where(m == k)[0]  # indices of data points belonging to class k
            E += numpy.mean(D[numpy.ix_(i, i)])  # pairwise distances within cluster k
        return E / K

    numpy.random.seed(0)  # repeatability
    N, n = X.shape
    if G is None and K is not None:
        G = N // K  # group size
    elif K is None and G is not None:
        K = N // G  # number of clusters
    else:
        raise Exception('must specify either K or G')
    D = squareform(pdist(X))  # distance matrix
    m = numpy.random.permutation(N) % K  # initial membership
    E = error(K, m, D)
    # visualization
    #FFMpegWriter = ani.writers['ffmpeg']
    #writer = FFMpegWriter(fps=15)
    #fig = plt.figure()
    #with writer.saving(fig, "ec.mp4", 100):
    t = 1
    while True:
        E_p = E
        for a in range(N):  # systematically
            for b in range(a):
                m[a], m[b] = m[b], m[a]  # exchange membership
                E_t = error(K, m, D)
                if E_t < E:
                    E = E_t
                    print("{}: {}<->{} E={}".format(t, a, b, E))
                    #plt.clf()
                    #for i in range(N):
                    #    plt.text(X[i,0], X[i,1], m[i])
                    #writer.grab_frame()
                else:
                    m[a], m[b] = m[b], m[a]  # put them back
        if E_p == E:  # no improving exchange found in a full sweep
            break
        t += 1
    fig, ax = plt.subplots()
    patches = []
    for k in range(K):
        i = numpy.where(m == k)[0]  # indices of data points belonging to class k
        x = X[i]
        patches.append(Polygon(x[:, :2], closed=True))  # how to draw this clockwise?
        u = numpy.mean(x, 0)
        plt.text(u[0], u[1], str(k))
    p = PatchCollection(patches, alpha=0.5)
    ax.add_collection(p)
    plt.show()

if __name__ == "__main__":
    N, n = 100, 2
    X = numpy.random.rand(N, n)
    eqsc(X, G=3)
After reading this question and several similar ones, I created a Python implementation of same-size k-means based on the ELKI tutorial at https://elki-project.github.io/tutorial/same-size_k_means. It uses scikit-learn's K-Means implementation for most of the common methods and exposes the familiar API.
My implementation is found here: https://github.com/ndanielsen/Same-Size-K-Means
The clustering logic is found in the function _labels_inertia_precompute_dense().
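Since it follows the scikit-learn API, usage presumably looks something like the sketch below. The class name and import path are taken from the repository's README; treat them as assumptions rather than a guaranteed interface:

import numpy as np
from clustering.equal_groups import EqualGroupsKMeans  # import path assumed from the repo README

X = np.random.rand(100, 2)
clf = EqualGroupsKMeans(n_clusters=4)  # scikit-learn-style constructor
clf.fit(X)
print(clf.labels_)                     # equal-size cluster assignments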
The ELKI data mining framework has a tutorial on equal-size k-means.
This is not a particularly good algorithm, but it's an easy enough k-means variation to write a tutorial about and to teach people how to implement their own clustering algorithm variations; and apparently some people really need their clusters to have the same size, although the SSQ quality will be worse than with regular k-means.
In ELKI 0.7.5, you can select this algorithm as tutorial.clustering.SameSizeKMeansAlgorithm.
Consider some form of recursive greedy merging: each point begins as a singleton cluster, and you repeatedly merge the two closest clusters whenever the merge does not exceed the maximum size. If you have no choice left but to exceed the maximum size, then locally recluster. This is a form of backtracking hierarchical clustering: http://en.wikipedia.org/wiki/Hierarchical_clustering
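A minimal Python sketch of that idea, using centroid distance between clusters and omitting the local-reclustering backtracking step; all names here are hypothetical, and the quadratic pair search is kept simple rather than fast:

import numpy as np

def greedy_merge(X, max_size):
    """Each point starts as a singleton cluster; repeatedly merge the
    closest pair of clusters whose combined size stays within max_size."""
    clusters = [[i] for i in range(len(X))]
    while True:
        best, best_d = None, np.inf
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                if len(clusters[a]) + len(clusters[b]) > max_size:
                    continue  # this merge would exceed the size cap
                # centroid linkage: distance between the two cluster means
                d = np.linalg.norm(X[clusters[a]].mean(0) - X[clusters[b]].mean(0))
                if d < best_d:
                    best, best_d = (a, b), d
        if best is None:  # no admissible merge left
            return clusters
        a, b = best
        clusters[a] += clusters[b]
        del clusters[b]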
May I humbly suggest that you try this project: ekmeans.
It is a Java K-means clustering implementation with an optional special "equal" option that applies an equal-cardinality constraint to the clusters while keeping them as spatially cohesive as possible.
It is still experimental, so just be aware of the known bugs.