I'm looking for the fastest algorithm for grouping points on a map into equally sized groups, by distance. The k-means clustering algorithm looks straightforward and promis
Equal-size k-means is a special case of a constrained k-means procedure in which each cluster must have a minimum number of points. This problem can be formulated as a graph problem where the nodes are the points to be clustered, and each point has an edge to each centroid, with the edge weight being the squared Euclidean distance to the centroid. It is discussed here:
Bradley PS, Bennett KP, Demiriz A (2000), Constrained K-Means Clustering. Microsoft Research.
A Python implementation is available here.
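For small inputs, the assignment step under a minimum-size constraint can be sketched without a flow solver. The following is not the min-cost flow method from the paper; it is a minimal sketch that recasts the same step as an assignment problem, assuming the centroids are already given. The function name `assign_with_min_size` and its parameters are hypothetical. Each cluster gets `min_size` "required" slots, and the remaining points go to "free" slots that cost the distance to their nearest centroid.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def assign_with_min_size(X, centers, min_size):
    """Assign points to centers so every cluster gets at least min_size points.

    Hypothetical helper: an assignment-problem sketch, not the paper's solver.
    """
    n, k = len(X), len(centers)
    if k * min_size > n:
        raise ValueError("not enough points for the requested minimum size")
    # Squared Euclidean distance from every point to every center, shape (n, k).
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    # min_size required slots per cluster; column j belongs to cluster j // min_size.
    required = np.repeat(d2, min_size, axis=1)
    # Free slots: a point placed here simply goes to its nearest center.
    n_free = n - k * min_size
    free = np.repeat(d2.min(axis=1)[:, None], n_free, axis=1)
    cost = np.hstack([required, free])  # square (n, n) cost matrix
    _, cols = linear_sum_assignment(cost)
    in_required = cols < k * min_size
    return np.where(in_required, cols // max(min_size, 1), d2.argmin(axis=1))


X = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [2, 2], [10, 10]], dtype=float)
centers = np.array([[0, 0], [10, 10]], dtype=float)
labels = assign_with_min_size(X, centers, min_size=2)
```

In this toy data only one point is naturally closest to the second center, so the constraint forces one more point into that cluster. The cost matrix is n by n, so this only makes sense for modest n.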
Try this k-means variation:
Initialization:
Choose k centers from the dataset at random, or even better using the k-means++ strategy.
In the end, you should have a partitioning that satisfies your requirement of the same number of objects per cluster, up to ±1 (make sure the last few clusters also have the right number: with n objects and m = n mod k, the first m clusters should have ceil(n/k) objects, the remainder exactly floor(n/k) objects).
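A minimal sketch of one way to reach such a starting partition is shown here. The greedy order (points closest to a center first) and the fallback to the next nearest center with spare capacity are assumptions, as is the hypothetical name `balanced_init`. Centers are drawn uniformly at random rather than with k-means++ to keep the sketch short.

```python
import numpy as np


def balanced_init(X, k, seed=0):
    """Greedy size-balanced initial assignment (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    # Pick k distinct data points as initial centers.
    centers = X[rng.choice(n, size=k, replace=False)]
    # The first m = n mod k clusters may hold ceil(n/k) points, the rest floor(n/k).
    base, m = divmod(n, k)
    capacity = np.full(k, base)
    capacity[:m] += 1
    # Squared distances from every point to every center, shape (n, k).
    dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = np.full(n, -1)
    # Assumed heuristic: handle points that are closest to some center first.
    for i in np.argsort(dist.min(axis=1)):
        # Take the nearest center that still has room.
        for c in np.argsort(dist[i]):
            if capacity[c] > 0:
                labels[i] = c
                capacity[c] -= 1
                break
    return labels, centers


X = np.random.default_rng(1).normal(size=(10, 2))
labels, centers = balanced_init(X, k=3)
```

Because the capacities sum to n, every point ends up assigned and the cluster sizes are fixed in advance, whatever the data look like.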
Iteration step:
Requisites: a list for each cluster with "swap proposals" (objects that would prefer to be in a different cluster).
E step: compute the updated cluster centers as in regular k-means
M step: Iterating through all points (either just one, or all in one batch)
Compute the nearest cluster center to the object, or all cluster centers that are closer than the current cluster's center. If it is a different cluster, the object is a candidate for a move or a swap (see the swap proposals above).
The cluster sizes remain invariant (± the ceil/floor difference), and objects are only moved from one cluster to another as long as it results in an improvement of the estimation. It should therefore converge at some point, like k-means. It might be a bit slower (i.e. need more iterations), though.
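The iteration can be sketched in a simplified form. Instead of keeping swap-proposal lists, this version checks every pair of points in different clusters directly and swaps their labels whenever that lowers the total squared distance. That is an assumption made for brevity, and it costs O(n²) per pass. The name `refine_equal_size` is hypothetical, and every cluster is assumed to be non-empty on entry.

```python
import numpy as np


def refine_equal_size(X, labels, k, max_iter=100):
    """Pairwise-swap refinement that keeps cluster sizes fixed (hypothetical helper)."""
    labels = np.array(labels).copy()
    for _ in range(max_iter):
        # Update centers as in regular k-means (clusters assumed non-empty).
        centers = np.array([X[labels == c].mean(axis=0) for c in range(k)])
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        swapped = False
        for i in range(len(X)):
            for j in range(i + 1, len(X)):
                a, b = labels[i], labels[j]
                if a == b:
                    continue
                # Reduction in total squared distance if i and j trade clusters.
                gain = dist[i, a] + dist[j, b] - dist[i, b] - dist[j, a]
                if gain > 1e-12:
                    labels[i], labels[j] = b, a
                    swapped = True
        if not swapped:
            break  # no improving swap left for the current centers
    return labels


X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
start = np.array([0, 1, 0, 1, 0, 1])  # deliberately poor, but size-balanced
refined = refine_equal_size(X, start, k=2)
```

Swaps never change how many points a cluster holds, and each swap strictly lowers the objective for the current centers, which is why the loop terminates.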
I do not know if this has been published or implemented before. It's just what I would try (if I were to try k-means; there are much better clustering algorithms).
Just in case anyone wants to copy and paste a short function, here you go. It runs KMeans and then finds the minimum-cost matching of points to clusters, under the constraint that at most cluster_size points are assigned to each cluster.
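from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist
from scipy.optimize import linear_sum_assignment
import numpy as np

def get_even_clusters(X, cluster_size):
    # Number of clusters needed so that none exceeds cluster_size points.
    n_clusters = int(np.ceil(len(X) / cluster_size))
    kmeans = KMeans(n_clusters=n_clusters)
    kmeans.fit(X)
    centers = kmeans.cluster_centers_
    # Repeat each center cluster_size times: one column ("slot") per available place.
    centers = centers.reshape(-1, 1, X.shape[-1]).repeat(cluster_size, 1).reshape(-1, X.shape[-1])
    distance_matrix = cdist(X, centers)
    # Minimum-cost matching of points to slots; slot index // cluster_size is the cluster label.
    clusters = linear_sum_assignment(distance_matrix)[1] // cluster_size
    return clusters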
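To see the slot trick in isolation, here is a minimal sketch with fixed, hand-picked centers instead of a KMeans fit. The name `assign_to_slots` and the toy data are hypothetical. `np.repeat(centers, cluster_size, axis=0)` produces the same repeated-center array as the reshape/repeat/reshape chain above.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist


def assign_to_slots(X, centers, cluster_size):
    """Match points to centers with at most cluster_size points each (hypothetical helper)."""
    # One row per slot: every center appears cluster_size times in a row.
    slots = np.repeat(centers, cluster_size, axis=0)
    # cost[i, j] is the Euclidean distance from point i to the center owning slot j.
    cost = cdist(X, slots)
    # Requires len(X) <= number of slots, otherwise some point has no slot.
    _, slot_of_point = linear_sum_assignment(cost)
    return slot_of_point // cluster_size


# Four points near the first center, two near the second.
X = np.array([[0, 0], [0, 1], [1, 0], [2, 2], [10, 10], [10, 11]], dtype=float)
centers = np.array([[0, 0], [10, 10]], dtype=float)
labels = assign_to_slots(X, centers, cluster_size=3)
```

With a capacity of three, one of the four points near the first center has to move; the matching picks the one for which the move is cheapest.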
Also look at the k-d tree, which partitions the data until each partition has fewer than BUCKET_SIZE members, where BUCKET_SIZE is an input to the algorithm.
This doesn't force the buckets/partitions to be exactly the same size, but they will all be smaller than BUCKET_SIZE.
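A minimal sketch of that idea follows, assuming a median split along the dimension with the widest spread and a stopping rule of "at most bucket_size points" rather than strictly fewer. The name `kd_partition` is hypothetical, and `bucket_size` is assumed to be at least 1.

```python
import numpy as np


def kd_partition(X, bucket_size):
    """Label points by k-d-tree-style leaves of at most bucket_size points (hypothetical helper)."""
    labels = np.empty(len(X), dtype=int)
    next_label = 0
    stack = [np.arange(len(X))]  # index sets still to be processed
    while stack:
        idx = stack.pop()
        if len(idx) <= bucket_size:
            # Small enough: this index set becomes one bucket.
            labels[idx] = next_label
            next_label += 1
            continue
        pts = X[idx]
        # Split along the dimension with the largest extent.
        dim = np.argmax(pts.max(axis=0) - pts.min(axis=0))
        order = np.argsort(pts[:, dim], kind="stable")
        half = len(idx) // 2
        # Median split: lower half and upper half along that dimension.
        stack.append(idx[order[:half]])
        stack.append(idx[order[half:]])
    return labels


X = np.random.default_rng(0).uniform(size=(37, 2))
labels = kd_partition(X, bucket_size=5)
sizes = np.bincount(labels)
```

Because each split is at the median, sibling buckets differ by at most one point, but buckets in different branches can differ by more.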