I need help selecting or creating a clustering algorithm according to certain criteria.
Imagine you are managing newspaper delivery persons.
What you are describing is a (multi-)vehicle routing problem (VRP). There's quite a lot of academic literature on different variants of this problem, using a large variety of techniques (heuristics, off-the-shelf solvers etc.). Usually the authors try to find good or optimal solutions for a concrete instance, which then also implies a clustering of the sites (all sites on the route of one vehicle).
However, the clusters may change substantially between instances that differ only slightly, which is what you want to avoid. Still, something in the VRP papers may inspire you...
If you decide to stick with the explicit clustering step, don't forget to include your distribution point in all clusters, as it is part of each route.
For evaluating the clusters, using a graph representation of the street grid will probably yield more realistic results than connecting the dots on a white map (although both are TSP variants). If a graph model is not available, you can use the taxicab metric (|x_1 - x_2| + |y_1 - y_2|) as an approximation for the distances.
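To make the taxicab approximation concrete, here is a minimal Java sketch. The class and method names (Taxicab, distance, routeLength) are made up for illustration, and it assumes the addresses have already been projected to planar x/y coordinates in the same unit.

```java
/** Hypothetical helper for illustration; not part of any library. */
final class Taxicab {
    private Taxicab() {}

    /** Taxicab (Manhattan) distance |x1 - x2| + |y1 - y2| between two points. */
    static double distance(double x1, double y1, double x2, double y2) {
        return Math.abs(x1 - x2) + Math.abs(y1 - y2);
    }

    /** Length of a route that visits the points in the given order; each row is {x, y}. */
    static double routeLength(double[][] points) {
        double total = 0;
        for (int i = 1; i < points.length; i++) {
            total += distance(points[i - 1][0], points[i - 1][1],
                              points[i][0], points[i][1]);
        }
        return total;
    }
}
```

Summing routeLength over the route of each cluster (starting and ending at the distribution point) would give a rough figure for comparing two candidate clusterings.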
I acknowledge that this will not necessarily provide clusters of roughly equal size:
One of the best current techniques in data clustering is Evidence Accumulation (Fred and Jain, 2005). What you do is:
Given a data set with n patterns:
1. Use an algorithm like k-means over a range of k, or use a set of different algorithms; the goal is to produce an ensemble of N partitions.
2. Create a co-association matrix C of size n x n.
3. For each partition p in the ensemble:
3.1 Update the co-association matrix: for each pattern pair (i, j) that belongs to the same cluster in p, set C(i, j) = C(i, j) + 1/N.
4. Use a clustering algorithm such as Single Link and apply the matrix C as the proximity measure. Single Link gives a dendrogram as a result, in which we choose the clustering with the longest lifetime.
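The matrix update in step 3.1 can be sketched in Java as follows. This is only an illustration under my own assumptions: the name CoAssociation.build is made up, and each partition is assumed to be given as an array of cluster labels, one label per pattern.

```java
/** Hypothetical helper for illustration; not taken from Fred and Jain's code. */
final class CoAssociation {
    private CoAssociation() {}

    /**
     * partitions[p][i] is the cluster label of pattern i in partition p.
     * Returns the n x n matrix C, where C[i][j] is the fraction of
     * partitions that put patterns i and j into the same cluster.
     */
    static double[][] build(int[][] partitions, int n) {
        double[][] c = new double[n][n];
        int ensembleSize = partitions.length; // N in the description
        for (int[] labels : partitions) {
            for (int i = 0; i < n; i++) {
                for (int j = 0; j < n; j++) {
                    if (labels[i] == labels[j]) {
                        c[i][j] += 1.0 / ensembleSize;
                    }
                }
            }
        }
        return c;
    }
}
```

Note that C is a similarity (higher means "more often together"). If your hierarchical clustering routine expects distances, something like 1 - C(i, j) is a common conversion, but that step is my assumption and not part of the description above.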
I'll provide descriptions of SL and k-means if you're interested.
I've written an inefficient but simple algorithm in Java to see how close I could get to doing some basic clustering on a set of points, more or less as described in the question.
The algorithm works on a list of (x,y) coords ps that are specified as ints. It takes three other parameters as well:
- r: given a point, what is the radius for scanning for nearby points
- maxA: what is the maximum number of addresses (points) per cluster?
- minA: minimum addresses per cluster
Set limitA=maxA.
Main iteration:
- Initialize empty list possibleSolutions.
- Outer iteration: for every point p in ps.
  - Initialize empty list pclusters.
  - A worklist of points wps=copy(ps) is defined.
  - Workpoint wp=p.
  - Inner iteration: while wps is not empty.
    - Remove the point wp in wps.
    - Determine all the points wpsInRadius in wps that are at a distance < r from wp.
    - Sort wpsInRadius ascendingly according to the distance from wp.
    - Keep the first min(limitA, sizeOf(wpsInRadius)) points in wpsInRadius. These points form a new cluster (list of points) pcluster.
    - Add pcluster to pclusters.
    - Remove points in pcluster from wps.
    - If wps is not empty, wp=wps[0] and continue inner iteration.
  - End inner iteration.
  - A list of clusters pclusters is obtained. Add this to possibleSolutions.
- End outer iteration.
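For readers who prefer code, here is a rough Java sketch of one pass of the outer iteration (everything done for a single starting point p). It is not the code I actually ran: RadiusClustering, Point and clustersFrom are names invented for this sketch, it uses straight-line distance, and it assumes that the work point wp itself counts as the first member of its cluster, with the cluster capped at limitA points in total. The steps above leave that last detail open.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch of the inner iteration; names are invented. */
final class RadiusClustering {
    private RadiusClustering() {}

    /** A point with int coordinates; equality is by identity. */
    static final class Point {
        final int x;
        final int y;

        Point(int x, int y) {
            this.x = x;
            this.y = y;
        }

        double distanceTo(Point o) {
            return Math.hypot(x - o.x, y - o.y);
        }
    }

    /** Builds pclusters for one starting point; start must be an element of ps. */
    static List<List<Point>> clustersFrom(Point start, List<Point> ps, double r, int limitA) {
        List<Point> wps = new ArrayList<>(ps);          // worklist, keeps input order
        List<List<Point>> pclusters = new ArrayList<>();
        Point wp = start;
        while (!wps.isEmpty()) {
            wps.remove(wp);
            final Point seed = wp;
            // Collect the remaining points closer than r to the seed.
            List<Point> wpsInRadius = new ArrayList<>();
            for (Point q : wps) {
                if (seed.distanceTo(q) < r) {
                    wpsInRadius.add(q);
                }
            }
            // Stable sort, so equal distances keep their input order.
            wpsInRadius.sort(Comparator.comparingDouble(seed::distanceTo));
            // Assumption: the seed is part of the cluster, capped at limitA in total.
            int keep = Math.max(0, Math.min(limitA - 1, wpsInRadius.size()));
            List<Point> pcluster = new ArrayList<>();
            pcluster.add(seed);
            pcluster.addAll(wpsInRadius.subList(0, keep));
            pclusters.add(pcluster);
            wps.removeAll(pcluster);
            if (!wps.isEmpty()) {
                wp = wps.get(0);
            }
        }
        return pclusters;
    }
}
```

Calling clustersFrom once for every p in ps and collecting the results would give the possibleSolutions list.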
We have for each p in ps a list of clusters pclusters in possibleSolutions. Every pclusters is then weighted. If avgPC is the average number of points per cluster in possibleSolutions (global) and avgCSize is the average number of clusters per pclusters (global), then this is the function that uses both these variables to determine the weight:
    private static WeightedPClusters weigh(List<Cluster> pclusters, double avgPC, double avgCSize)
    {
        double weight = 0;
        for (Cluster cluster : pclusters)
        {
            int ps = cluster.getPoints().size();
            // Penalize clusters whose size deviates from the global average.
            double psAvgPC = ps - avgPC;
            weight += psAvgPC * psAvgPC / avgCSize;
            // Add the cluster's surface per point, so spread-out clusters weigh more.
            weight += cluster.getSurface() / ps;
        }
        return new WeightedPClusters(pclusters, weight);
    }
The best solution is now the pclusters with the least weight. We repeat the main iteration as long as we can find a better solution (less weight) than the previous best one with limitA=max(minA,(int)avgPC). End main iteration.
Note that for the same input data this algorithm will always produce the same results. Lists are used to preserve order and there is no randomness involved.
To see how this algorithm behaves, this is an image of the result on a test pattern of 32 points. If maxA=minA=16, then we find 2 clusters of 16 addresses.
Next, if we decrease the minimum number of addresses per cluster by setting minA=12, we find 3 clusters of 12/12/8 points.
And to demonstrate that the algorithm is far from perfect, here is the output with maxA=7, yet we get 6 clusters, some of them small. So you still have to guess too much when determining the parameters. Note that r here is only 5.
Just out of curiosity, I tried the algorithm on a larger set of randomly chosen points. I added the images below.
Conclusion? This took me half a day, it is inefficient, the code looks ugly, and it is relatively slow. But it shows that it is possible to produce some result in a short period of time. Of course, this was just for fun; turning this into something that is actually useful is the hard part.
A trivial answer which does not get any bonus points:
One delivery person for each address.
Good survey of simple clustering algos. There is more though: http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/index.html
- You have a set of street addresses, each of which is geocoded.
- You want to cluster the addresses so that each cluster is assigned to a delivery person.
- The number of delivery persons, or clusters, is not fixed. If needed, I can always hire more delivery persons, or lay them off.
- Each cluster should have about the same number of addresses. However, a cluster may have fewer addresses if its addresses are more spread out. (Worded another way: minimum number of clusters where each cluster contains at most a maximum number of addresses, and any two addresses within a cluster are separated by at most a maximum distance.)
- For bonus points, when the data set is altered (address added or removed), and the algorithm is re-run, it would be nice if the clusters remained as unchanged as possible (i.e. this rules out simple k-means clustering, which is random in nature). Otherwise the delivery persons will go crazy.
As has been mentioned, a Vehicle Routing Problem is probably better suited... Although it strictly isn't designed with clustering in mind, it will optimize to assign based on the nearest addresses. Therefore your clusters will actually be the recommended routes.
If you provide a maximum number of deliverers and then try to reach the optimal solution, this should tell you the minimum that you require. This deals with point 2.
The same number of addresses can be obtained by providing a limit on the number of addresses to be visited, basically assigning a stock value (now it's a capacitated vehicle routing problem).
Adding time windows or hours that the delivery persons work helps reduce the load if addresses are more spread out (now a capacitated vehicle routing problem with time windows).
If you use a nearest neighbour algorithm then you can get identical results each time. Removing a single address shouldn't have too much impact on your final result, so this should deal with the last point.
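To illustrate the nearest neighbour idea combined with a capacity limit, here is a small Java sketch. It is not the C# library mentioned below and not a full VRP solver, just a simple construction heuristic under my own assumptions: NearestNeighbourRoutes.build is a made-up name, distances are taxicab distances on planar coordinates, every route starts from the same depot, and ties go to the address that comes first in the input.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a capacity-limited nearest neighbour heuristic. */
final class NearestNeighbourRoutes {
    private NearestNeighbourRoutes() {}

    /**
     * depot is {x, y}; addresses[i] is {x, y}.
     * Returns the routes as lists of address indices, each with at most
     * capacity addresses.
     */
    static List<List<Integer>> build(double[] depot, double[][] addresses, int capacity) {
        if (capacity < 1) {
            throw new IllegalArgumentException("capacity must be at least 1");
        }
        boolean[] visited = new boolean[addresses.length];
        List<List<Integer>> routes = new ArrayList<>();
        int remaining = addresses.length;
        while (remaining > 0) {
            List<Integer> route = new ArrayList<>();
            double[] current = depot; // every route starts at the depot
            while (route.size() < capacity && remaining > 0) {
                int best = -1;
                double bestDist = Double.POSITIVE_INFINITY;
                for (int i = 0; i < addresses.length; i++) {
                    if (visited[i]) {
                        continue;
                    }
                    double d = Math.abs(current[0] - addresses[i][0])
                             + Math.abs(current[1] - addresses[i][1]);
                    if (d < bestDist) { // strict '<' keeps the earliest index on ties
                        bestDist = d;
                        best = i;
                    }
                }
                visited[best] = true;
                route.add(best);
                current = addresses[best];
                remaining--;
            }
            routes.add(route);
        }
        return routes;
    }
}
```

Because nothing here is random, the same input always yields the same routes. How stable the routes stay when one address is added or removed depends on the data, so treat that part as something to test rather than a guarantee.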
I'm actually working on a C# class library to achieve something like this, and think it's probably the best route to go down, although not necessarily easy to implement.