K-means algorithm variation with equal cluster size

挽巷 2020-11-27 14:26

I'm looking for the fastest algorithm for grouping points on a map into equally sized groups, by distance. The k-means clustering algorithm looks straightforward and promising, but does not produce equally sized groups.

16 Answers
  • 2020-11-27 15:00

    There is a cleaner post-processing step, given the cluster centroids. Let N be the number of items, K the number of clusters, and S = ceil(N/K) the maximum cluster size.

    • Create a list of tuples (item_id, cluster_id, distance)
    • Sort tuples with respect to distance
    • For each element (item_id, cluster_id, distance) in the sorted list of tuples:
      • if item_id is already assigned, or cluster cluster_id already holds S elements, do nothing;
      • otherwise, add item_id to cluster cluster_id.

    This runs in O(NK lg(N)), should give comparable results to @larsmans' solution, and is easier to implement. In pseudo-Python (written so it runs under Python 3; a sketch of the inputs it assumes follows the snippet):

    # Build all (item, cluster, distance) tuples, sort by distance,
    # then assign greedily while capping each cluster at S items.
    dists = []
    clusts = [None] * N
    counts = [0] * K

    for i, v in enumerate(items):
        for k, c in enumerate(centroids):
            dists.append((i, k, dist(v, c)))

    dists.sort(key=lambda t: t[2])

    for item_id, cluster_id, d in dists:
        if counts[cluster_id] >= S:
            continue
        if clusts[item_id] is None:
            clusts[item_id] = cluster_id
            counts[cluster_id] += 1
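    For completeness, a minimal setup sketch for the inputs the snippet assumes (items, centroids, dist, N, K, S). Using scikit-learn for the initial centroids is an assumption here; any k-means implementation works:

    import numpy as np
    from sklearn.cluster import KMeans

    items = np.random.rand(100, 2)       # toy data, stands in for the map points
    N, K = len(items), 4
    S = -(-N // K)                       # S = ceil(N / K), the maximum cluster size
    centroids = KMeans(n_clusters=K, n_init=10).fit(items).cluster_centers_

    def dist(a, b):
        return float(np.linalg.norm(np.asarray(a) - np.asarray(b)))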
    
  • 2020-11-27 15:06

    You can view the distances as defining a weighted graph. Quite a few graph partitioning algorithms are explicitly based on trying to partition the graph vertices into two sets of equal size. This appears in, for example, the Kernighan-Lin algorithm and in spectral graph partitioning using the Laplacian. To get multiple clusters, you can recursively apply the partitioning algorithm; there's a nice discussion of this at the link on spectral graph partitioning.
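    A rough sketch of the recursive-bisection idea (my own illustration, not code from this answer): build a similarity-weighted graph Laplacian from the distance matrix, split at the median of the Fiedler vector so both halves have equal size, and recurse. It assumes a dense numpy distance matrix D and works cleanly when the number of clusters is a power of two:

    import numpy as np

    def spectral_bisect(D, idx):
        """Split index array idx into two equal halves via the Fiedler vector."""
        W = np.exp(-D[np.ix_(idx, idx)])     # similarity weights from distances
        L = np.diag(W.sum(axis=1)) - W       # graph Laplacian
        _, vecs = np.linalg.eigh(L)          # eigenvectors, ascending eigenvalues
        order = np.argsort(vecs[:, 1])       # sort by the Fiedler vector
        half = len(idx) // 2                 # median split -> equal halves
        return idx[order[:half]], idx[order[half:]]

    def equal_partition(D, k):
        groups = [np.arange(D.shape[0])]
        while len(groups) < k:               # keep bisecting the largest group
            groups.sort(key=len, reverse=True)
            a, b = spectral_bisect(D, groups.pop(0))
            groups += [a, b]
        return groups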

  • 2020-11-27 15:08

    You may want to take a look at space-filling curves, for example a Z-curve or a Hilbert curve. A space-filling curve reduces the 2-dimensional problem to a 1-dimensional one. Although the SFC index is only a reordering of the 2-dimensional data and not a perfect clustering of it, it can be useful when the solution does not have to be a perfect clustering and has to be computed fairly fast.
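    A minimal sketch of that idea (my own illustration): compute a Morton (Z-order) index by bit interleaving, sort the points along the curve, and slice the ordering into equal-size chunks. It assumes points normalized to [0, 1)^2:

    import numpy as np

    def morton_index(x, y, bits=16):
        """Interleave the bits of the quantized x and y coordinates."""
        xi, yi = int(x * (1 << bits)), int(y * (1 << bits))
        z = 0
        for b in range(bits):
            z |= ((xi >> b) & 1) << (2 * b)
            z |= ((yi >> b) & 1) << (2 * b + 1)
        return z

    points = np.random.rand(1000, 2)
    order = sorted(range(len(points)), key=lambda i: morton_index(*points[i]))
    k = 10
    size = -(-len(points) // k)          # ceil(N / k) points per group
    groups = [order[i:i + size] for i in range(0, len(order), size)]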

  • 2020-11-27 15:10

    Added Jan 2012: better than post-processing would be to keep cluster sizes roughly equal as they grow.
    For example, find for each X the 3 nearest centres, then add X to the one with the best distance - λ clustersize.
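    A sketch of one reading of that rule (my own illustration; the λ value and the sign convention, which penalizes fuller clusters, are assumptions, since the note above leaves them open):

    import numpy as np
    from scipy.spatial.distance import cdist

    def online_balanced_assign(X, centres, lam=0.5):
        """Assign each point to one of its 3 nearest centres, penalizing
        centres whose clusters have already grown large."""
        D = cdist(X, centres)                    # Npt x C distance matrix
        sizes = np.zeros(len(centres), dtype=int)
        labels = np.empty(len(X), dtype=int)
        for i in np.random.permutation(len(X)):  # visit points in random order
            near3 = np.argsort(D[i])[:3]         # the 3 nearest centres
            score = D[i, near3] + lam * sizes[near3]
            labels[i] = near3[np.argmin(score)]
            sizes[labels[i]] += 1
        return labels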


    A simple greedy postprocess after k-means may be good enough, if your clusters from k-means are roughly equal-sized.
    (This approximates an assignment algorithm on the Npt x C distance matrix from k-means.)

    One could iterate

    diffsizecentres = kmeans( X, centres, ... )
    X_centre_distances = scipy.spatial.distance.cdist( X, diffsizecentres )
        # or just the nearest few centres
    xtoc = samesizecluster( X_centre_distances )
    samesizecentres = [X[xtoc[c]].mean(axis=0) for c in range(k)]
    ...
    

    I'd be surprised if iterations changed the centres much, but it'll depend ™.

    About how big are your Npoint, Ncluster and Ndim?

    #!/usr/bin/env python
    # Python 3 version: greedy same-size assignment
    # from a sorted point-to-centre distance matrix.
    from operator import itemgetter
    import numpy as np

    __date__ = "2011-03-28 Mar denis"

    def samesizecluster( D ):
        """ in: point-to-cluster-centre distances D, Npt x C
                e.g. from scipy.spatial.distance.cdist
            out: xtoc, X -> C, equal-size clusters
            method: sort all D, greedy
        """
            # could take only the nearest few x-to-C distances
            # add constraints to real assignment algorithm ?
        Npt, C = D.shape
        clustersize = (Npt + C - 1) // C
        xcd = list( np.ndenumerate(D) )  # ((0,0), d00), ((0,1), d01) ...
        xcd.sort( key=itemgetter(1) )
        xtoc = np.ones( Npt, int ) * -1  # -1 marks "not yet assigned"
        nincluster = np.zeros( C, int )
        nall = 0
        for (x,c), d in xcd:
            if xtoc[x] < 0  and  nincluster[c] < clustersize:
                xtoc[x] = c
                nincluster[c] += 1
                nall += 1
                if nall >= Npt:  break
        return xtoc

    #...............................................................................
    if __name__ == "__main__":
        import sys
        from scipy.spatial import distance
        # http://docs.scipy.org/doc/scipy/reference/spatial.distance.html

        Npt = 100
        C = 3
        dim = 3
        seed = 1

        exec( "\n".join( sys.argv[1:] ))  # run this.py N= ...
        np.set_printoptions( 2, threshold=200, edgeitems=5, suppress=True )  # .2f
        np.random.seed(seed)

        X = np.random.uniform( size=(Npt,dim) )
        centres = X[np.random.choice( Npt, C, replace=False )]  # random.sample fails on ndarrays in Python 3
        D = distance.cdist( X, centres )
        xtoc = samesizecluster( D )
        print( "samesizecluster sizes:", np.bincount(xtoc) )
            # Npt=100 C=3 -> 32 34 34
    
  • 2020-11-27 15:10

    I've been struggling with how to solve this question too. However, I realized that I had been using the wrong keyword all this time: if you want each resulting group to contain the same number of points, you are doing grouping, not clustering anymore. I was finally able to solve the problem with a simple Python script and a PostGIS query.

    For example, I have a table called tb_points which has 4000 coordinate points, and I want to divide it into 10 equal-size groups, each containing 400 coordinate points. Here is an example of the table structure:

    CREATE TABLE tb_points (
      id SERIAL PRIMARY KEY,
      outlet_id INTEGER,
      longitude FLOAT,
      latitude FLOAT,
      group_id INTEGER
    );
    

    Then what you need to do is:

    1. Find the first coordinate; this will be your starting point
    2. Find the coordinates nearest to your starting point, ordered by distance ascending, limiting the result to your preferred number of members (in this case, 400)
    3. Update the result by setting the group_id column
    4. Repeat the three steps above 10 times for the rest of the data, whose group_id column is still NULL

    This is the implementation in python:

    import psycopg2
    
    dbhost = ''
    dbuser = ''
    dbpass = ''
    dbname = ''
    dbport = 5432
    
    conn = psycopg2.connect(host = dbhost,
           user = dbuser,
           password = dbpass,
           database = dbname,
           port = dbport)
    
    def fetch(sql):
        cursor = conn.cursor()
        rs = None
        try:
            cursor.execute(sql)
            rs = cursor.fetchall()
        except psycopg2.Error as e:
            print(e.pgerror)
            rs = 'error'
        cursor.close()
        return rs
    
    def execScalar(sql):
        cursor = conn.cursor()
        try:
            cursor.execute(sql)
            conn.commit()
            rowsaffected = cursor.rowcount
        except psycopg2.Error as e:
            print(e.pgerror)
            rowsaffected = -1
            conn.rollback()
        cursor.close()
        return rowsaffected
    
    
    def select_first_cluster_id():
        # Among the still-unassigned points, pick one end of the
        # farthest-apart pair as the next group's starting point.
        sql = """ SELECT a.outlet_id as ori_id, a.longitude as ori_lon,
        a.latitude as ori_lat, b.outlet_id as dest_id, b.longitude as
        dest_lon, b.latitude as dest_lat,
        ST_Distance(CAST(ST_SetSRID(ST_Point(a.longitude,a.latitude),4326)
        AS geography), 
        CAST(ST_SetSRID(ST_Point(b.longitude,b.latitude),4326) AS geography))
        AS air_distance FROM  tb_points a CROSS JOIN tb_points b WHERE
        a.outlet_id != b.outlet_id and a.group_id is NULL and b.group_id is
        null order by air_distance desc limit 1 """
        return sql
    
    def update_group_id(group_id, ori_id, limit_constraint):
        # Assign group_id to the limit_constraint unassigned points
        # nearest to the starting point ori_id.
        sql = """ UPDATE tb_points
        set group_id = %s
        where outlet_id in
        (select b.outlet_id
        from tb_points a,
        tb_points b
        where a.outlet_id = '%s'
        and a.group_id is null
        and b.group_id is null
        order by ST_Distance(CAST(ST_SetSRID(ST_Point(a.longitude,a.latitude),4326) AS geography),
        CAST(ST_SetSRID(ST_Point(b.longitude,b.latitude),4326) AS geography)) asc
        limit %s)
        """ % (group_id, ori_id, limit_constraint)
        return sql
    
    def clustering():
        data_constraint = [400]  # points per group: 4000 points / 10 groups, per the example above
        n = 1
        while n <= 10:
            sql = select_first_cluster_id()
            res = fetch(sql)
            ori_id = res[0][0]
    
            sql = update_group_id(n, ori_id, data_constraint[0])
            print(sql)
            execScalar(sql)
    
            n += 1
    
    clustering()
    

    Hope it helps

  • 2020-11-27 15:11

    In general, grouping points on a map into equally sized groups by distance is an impossible task in theory, because grouping points into equally sized groups conflicts with grouping points into clusters by distance.

    There are four points:

    A.[1,1]
    B.[1,2]
    C.[2,2]
    D.[5,5]
    

    If we cluster these points into two clusters, then obviously (A, B, C) will be cluster 1 and D will be cluster 2. But if we need equally sized groups, (A, B) will be one cluster and (C, D) the other. This violates the clustering rule, because C is closer to the centre of (A, B) yet it belongs to cluster (C, D). So the requirements of clustering and of equally sized groups cannot be satisfied at the same time.
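    A quick numeric check of that claim (my own snippet, not part of the original answer):

    import math

    A, B, C, D = (1, 1), (1, 2), (2, 2), (5, 5)
    dist = lambda p, q: math.hypot(p[0] - q[0], p[1] - q[1])
    ab_centre = ((A[0] + B[0]) / 2, (A[1] + B[1]) / 2)   # (1.0, 1.5)
    print(dist(C, ab_centre))   # ~1.12: C is much closer to the (A, B) centre
    print(dist(C, D))           # ~4.24: yet equal sizing puts C with D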
