How Could One Implement the K-Means++ Algorithm?

I am having trouble fully understanding the K-Means++ algorithm. I am interested in exactly how the first k centroids are picked, namely the initialization as the

3 Answers

爱一瞬间的悲伤

2020-12-04 09:02
Interesting question. Thank you for bringing this paper to my attention - K-Means++: The Advantages of Careful Seeding

In simple terms, cluster centers are initially chosen at random from the set of input observation vectors, where the probability of choosing vector x is high if x is not near any previously chosen centers.

Here is a one-dimensional example. Our observations are [0, 1, 2, 3, 4]. Let the first center, c1, be 0. The probability that the next cluster center, c2, is x is proportional to ||c1-x||^2. So, P(c2 = 1) = 1a, P(c2 = 2) = 4a, P(c2 = 3) = 9a, P(c2 = 4) = 16a, where a = 1/(1+4+9+16).

Suppose c2=4. Then, P(c3 = 1) = 1a, P(c3 = 2) = 4a, P(c3 = 3) = 1a, where a = 1/(1+4+1).
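The arithmetic in this example can be checked with a few lines of NumPy. This is only a minimal sketch for one-dimensional observations; the helper name `next_center_probs` is made up for illustration and is not part of any library.

```python
import numpy as np

def next_center_probs(points, centers):
    """Probability of each 1-D point becoming the next center (D^2 weighting)."""
    points = np.asarray(points, dtype=float)
    centers = np.asarray(centers, dtype=float)
    # Squared distance from every point to its nearest already-chosen center.
    d2 = np.min((points[:, None] - centers[None, :]) ** 2, axis=1)
    # Normalize so the weights sum to 1 (this is the factor "a" in the text).
    return d2 / d2.sum()

obs = [0, 1, 2, 3, 4]
p_c2 = next_center_probs(obs, [0])     # weights 0, 1, 4, 9, 16 divided by 30
p_c3 = next_center_probs(obs, [0, 4])  # weights 0, 1, 4, 1, 0 divided by 6
print(p_c2)
print(p_c3)
```

Points that are already centers get weight 0, so they cannot be picked again.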

I've coded the initialization procedure in Python; I don't know if this helps you.
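```
import numpy as np

def initialize(X, K):
    # X: array-like of points (one per row); K: number of centers to pick.
    X = np.asarray(X, dtype=float)
    C = [X[0]]  # the first center is simply the first point here
    for k in range(1, K):
        # Squared distance from each point to its nearest chosen center.
        D2 = np.array([min(np.inner(c - x, c - x) for c in C) for x in X])
        probs = D2 / D2.sum()
        cumprobs = probs.cumsum()
        r = np.random.rand()
        i = len(X) - 1  # fallback if rounding leaves cumprobs[-1] just below r
        for j, p in enumerate(cumprobs):
            if r < p:
                i = j
                break
        C.append(X[i])
    return C
```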
EDIT with clarification: The output of cumsum gives the boundaries that partition the interval [0, 1]. Each partition has length equal to the probability of the corresponding point being chosen as a center. Since r is drawn uniformly from [0, 1), it falls into exactly one of these partitions. The for loop finds that partition: it stops (via break) at the first boundary that exceeds r.

Example:
```
probs = [0.1, 0.2, 0.3, 0.4]
cumprobs = [0.1, 0.3, 0.6, 1.0]
if r < cumprobs[0]:
    # this event has probability 0.1
    i = 0
elif r < cumprobs[1]:
    # this event has probability 0.2
    i = 1
elif r < cumprobs[2]:
    # this event has probability 0.3
    i = 2
elif r < cumprobs[3]:
    # this event has probability 0.4
    i = 3
```
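The same lookup can be done without an explicit loop. Here is a hedged sketch using `numpy.searchsorted`; the helper name `pick_index` is hypothetical, and the final `min` is an added guard against rounding in the last cumulative value, not something the code above does.

```python
import numpy as np

def pick_index(cumprobs, r):
    """Index of the partition of [0, 1) that contains r."""
    # side="right" returns the first j with r < cumprobs[j], like the loop.
    j = int(np.searchsorted(cumprobs, r, side="right"))
    # Guard: rounding can leave cumprobs[-1] slightly below 1.
    return min(j, len(cumprobs) - 1)

cumprobs = np.cumsum([0.1, 0.2, 0.3, 0.4])
picks = [pick_index(cumprobs, r) for r in (0.05, 0.25, 0.59, 0.99)]
print(picks)
```

If the cumulative sums are not needed for anything else, `np.random.choice(len(probs), p=probs)` draws an index with the same distribution in one call.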
盖世英雄少女心

2020-12-04 09:05

One Liner.

Say we need to select 2 cluster centers. Instead of selecting them all randomly (as we do in simple k-means), we select the first one randomly, then find the points that are farthest from the first center (these points most probably do not belong to the first cluster, since they are far from it) and place the second cluster center near those far points.
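One nuance relative to the first answer: k-means++ does not take the single farthest point. It samples the next center with probability proportional to squared distance, so far points are favored but not guaranteed. The sketch below contrasts the two choices on made-up 2-D data; the function name `second_center_candidates` and the sample points are assumptions for illustration only.

```python
import numpy as np

def second_center_candidates(points, first, rng):
    """Return (farthest point, D^2-sampled point) relative to the first center."""
    points = np.asarray(points, dtype=float)
    # Squared Euclidean distance of every point to the first center.
    d2 = np.sum((points - first) ** 2, axis=1)
    farthest = points[np.argmax(d2)]                            # deterministic
    sampled = points[rng.choice(len(points), p=d2 / d2.sum())]  # k-means++ style
    return farthest, sampled

points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
rng = np.random.default_rng(0)
farthest, sampled = second_center_candidates(points, points[0], rng)
print(farthest, sampled)
```

Always taking the farthest point would make the seeding sensitive to outliers; the random draw softens that.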

耶瑟儿～

2020-12-04 09:29

I have prepared a full source implementation of k-means++ based on the book "Programming Collective Intelligence" by Toby Segaran and the k-means++ initialization provided here.

Note that there are two distance functions here. For the initial centroids, a standard one based on numpy.inner (squared Euclidean distance) is used, and then the Pearson one is used when refining the centroids. Maybe the Pearson one can also be used for the initial centroids. They say it is better.

```
from math import sqrt

def readfile(filename):
    rownames = []
    data = []
    with open(filename) as f:
        for line in f:
            p = line.strip().split(' ')  # single space as separator
            # First column in each row is the rowname
            rownames.append(p[0])
            # The data for this row is the remainder of the row
            data.append([float(x) for x in p[1:]])
    return rownames, data

def pearson(v1, v2):
    # Returns 1 - r, so a smaller value means the vectors are more similar
    n = len(v1)

    # Simple sums
    sum1 = sum(v1)
    sum2 = sum(v2)

    # Sums of the squares
    sum1Sq = sum(v ** 2 for v in v1)
    sum2Sq = sum(v ** 2 for v in v2)

    # Sum of the products
    pSum = sum(a * b for a, b in zip(v1, v2))

    # Calculate r (Pearson score)
    num = pSum - (sum1 * sum2 / n)
    den = sqrt((sum1Sq - sum1 ** 2 / n) * (sum2Sq - sum2 ** 2 / n))
    if den == 0:
        return 0

    return 1.0 - num / den
```

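```
import numpy as np

def initialize(X, K):
    # k-means++ seeding with squared Euclidean distance (numpy.inner)
    X = np.asarray(X, dtype=float)
    C = [X[0]]
    for _ in range(1, K):
        D2 = np.array([min(np.inner(c - x, c - x) for c in C) for x in X])
        probs = D2 / D2.sum()
        cumprobs = probs.cumsum()
        r = np.random.rand()
        i = -1  # falls back to the last point if rounding keeps r >= cumprobs[-1]
        for j, p in enumerate(cumprobs):
            if r < p:
                i = j
                break
        C.append(X[i])
    return C

# NOTE: the text between "if r <" above and the "> 0" test below was lost in
# the source. The end of initialize() follows the first answer; the start of
# kcluster() is a reconstruction (an assumption) modeled on the k-means loop in
# Segaran's book, including the default distance and the 100-iteration cap.
def kcluster(rows, distance=pearson, k=4):
    clusters = initialize(rows, k)  # k-means++ seeds instead of random centroids
    lastmatches = None
    for _ in range(100):
        # Assign each row to its closest centroid
        bestmatches = [[] for _ in range(k)]
        for j, row in enumerate(rows):
            dists = [distance(clusters[i], row) for i in range(k)]
            bestmatches[dists.index(min(dists))].append(j)

        # Stop when the assignment no longer changes
        if bestmatches == lastmatches:
            break
        lastmatches = bestmatches

        # Move the centroids to the average of their members
        for i in range(k):
            avgs = [0.0] * len(rows[0])
            if len(bestmatches[i]) > 0:
                for rowid in bestmatches[i]:
                    for m in range(len(rows[rowid])):
                        avgs[m] += rows[rowid][m]
                for j in range(len(avgs)):
                    avgs[j] /= len(bestmatches[i])
                clusters[i] = avgs

    return bestmatches
```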

```
rows, data = readfile('/home/toncho/Desktop/data.txt')

kclust = kcluster(data, k=4)

print("Result:")
for c in kclust:
    # Print the row names of each cluster, separated by single spaces
    print("[" + " ".join(rows[r] for r in c) + "]")

print('done')
```
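The remark above about using the Pearson distance for the initial centroids could be tried with a seeding routine that takes the distance as a parameter. This is only a sketch under assumptions: `initialize_with` and `pearson_distance` are hypothetical names, the weights are the squared Pearson distance by analogy with the squared Euclidean case, and the rows are made-up data. Whether this seeding is actually better is not tested here.

```python
import numpy as np

def pearson_distance(v1, v2):
    # 1 - Pearson correlation, computed with numpy.corrcoef.
    # Assumes neither vector is constant (otherwise the correlation is undefined).
    return 1.0 - np.corrcoef(v1, v2)[0, 1]

def initialize_with(X, K, distance, rng=None):
    """k-means++-style seeding where `distance` replaces Euclidean distance."""
    rng = np.random.default_rng() if rng is None else rng
    centers = [X[0]]  # fixed first center, as in the code above
    for _ in range(1, K):
        # Weight each row by its squared distance to the nearest chosen center.
        w = np.array([min(distance(c, x) for c in centers) ** 2 for x in X])
        centers.append(X[rng.choice(len(X), p=w / w.sum())])
    return centers

# Hypothetical rows: two rising profiles and two falling ones.
rows = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.5], [3.0, 2.0, 1.0], [6.0, 4.0, 1.5]]
centers = initialize_with(rows, 2, pearson_distance, np.random.default_rng(0))
print(centers)
```

With these rows the second center is very likely to be one of the falling profiles, since they are almost perfectly anti-correlated with the first row.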

data.txt:
