How Could One Implement the K-Means++ Algorithm?

末鹿安然 提交于 2020-01-19 06:33:12

问题


I am having trouble fully understanding the K-Means++ algorithm. I am interested exactly how the first k centroids are picked, namely the initialization as the rest is like in the original K-Means algorithm.

  1. Is the probability function used based on distance or Gaussian?
  2. In the same time the most long distant point (From the other centroids) is picked for a new centroid.

I will appreciate a step by step explanation and an example. The one in Wikipedia is not clear enough. Also a very well commented source code would also help. If you are using 6 arrays then please tell us which one is for what.


回答1:


Interesting question. Thank you for bringing this paper to my attention - K-Means++: The Advantages of Careful Seeding

In simple terms, cluster centers are initially chosen at random from the set of input observation vectors, where the probability of choosing vector x is high if x is not near any previously chosen centers.

Here is a one-dimensional example. Our observations are [0, 1, 2, 3, 4]. Let the first center, c1, be 0. The probability that the next cluster center, c2, is x is proportional to ||c1-x||^2. So, P(c2 = 1) = 1a, P(c2 = 2) = 4a, P(c2 = 3) = 9a, P(c2 = 4) = 16a, where a = 1/(1+4+9+16).

Suppose c2=4. Then, P(c3 = 1) = 1a, P(c3 = 2) = 4a, P(c3 = 3) = 1a, where a = 1/(1+4+1).

I've coded the initialization procedure in Python; I don't know if this helps you.

def initialize(X, K):
    C = [X[0]]
    for k in range(1, K):
        D2 = scipy.array([min([scipy.inner(c-x,c-x) for c in C]) for x in X])
        probs = D2/D2.sum()
        cumprobs = probs.cumsum()
        r = scipy.rand()
        for j,p in enumerate(cumprobs):
            if r < p:
                i = j
                break
        C.append(X[i])
    return C

EDIT with clarification: The output of cumsum gives us boundaries to partition the interval [0,1]. These partitions have length equal to the probability of the corresponding point being chosen as a center. So then, since r is uniformly chosen between [0,1], it will fall into exactly one of these intervals (because of break). The for loop checks to see which partition r is in.

Example:

probs = [0.1, 0.2, 0.3, 0.4]
cumprobs = [0.1, 0.3, 0.6, 1.0]
if r < cumprobs[0]:
    # this event has probability 0.1
    i = 0
elif r < cumprobs[1]:
    # this event has probability 0.2
    i = 1
elif r < cumprobs[2]:
    # this event has probability 0.3
    i = 2
elif r < cumprobs[3]:
    # this event has probability 0.4
    i = 3



回答2:


One Liner.

Say we need to select 2 cluster centers, instead of selecting them all randomly{like we do in simple k means}, we will select the first one randomly, then find the points that are farthest to the first center{These points most probably do not belong to the first cluster center as they are far from it} and assign the second cluster center nearby those far points.




回答3:


I have prepared a full source implementation of k-means++ based on the book "Collective Intelligence" by Toby Segaran and the k-menas++ initialization provided here.

Indeed there are two distance functions here. For the initial centroids a standard one is used based numpy.inner and then for the centroids fixation the Pearson one is used. Maybe the Pearson one can be also be used for the initial centroids. They say it is better.

from __future__ import division

def readfile(filename):
  lines=[line for line in file(filename)]
  rownames=[]
  data=[]
  for line in lines:
    p=line.strip().split(' ') #single space as separator
    #print p
    # First column in each row is the rowname
    rownames.append(p[0])
    # The data for this row is the remainder of the row
    data.append([float(x) for x in p[1:]])
    #print [float(x) for x in p[1:]]
  return rownames,data

from math import sqrt

def pearson(v1,v2):
  # Simple sums
  sum1=sum(v1)
  sum2=sum(v2)

  # Sums of the squares
  sum1Sq=sum([pow(v,2) for v in v1])
  sum2Sq=sum([pow(v,2) for v in v2])    

  # Sum of the products
  pSum=sum([v1[i]*v2[i] for i in range(len(v1))])

  # Calculate r (Pearson score)
  num=pSum-(sum1*sum2/len(v1))
  den=sqrt((sum1Sq-pow(sum1,2)/len(v1))*(sum2Sq-pow(sum2,2)/len(v1)))
  if den==0: return 0

  return 1.0-num/den

import numpy
from numpy.random import *

def initialize(X, K):
    C = [X[0]]
    for _ in range(1, K):
        #D2 = numpy.array([min([numpy.inner(c-x,c-x) for c in C]) for x in X])
        D2 = numpy.array([min([numpy.inner(numpy.array(c)-numpy.array(x),numpy.array(c)-numpy.array(x)) for c in C]) for x in X])
        probs = D2/D2.sum()
        cumprobs = probs.cumsum()
        #print "cumprobs=",cumprobs
        r = rand()
        #print "r=",r
        i=-1
        for j,p in enumerate(cumprobs):
            if r 0:
        for rowid in bestmatches[i]:
          for m in range(len(rows[rowid])):
            avgs[m]+=rows[rowid][m]
        for j in range(len(avgs)):
          avgs[j]/=len(bestmatches[i])
        clusters[i]=avgs

  return bestmatches

rows,data=readfile('/home/toncho/Desktop/data.txt')

kclust = kcluster(data,k=4)

print "Result:"
for c in kclust:
    out = ""
    for r in c:
        out+=rows[r] +' '
    print "["+out[:-1]+"]"

print 'done'

data.txt:

p1 1 5 6
p2 9 4 3
p3 2 3 1
p4 4 5 6
p5 7 8 9
p6 4 5 4
p7 2 5 6
p8 3 4 5
p9 6 7 8



来源:https://stackoverflow.com/questions/5466323/how-could-one-implement-the-k-means-algorithm

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!