How Could One Implement the K-Means++ Algorithm?

無奈伤痛 2020-12-04 08:45

I am having trouble fully understanding the K-Means++ algorithm. I am interested in exactly how the first k centroids are picked, namely the initialization, as the rest of the algorithm proceeds like the original K-Means.

3 Answers
  • 2020-12-04 09:02

    Interesting question. Thank you for bringing this paper to my attention: "k-means++: The Advantages of Careful Seeding" (Arthur & Vassilvitskii, 2007).

    In simple terms, cluster centers are initially chosen at random from the set of input observation vectors, where the probability of choosing vector x is proportional to the squared distance between x and the nearest already-chosen center, so a vector far from every existing center is much more likely to be picked.

    Here is a one-dimensional example. Our observations are [0, 1, 2, 3, 4]. Let the first center, c1, be 0. The probability that the next cluster center, c2, is x is proportional to ||c1-x||^2. So, P(c2 = 1) = 1a, P(c2 = 2) = 4a, P(c2 = 3) = 9a, P(c2 = 4) = 16a, where a = 1/(1+4+9+16).

    Suppose c2=4. Then, P(c3 = 1) = 1a, P(c3 = 2) = 4a, P(c3 = 3) = 1a, where a = 1/(1+4+1).
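
    To make the arithmetic concrete, here is a quick NumPy check of those weights (nothing assumed beyond the numbers in the example above):

    import numpy as np

    X = np.array([0, 1, 2, 3, 4])
    c1 = 0
    D2 = (X - c1)**2                               # [0, 1, 4, 9, 16]
    print(D2 / D2.sum())                           # P(c2 = x): [0, 1/30, 4/30, 9/30, 16/30]

    c2 = 4
    D2 = np.minimum((X - c1)**2, (X - c2)**2)      # [0, 1, 4, 1, 0]
    print(D2 / D2.sum())                           # P(c3 = x): [0, 1/6, 4/6, 1/6, 0]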

    I've coded the initialization procedure in Python; I don't know if this helps you.

    import numpy as np

    def initialize(X, K):
        """Choose K initial centers from X using the k-means++ seeding rule."""
        C = [X[0]]  # take the first observation as the first center
        for k in range(1, K):
            # squared distance from each point to its nearest chosen center
            D2 = np.array([min(np.inner(c - x, c - x) for c in C) for x in X])
            probs = D2 / D2.sum()
            cumprobs = probs.cumsum()
            r = np.random.rand()
            # pick the point whose cumulative-probability interval contains r
            for j, p in enumerate(cumprobs):
                if r < p:
                    i = j
                    break
            C.append(X[i])
        return C
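
    For example, calling it on the toy data from the example above (the exact picks are random):

    centers = initialize([0, 1, 2, 3, 4], K=3)
    print(centers)  # e.g. [0, 4, 2]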
    

    EDIT with clarification: The output of cumsum gives us boundaries that partition the interval [0,1]. Each partition has length equal to the probability of the corresponding point being chosen as a center. So, since r is drawn uniformly from [0,1), it falls into exactly one of these intervals, and the for loop (which stops at the first match because of the break) determines which one.

    Example:

    probs = [0.1, 0.2, 0.3, 0.4]
    cumprobs = [0.1, 0.3, 0.6, 1.0]
    if r < cumprobs[0]:
        # this event has probability 0.1
        i = 0
    elif r < cumprobs[1]:
        # this event has probability 0.2
        i = 1
    elif r < cumprobs[2]:
        # this event has probability 0.3
        i = 2
    elif r < cumprobs[3]:
        # this event has probability 0.4
        i = 3
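
    Incidentally, the same interval lookup can be done with one binary-search call via numpy.searchsorted (a sketch using the probs from this example):

    import numpy as np

    cumprobs = np.array([0.1, 0.2, 0.3, 0.4]).cumsum()   # [0.1, 0.3, 0.6, 1.0]
    r = np.random.rand()
    i = np.searchsorted(cumprobs, r, side='right')       # first j with r < cumprobs[j]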
    
  • 2020-12-04 09:05

    One-liner.

    Say we need to select 2 cluster centers. Instead of selecting them all at random (as in simple k-means), we select the first one at random, then find the points that are farthest from the first center (these points most probably do not belong to the first cluster, since they are far from it) and pick the second cluster center from the neighborhood of those far points, as sketched below.
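
    A minimal sketch of that idea in NumPy (my own illustration; points is assumed to be an (n, d) array, and, as in the k-means++ paper, the selection weight is the squared distance to the nearest chosen center):

    import numpy as np

    def kpp_seeds(points, k, rng=np.random.default_rng()):
        # first center: chosen uniformly at random
        centers = [points[rng.integers(len(points))]]
        for _ in range(1, k):
            # squared distance of each point to its nearest chosen center
            d2 = np.min([((points - c)**2).sum(axis=1) for c in centers], axis=0)
            # far-away points get proportionally higher selection probability
            idx = rng.choice(len(points), p=d2 / d2.sum())
            centers.append(points[idx])
        return np.array(centers)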

  • 2020-12-04 09:29

    I have prepared a full source implementation of k-means++ based on the book "Programming Collective Intelligence" by Toby Segaran and the k-means++ initialization provided here.

    Indeed there are two distance functions here. For the initial centroids a standard squared-Euclidean one is used, based on numpy.inner, and then for the clustering iterations the Pearson one is used. Maybe the Pearson one could also be used for the initial centroids; they say it is better. A sketch of that variant follows the listing and data below.

    import numpy
    from math import sqrt

    def readfile(filename):
        rownames = []
        data = []
        for line in open(filename):
            p = line.strip().split(' ')  # single space as separator
            # First column in each row is the row name
            rownames.append(p[0])
            # The data for this row is the remainder of the row
            data.append([float(x) for x in p[1:]])
        return rownames, data

    def pearson(v1, v2):
        # Simple sums
        sum1 = sum(v1)
        sum2 = sum(v2)

        # Sums of the squares
        sum1Sq = sum(pow(v, 2) for v in v1)
        sum2Sq = sum(pow(v, 2) for v in v2)

        # Sum of the products
        pSum = sum(v1[i] * v2[i] for i in range(len(v1)))

        # Pearson correlation r; return 1-r so that 0 means perfectly correlated
        num = pSum - (sum1 * sum2 / len(v1))
        den = sqrt((sum1Sq - pow(sum1, 2) / len(v1)) * (sum2Sq - pow(sum2, 2) / len(v1)))
        if den == 0:
            return 0

        return 1.0 - num / den

    def initialize(X, K):
        # k-means++ seeding; squared Euclidean distance via numpy.inner
        C = [X[0]]
        for _ in range(1, K):
            D2 = numpy.array([min([numpy.inner(numpy.array(c) - numpy.array(x),
                                               numpy.array(c) - numpy.array(x))
                                   for c in C]) for x in X])
            probs = D2 / D2.sum()
            cumprobs = probs.cumsum()
            r = numpy.random.rand()
            # pick the point whose cumulative-probability interval contains r
            for j, p in enumerate(cumprobs):
                if r < p:
                    i = j
                    break
            C.append(X[i])
        return C

    def kcluster(rows, distance=pearson, k=4):
        # Seed the clusters with k-means++ instead of purely random points
        clusters = initialize(rows, k)

        lastmatches = None
        for t in range(100):
            print('Iteration %d' % t)
            bestmatches = [[] for i in range(k)]

            # Find which centroid is closest for each row
            for j in range(len(rows)):
                row = rows[j]
                bestmatch = 0
                for i in range(k):
                    d = distance(clusters[i], row)
                    if d < distance(clusters[bestmatch], row):
                        bestmatch = i
                bestmatches[bestmatch].append(j)

            # If the results are the same as last time, this is complete
            if bestmatches == lastmatches:
                break
            lastmatches = bestmatches

            # Move the centroids to the average of their members
            for i in range(k):
                avgs = [0.0] * len(rows[0])
                if len(bestmatches[i]) > 0:
                    for rowid in bestmatches[i]:
                        for m in range(len(rows[rowid])):
                            avgs[m] += rows[rowid][m]
                    for j in range(len(avgs)):
                        avgs[j] /= len(bestmatches[i])
                    clusters[i] = avgs

        return bestmatches

    rownames, data = readfile('/home/toncho/Desktop/data.txt')

    kclust = kcluster(data, k=4)

    print("Result:")
    for c in kclust:
        out = ""
        for r in c:
            out += rownames[r] + ' '
        print("[" + out[:-1] + "]")

    print('done')
    

    data.txt:

    p1 1 5 6
    p2 9 4 3
    p3 2 3 1
    p4 4 5 6
    p5 7 8 9
    p6 4 5 4
    p7 2 5 6
    p8 3 4 5
    p9 6 7 8
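
    And a minimal sketch of seeding with the Pearson distance instead of the squared-Euclidean one (a hypothetical variant, not from the book; it reuses the pearson function and numpy import from the listing above):

    def initialize_with(X, K, distance=pearson):
        C = [X[0]]
        for _ in range(1, K):
            # weight each point by its squared distance to the nearest center
            D2 = numpy.array([min([distance(c, x)**2 for c in C]) for x in X])
            probs = D2 / D2.sum()
            # binary search replaces the explicit cumulative-probability loop
            i = numpy.searchsorted(probs.cumsum(), numpy.random.rand(), side='right')
            C.append(X[i])
        return C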
    
