How to compute distances between centroids and data matrix (for kmeans algorithm)

一向 2021-02-02 02:08

I am a student of clustering and R. In order to obtain a better grip of both I would like to compute the distance between centroids and my xy-matrix for each iteration till it \

2 Answers
  •  遇见更好的自我
    2021-02-02 02:48

    Your main question seems to be how to calculate distances between a data matrix and some set of points ("centers").

    For this you can write a function that takes a data matrix and your set of points as input and returns the distance from each row (point) of the data matrix to every one of the "centers".

    Here is such a function:

    myEuclid <- function(points1, points2) {
        # One row per data point, one column per center
        distanceMatrix <- matrix(NA, nrow=nrow(points1), ncol=nrow(points2))
        for(i in seq_len(nrow(points2))) {
            # Euclidean distance from every point to center i
            distanceMatrix[,i] <- sqrt(rowSums(t(t(points1)-points2[i,])^2))
        }
        distanceMatrix
    }
    

    points1 is the data matrix with points as rows and dimensions as columns. points2 is the matrix of centers (points as rows again). The first line of code just defines the answer matrix, which has as many rows as the data matrix and as many columns as there are centers. So entry (i, j) of the result matrix is the distance from the ith point to the jth center.

    Then the for loop iterates over all centers. For each center it computes the Euclidean distance from every point to that center and stores the result in the corresponding column. This line here: sqrt(rowSums(t(t(points1)-points2[i,])^2)) is the Euclidean distance. Inspect it more closely and look up the formula if you have any trouble with it. (The transposes are there mainly to make sure the subtraction is done row-wise.)
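
    To see why the transposes matter, here is a minimal sketch on made-up data (the names pts, center, naive, viaTranspose and viaSweep are illustrative and not part of the answer's code). R recycles a vector down the columns of a matrix, so a plain subtraction mixes up the coordinates. Transposing first lines the center up with each point, and sweep() should give the same result without the double transpose.

    ```r
    # Three made-up 2-D points (rows) and one made-up center
    pts <- matrix(c(1, 10,
                    2, 20,
                    3, 30), ncol = 2, byrow = TRUE)
    center <- c(1, 10)

    # Plain subtraction recycles `center` down the columns,
    # so x and y coordinates get mixed up
    naive <- pts - center

    # Transposing first makes the recycling line up with each point,
    # then we transpose back to get points as rows again
    viaTranspose <- t(t(pts) - center)

    # sweep() subtracts `center` from every row directly
    viaSweep <- sweep(pts, 2, center)

    print(naive)
    print(viaTranspose)
    print(all.equal(viaTranspose, viaSweep))

    # Euclidean distance of every point to the center
    print(sqrt(rowSums(viaTranspose^2)))
    ```

    If you find the double transpose hard to read, sweep(points1, 2, points2[i,]) could be used inside myEuclid instead.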

    Now you can also implement the k-means algorithm:

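    myKmeans <- function(x, centers, distFun, nItter=10) {
        # Per-iteration history of cluster assignments and centers
        clusterHistory <- vector(nItter, mode="list")
        centerHistory <- vector(nItter, mode="list")
    
        for(i in seq_len(nItter)) {
            # Assignment step: each point goes to its closest center
            distsToCenters <- distFun(x, centers)
            clusters <- apply(distsToCenters, 1, which.min)
            # Update step: new center = mean of the points assigned to it
            # (a center that receives no points drops out of the result)
            centers <- apply(x, 2, tapply, clusters, mean)
            # Saving history
            clusterHistory[[i]] <- clusters
            centerHistory[[i]] <- centers
        }
    
        list(clusters=clusterHistory, centers=centerHistory)
    }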
    

    As you can see, it's also a very simple function: it takes the data matrix, the initial centers, your distance function (the one defined above) and the number of iterations you want.

    The clusters are defined by assigning each point to its closest center, and the centers are then updated to the mean of the points assigned to them. That is the basic k-means algorithm.
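
    As a rough illustration of those two steps, here is a single iteration worked through on a tiny made-up data set (x, centers, dists, clusters and newCenters are names chosen for this sketch only). It uses the same apply/tapply idiom as myKmeans, but computes the distances inline so that it runs on its own.

    ```r
    # Four made-up 2-D points forming two obvious groups
    x <- matrix(c( 0,  0,
                   0,  1,
                  10, 10,
                  10, 11), ncol = 2, byrow = TRUE)

    # Two made-up starting centers, one near each group
    centers <- matrix(c(1, 1,
                        9, 9), ncol = 2, byrow = TRUE)

    # Distances: one row per point, one column per center
    dists <- sapply(seq_len(nrow(centers)), function(j) {
        sqrt(rowSums(sweep(x, 2, centers[j, ])^2))
    })

    # Assignment step: index of the nearest center for each point
    clusters <- apply(dists, 1, which.min)

    # Update step: each new center is the column-wise mean of its points
    newCenters <- apply(x, 2, tapply, clusters, mean)

    print(clusters)    # 1 1 2 2
    print(newCenters)  # centers move to (0, 0.5) and (10, 10.5)
    ```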

    Let's try it out. Define some random points (in 2D, so the number of columns is 2):

    mat <- matrix(rnorm(100), ncol=2)
    

    Assign 5 random points from that matrix as initial centers:

    centers <- mat[sample(nrow(mat), 5),]
    

    Now run the algorithm:

    theResult <- myKmeans(mat, centers, myEuclid, 10)
    

    Here are the centers in the 10th iteration:

    theResult$centers[[10]]
            [,1]        [,2]
    1 -0.1343239  1.27925285
    2 -0.8004432 -0.77838017
    3  0.1956119 -0.19193849
    4  0.3886721 -1.80298698
    5  1.3640693 -0.04091114
    

    Compare that with R's built-in kmeans function:

    theResult2 <- kmeans(mat, centers, 10, algorithm="Forgy")
    
    theResult2$centers
            [,1]        [,2]
    1 -0.1343239  1.27925285
    2 -0.8004432 -0.77838017
    3  0.1956119 -0.19193849
    4  0.3886721 -1.80298698
    5  1.3640693 -0.04091114
    

    Works fine. Our function, however, also tracks the iterations. We can plot the progress over the first 4 iterations like this:

    par(mfrow=c(2,2))
    for(i in 1:4) {
        plot(mat, col=theResult$clusters[[i]], main=paste("iteration:", i), xlab="x", ylab="y")
        points(theResult$centers[[i]], cex=3, pch=19, col=1:nrow(theResult$centers[[i]]))
    }
    

    [Figure: k-means clusters and centers over the first 4 iterations]

    Nice.
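
    Because the assignments are stored per iteration, the history can also be used to check when the clustering stopped changing. The sketch below is only an illustration: clusterHistory is a made-up list with the same shape as the clusters element returned by myKmeans, and unchanged and firstStable are names invented here.

    ```r
    # Made-up history in the same shape as myKmeans(...)$clusters:
    # one vector of cluster labels per iteration
    clusterHistory <- list(
        c(1, 2, 2, 1),
        c(1, 1, 2, 2),
        c(1, 1, 2, 2),
        c(1, 1, 2, 2)
    )

    # Element i is TRUE when iteration i + 1 repeats the assignment of iteration i
    unchanged <- mapply(identical,
                        clusterHistory[-1],
                        clusterHistory[-length(clusterHistory)])

    # First iteration whose assignment equals the previous one
    firstStable <- which(unchanged)[1] + 1
    print(firstStable)  # 3
    ```

    With theResult from above you would pass theResult$clusters instead of the made-up list. Once the assignments repeat, the centers stop moving as well, so later iterations add nothing.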

    However, this simple design allows for much more. For example, if we want to use another kind of distance (not Euclidean), we can plug in any function that takes the data and the centers as inputs. Here is one for correlation distances:

    myCor <- function(points1, points2) {
        return(1 - ((cor(t(points1), t(points2))+1)/2))
    }
    

    We can then run k-means based on those:

    theResult <- myKmeans(mat, centers, myCor, 10)
    

    The resulting picture for 4 iterations then looks like this:

    [Figure: clusters and centers over 4 iterations using the correlation distance]

    Even though we specified 5 clusters, only 2 were left at the end. That is because in 2 dimensions the correlation between two points can take only two values: either +1 or -1. So when the clusters are assigned, each point is at the same distance from several centers, and since every point has to go to exactly one center, the first of them gets chosen.
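
    Both effects can be checked on a few made-up 2-D points (p, q and r are illustrative names). With only two coordinates per point, the correlation just records whether the two points change in the same direction from x to y. On ties, which.min returns the first minimum.

    ```r
    # Made-up 2-D points, each treated as a vector of length 2
    p <- c( 0.3,  1.2)   # y larger than x
    q <- c(-2.0,  0.5)   # y larger than x
    r <- c( 1.0, -0.4)   # y smaller than x

    print(cor(p, q))   # 1  (same direction)
    print(cor(p, r))   # -1 (opposite direction)

    # The correlation distance from myCor is therefore either 0 or 1
    print(1 - ((cor(p, q) + 1) / 2))
    print(1 - ((cor(p, r) + 1) / 2))

    # On ties which.min picks the first minimum
    print(which.min(c(0, 1, 0, 0)))  # 1
    ```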

    Anyway, this is now getting out of scope. The bottom line is that there are many possible distance metrics, and one simple function lets you use any distance you want and track the results over the iterations.
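
    As one more illustration of the plug-in idea, here is a sketch of a Manhattan (city-block) distance with the same interface as myEuclid. The name myManhattan and the sample data are made up for this example. Note that myKmeans still updates the centers with the mean, so this is k-means with a different assignment rule rather than a proper k-medians.

    ```r
    # Hypothetical drop-in distance function: same inputs and output shape as myEuclid
    myManhattan <- function(points1, points2) {
        distanceMatrix <- matrix(NA_real_, nrow = nrow(points1), ncol = nrow(points2))
        for (i in seq_len(nrow(points2))) {
            # Sum of absolute coordinate differences to center i
            distanceMatrix[, i] <- rowSums(abs(sweep(points1, 2, points2[i, ])))
        }
        distanceMatrix
    }

    # Made-up points and centers to check the shape and values
    pts <- matrix(c(0, 0,
                    3, 4), ncol = 2, byrow = TRUE)
    ctr <- matrix(c(0, 0,
                    1, 1), ncol = 2, byrow = TRUE)

    d <- myManhattan(pts, ctr)
    print(d)  # rows: (0, 2) and (7, 5)
    ```

    It would then be passed in like the other distance functions, for example myKmeans(mat, centers, myManhattan, 10).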
