Refitting clusters around fixed centroids

Question

This is a clustering/classification problem. I used k-means clustering to generate the clusters and centroids shown below.

This is the dataset with the added cluster attribute from the initial run:

  > dput(sampledata)
    structure(list(Player = structure(1:5, .Label = c("A", "B", "C", 
    "D", "E"), class = "factor"), Metric.1 = c(0.3938961, 0.28062338, 
    0.32532626, 0.29239642, 0.25622558), Metric.2 = c(0.00763359, 
    0.01172354, 0.40550867, 0.04026846, 0.05976367), Metric.3 = c(0.50766075, 
    0.20345662, 0.06267444, 0.08661417, 0.17588925), cluster = c(1L, 
    2L, 3L, 2L, 2L)), .Names = c("Player", "Metric.1", "Metric.2", 
    "Metric.3", "cluster"), row.names = c(NA, -5L), class = "data.frame")

These are the cluster details from the k-means run on the 3 metrics:

> dput(scluster)
structure(list(cluster = c(1L, 2L, 3L, 2L, 2L), centers = structure(c(0.3938961, 
0.276415126666667, 0.32532626, 0.00763359, 0.03725189, 0.40550867, 
0.50766075, 0.155320013333333, 0.06267444), .Dim = c(3L, 3L), .Dimnames = list(
    c("1", "2", "3"), c("Metric.1", "Metric.2", "Metric.3"))), 
    totss = 0.252759813332907, withinss = c(0, 0.00930902482096013, 
    0), tot.withinss = 0.00930902482096013, betweenss = 0.243450788511947, 
    size = c(1L, 3L, 1L), iter = 1L, ifault = 0L), .Names = c("cluster", 
"centers", "totss", "withinss", "tot.withinss", "betweenss", 
"size", "iter", "ifault"), class = "kmeans")

[Image: data with cluster attribute and centroids]

I want to fix the centroid of each cluster after this first run, so that the centroids serve as fixed references in the future. As the players' metrics change, I could then see them move from one cluster to another and so track their progress or regress.

Specifically, if player A's metrics change so that A now resembles cluster 2 rather than cluster 1, judged by Euclidean distance to the fixed centroids, I should see player A move to cluster 2. In other words, the data points would be refitted around the fixed centroids obtained from the first run.

An answer should also help other users approach this kind of data mining problem. Any pointers would be greatly appreciated! Thank you.

Answer 1:

Here you go:

# install a couple of packages needed for the example
library(devtools)
devtools::install_github("alexwhitworth/emclustr")
devtools::install_github("alexwhitworth/imputation")
library(emclustr)
library(imputation)

# generate some example data -- 30 points in 3 2-dimensional clusters
# clusters are MVN
set.seed(123)
x <- rbind(gen_clust(10, 2, c(-5,5), c(1,1)),
           gen_clust(10, 2, c(0,0), c(1,1)),
           gen_clust(10, 2, c(5,5), c(1,1)))

# get initial centroids; only the centroid matrix is kept,
# and it is treated as fixed from here on
km <- kmeans(x, centers= 3)$centers

# generate a new set of example data, in this case a "subsequent step"
# from your time-series
x2 <- rbind(gen_clust(10, 2, c(-4,-4), c(1,1)),
           gen_clust(10, 2, c(1,1), c(1,1)),
           gen_clust(10, 2, c(4,4), c(1,1)))

# calculate the Euclidean distance of each point to each centroid
# and evaluate nearest distance
d_km <- as.data.frame(cbind(dist_q.matrix(x= rbind(km[1,], x2), ref= 1L, q=2),
              dist_q.matrix(x= rbind(km[2,], x2), ref= 1L, q=2),
              dist_q.matrix(x= rbind(km[3,], x2), ref= 1L, q=2)))
names(d_km) <- c("dist_centroid1", "dist_centroid2", "dist_centroid3")
d_km$clust <- apply(d_km, 1, which.min)

# plot the centroids and the new points "x2" to show the results
plot(km, pch= 11, xlim= c(-6,6), ylim= c(-6,6))
points(x2, col= factor(d_km$clust))
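
If you would rather not install the GitHub packages, the same nearest-centroid assignment can be sketched in base R and applied directly to the data from the question. This is only a minimal sketch: `assign_to_centroids` is a hypothetical helper name, not a function from any package, and the updated metrics for player A are made-up values used purely to illustrate a move between clusters.

```r
# Data from the question (first run), including the original cluster labels
sampledata <- data.frame(
  Player   = c("A", "B", "C", "D", "E"),
  Metric.1 = c(0.3938961, 0.28062338, 0.32532626, 0.29239642, 0.25622558),
  Metric.2 = c(0.00763359, 0.01172354, 0.40550867, 0.04026846, 0.05976367),
  Metric.3 = c(0.50766075, 0.20345662, 0.06267444, 0.08661417, 0.17588925),
  cluster  = c(1L, 2L, 3L, 2L, 2L)
)

# Fixed centroids, copied from scluster$centers in the question
# (matrix() fills column by column, matching the dput output)
centers <- matrix(
  c(0.3938961,  0.276415126666667, 0.32532626,
    0.00763359, 0.03725189,        0.40550867,
    0.50766075, 0.155320013333333, 0.06267444),
  nrow = 3,
  dimnames = list(c("1", "2", "3"), c("Metric.1", "Metric.2", "Metric.3"))
)

# Hypothetical helper: index of the nearest fixed centroid for each row
assign_to_centroids <- function(newdata, centers) {
  m <- as.matrix(newdata[, colnames(centers), drop = FALSE])
  # squared Euclidean distance of every point to centroid k
  d2 <- sapply(seq_len(nrow(centers)), function(k) {
    rowSums(sweep(m, 2, centers[k, ])^2)
  })
  d2 <- matrix(d2, nrow = nrow(m))  # keep a matrix even for a single row
  as.integer(apply(d2, 1, which.min))
}

# Reassigning the first-run data reproduces the original labels
original_check <- assign_to_centroids(sampledata, centers)

# Made-up later measurement: player A's metrics drift towards cluster 2
updated <- sampledata
updated[updated$Player == "A", c("Metric.1", "Metric.2", "Metric.3")] <-
  c(0.28, 0.02, 0.18)
updated$new_cluster <- assign_to_centroids(updated, centers)

# Players whose cluster changed between the two measurements
moved <- updated[updated$cluster != updated$new_cluster,
                 c("Player", "cluster", "new_cluster")]
print(moved)
```

Because the centroids are never re-estimated, the labels stay comparable across measurements. Calling `kmeans()` again on the new data would move the centroids, which is what the question wants to avoid.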

Source: https://stackoverflow.com/questions/33399000/refitting-clusters-around-fixed-centroids

Tags

classification

cluster-analysis

data-mining