Refitting clusters around fixed centroids

Submitted by 雨燕双飞 on 2020-02-02 19:35:30

Question


Clustering/classification problem: I used k-means clustering to generate the clusters and centroids shown below.

This is the dataset with the added cluster attribute from the initial run:

  > dput(sampledata)
    structure(list(Player = structure(1:5, .Label = c("A", "B", "C", 
    "D", "E"), class = "factor"), Metric.1 = c(0.3938961, 0.28062338, 
    0.32532626, 0.29239642, 0.25622558), Metric.2 = c(0.00763359, 
    0.01172354, 0.40550867, 0.04026846, 0.05976367), Metric.3 = c(0.50766075, 
    0.20345662, 0.06267444, 0.08661417, 0.17588925), cluster = c(1L, 
    2L, 3L, 2L, 2L)), .Names = c("Player", "Metric.1", "Metric.2", 
    "Metric.3", "cluster"), row.names = c(NA, -5L), class = "data.frame")

These are the cluster details computed from the three metrics:

> dput (scluster)
structure(list(cluster = c(1L, 2L, 3L, 2L, 2L), centers = structure(c(0.3938961, 
0.276415126666667, 0.32532626, 0.00763359, 0.03725189, 0.40550867, 
0.50766075, 0.155320013333333, 0.06267444), .Dim = c(3L, 3L), .Dimnames = list(
    c("1", "2", "3"), c("Metric.1", "Metric.2", "Metric.3"))), 
    totss = 0.252759813332907, withinss = c(0, 0.00930902482096013, 
    0), tot.withinss = 0.00930902482096013, betweenss = 0.243450788511947, 
    size = c(1L, 3L, 1L), iter = 1L, ifault = 0L), .Names = c("cluster", 
"centers", "totss", "withinss", "tot.withinss", "betweenss", 
"size", "iter", "ifault"), class = "kmeans")

Data with cluster attribute and centroids

I want to fix these centroids after the first clustering run so that they can serve as fixed references in the future. That would let me see how players move from one cluster to another as their metrics change, and so track their progress or regression.

Specifically, if player A's metrics change so that the player now resembles cluster 2 rather than cluster 1, based on the Euclidean distance to the respective fixed centroids, I should see player A move to cluster 2. In other words, the data points would be refitted around the fixed centroids obtained from the first run.

An answer should also help other users learn how to approach this kind of data mining problem. Any pointers would be greatly appreciated! Thank you.


Answer 1:


Here you go:

# install a couple of packages needed for the example
library(devtools)
devtools::install_github("alexwhitworth/emclustr")
devtools::install_github("alexwhitworth/imputation")
library(emclustr)
library(imputation)

# generate some example data -- 30 points in 3 2-dimensional clusters
# clusters are MVN
set.seed(123)
x <- rbind(gen_clust(10, 2, c(-5,5), c(1,1)),
           gen_clust(10, 2, c(0,0), c(1,1)),
           gen_clust(10, 2, c(5,5), c(1,1)))

# get initial centroids
km <- kmeans(x, centers= 3)$centers

# generate a new set of example data, in this case a "subsequent step"
# from your time-series
x2 <- rbind(gen_clust(10, 2, c(-4,-4), c(1,1)),
           gen_clust(10, 2, c(1,1), c(1,1)),
           gen_clust(10, 2, c(4,4), c(1,1)))

# calculate the Euclidean distance of each point to each centroid
# and evaluate nearest distance
d_km <- as.data.frame(cbind(dist_q.matrix(x= rbind(km[1,], x2), ref= 1L, q=2),
              dist_q.matrix(x= rbind(km[2,], x2), ref= 1L, q=2),
              dist_q.matrix(x= rbind(km[3,], x2), ref= 1L, q=2)))
names(d_km) <- c("dist_centroid1", "dist_centroid2", "dist_centroid3")
d_km$clust <- apply(d_km, 1, which.min)

# plot the centroids and the new points "x2" to show the results
plot(km, pch= 11, xlim= c(-6,6), ylim= c(-6,6))
points(x2, col= factor(d_km$clust))
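
For readers who would rather not install packages from GitHub, the same nearest-centroid assignment can be sketched in base R using the centroids from the question. This is only an illustration under stated assumptions: the function name assign_to_centroids is hypothetical, and the "updated" metrics for player A are made-up values chosen to show a player changing cluster.

```r
# Fixed centroids copied from the question's first k-means run
# (rows = clusters 1..3, columns = the three metrics; filled column-wise).
centers <- matrix(
  c(0.3938961,  0.276415126666667, 0.32532626,
    0.00763359, 0.03725189,        0.40550867,
    0.50766075, 0.155320013333333, 0.06267444),
  nrow = 3,
  dimnames = list(c("1", "2", "3"), c("Metric.1", "Metric.2", "Metric.3"))
)

# Hypothetical helper: assign each row of `newdata` to the nearest fixed
# centroid by (squared) Euclidean distance. The centroids are never updated.
assign_to_centroids <- function(newdata, centers) {
  newdata <- as.matrix(newdata)
  # One column of squared distances per centroid
  d2 <- sapply(seq_len(nrow(centers)), function(k) {
    rowSums(sweep(newdata, 2, centers[k, ])^2)
  })
  # Keep an n x K matrix even when newdata has a single row
  d2 <- matrix(d2, nrow = nrow(newdata))
  # Column index of the smallest distance in each row
  max.col(-d2, ties.method = "first")
}

# Baseline metrics from the question's sample data
baseline <- data.frame(
  Player   = c("A", "B", "C", "D", "E"),
  Metric.1 = c(0.3938961, 0.28062338, 0.32532626, 0.29239642, 0.25622558),
  Metric.2 = c(0.00763359, 0.01172354, 0.40550867, 0.04026846, 0.05976367),
  Metric.3 = c(0.50766075, 0.20345662, 0.06267444, 0.08661417, 0.17588925)
)
metrics <- c("Metric.1", "Metric.2", "Metric.3")
baseline$cluster <- assign_to_centroids(baseline[, metrics], centers)

# Made-up later measurement: player A's metrics drift towards centroid 2
updated <- baseline
updated[updated$Player == "A", metrics] <- c(0.28, 0.03, 0.16)
updated$cluster <- assign_to_centroids(updated[, metrics], centers)

# Side-by-side view of each player's cluster before and after
print(data.frame(Player = baseline$Player,
                 before = baseline$cluster,
                 after  = updated$cluster))
```

Because only the ordering of distances matters, the sketch compares squared distances and skips the square root. Re-running assign_to_centroids on each new batch of metrics against the same centers matrix gives the cluster membership over time.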



Source: https://stackoverflow.com/questions/33399000/refitting-clusters-around-fixed-centroids
