Clustering - how to find the nearest to a cluster

Submitted by 喜夏-厌秋 on 2019-12-25 18:44:36

Question


The hints I got on a different question puzzled me quite a bit.

I got an exercise, actually part of a larger exercise:

  1. Cluster some data, using hclust (done)
  2. Given a totally new vector, find out to which of the clusters you got in 1 it is nearest.

According to the exercise, this should take only a short time.

However, after weeks I am puzzled whether this can be done at all, as apparently all I really get from hclust is a tree - and not, as I assumed, a number of clusters.

Since I suspect I was unclear, here is an example:

Say I feed hclust a matrix consisting of 15 1x5 vectors: 5 times (1 1 1 1 1), 5 times (2 2 2 2 2) and 5 times (3 3 3 3 3). This should give me three quite distinct clusters of size 5; anyone could easily do that by hand. Is there a command that lets me find out from the program that there are 3 such clusters in my hclust object, and what they contain?


Answer 1:


You'll have to think about what the right metric is to define closeness to the cluster. Building on the example in the hclust doc, here's a way to compute the means for each cluster and then measure the distance between the new data point and the set of means.

# Leave out one state
A  <- USArrests
B  <- A[rownames(A) != "Kentucky", ]
KY <- A[rownames(A) == "Kentucky", ]

# Put the B data into 10 clusters
hc   <- hclust(dist(B), "ave")
memb <- cutree(hc, k = 10)
B$cluster <- memb[rownames(B) == names(memb)]

# Compute the averages over the clusters
M <- aggregate(. ~ cluster, data = B, FUN = mean)
M$cluster <- NULL

# Now add the hold out state to the set of averages
M <- rbind(M, KY)

# Compute the distance between the clusters and the hold out state.
# This is a pretty silly way to do this but it works.
D <- as.matrix(dist(as.matrix(M), diag = TRUE, upper = TRUE))["Kentucky", ]
names(D) <- rownames(M)
KYclust <- which.min(D[-length(D)])
memb[memb == KYclust]

# Now cluster the full set of states and compare the results.
hc   <- hclust(dist(A), "ave")
memb <- cutree(hc, k = 10)
a <- memb[which(names(memb) == "Kentucky")]
memb[memb == a]
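
Applied to the toy example from the question, the same idea might be sketched as follows. This is only an illustration under assumptions: that cutting the tree into a fixed number of clusters with cutree is acceptable, and that Euclidean distance to the cluster means is the closeness metric you want. The row labels and the values in new_vec are made up for the example.

```r
# Toy data from the question: 15 row vectors of length 5
X <- rbind(matrix(1, 5, 5), matrix(2, 5, 5), matrix(3, 5, 5))
rownames(X) <- paste0("v", 1:15)      # hypothetical row labels

hc   <- hclust(dist(X), "average")
memb <- cutree(hc, k = 3)             # cut the tree into 3 flat clusters

table(memb)                           # size of each cluster
split(names(memb), memb)              # which rows each cluster contains

# Assign a new vector (hypothetical values) to the nearest cluster mean
new_vec   <- c(2.2, 2.1, 1.9, 2.0, 2.3)
centroids <- apply(X, 2, function(col) tapply(col, memb, mean))
d         <- sqrt(rowSums(sweep(centroids, 2, new_vec)^2))
nearest   <- as.integer(names(which.min(d)))
```

Here table and split answer the "how many clusters and what do they contain" part, while nearest holds the label of the cluster whose mean is closest to new_vec.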



Answer 2:


In contrast to k-means, clusters found by hclust can be of arbitrary shape.

The distance to the nearest cluster center therefore is not always meaningful.

A 1-nearest-neighbor style assignment is probably better: give the new vector the cluster label of the closest point that was already clustered.
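
A minimal sketch of such a 1-nearest-neighbor assignment, reusing the USArrests hold-out setup from Answer 1, could look like this. It assumes Euclidean distance (the default of dist) and k = 10 clusters as in that answer; the variable names are chosen for illustration only.

```r
# Hold out one state and cluster the rest (same setup as Answer 1)
A  <- USArrests
B  <- A[rownames(A) != "Kentucky", ]
KY <- A[rownames(A) == "Kentucky", ]

hc   <- hclust(dist(B), "average")
memb <- cutree(hc, k = 10)

# Euclidean distance from the held-out point to every clustered point
d <- sqrt(rowSums(sweep(as.matrix(B), 2, unlist(KY))^2))

nearest   <- names(which.min(d))      # closest already-clustered state
KYcluster <- memb[nearest]            # the new point inherits its label
```

Unlike the centroid approach, this does not assume that clusters are compact around their means, at the cost of computing one distance per training point.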



Source: https://stackoverflow.com/questions/18663044/clustering-how-to-find-the-nearest-to-a-cluster
