问题
Is there an easy way to calculate lowest value of h
in cut
that produces groupings of a given minimum size?
In this example, if I wanted clusters with at least ten members each, I should go with h = 3.80
:
# using iris data simply for reproducible example
data(iris)
d <- data.frame(scale(iris[,1:4]))
hc <- hclust(dist(d))
plot(hc)
cut(as.dendrogram(hc), h=3.79) # produces 5 groups; group 4 has 7 members
cut(as.dendrogram(hc), h=3.80) # produces 4 groups; no group has <10 members
Since the heights of the splits are given in hc$height
, I could create a set of candidate values using hc$height + 0.00001
and then loop through cuts at each of them. However, I don't see how to parse the cluster size members
out of the dendrogram
class. For example, cut(as.dendrogram(hc), h=3.80)$lower[[1]]$members
returns NULL
, not 66 as desired.
Please note that this is a simpler question than Cutting dendrogram into n trees with minimum cluster size in R which uses the package dynamicTreeCut
; here I am not specifying number of trees, just minimum cluster size. TYVM.
回答1:
Thanks to @Vlo and @lukeA I'm able to implement a loop. However, I am just posting this for a starting point and certainly open to a more elegant solution.
unnest <- function(x) { # from Vlo's answer
if(is.null(names(x))) x
else c(list(all=unname(unlist(x))), do.call(c, lapply(x, unnest)))
}
cuts <- hc$height + 1e-9
min_size <- 10
smallest <- 0
i <- 0
while(smallest < min_size & i <= length(cuts)){
h_i <- cuts[i <- i+1]
if(i > length(cuts)){
warning("Couldn't find a cluster big enough.")
}
else smallest <-
Reduce(min,
lapply(X = unnest(cut(as.dendrogram(hc), h=h_i)$lower),
FUN = attr, which = "members") ) # from lukeA's comment
}
h_i # returns desired output: [1] 3.79211
回答2:
This feature is available in the dendextend package with the heights_per_k.dendrogram
function (which also has a faster C++ implementation when loading the dendextendRcpp function).
## Not run:
hc <- hclust(dist(USArrests[1:4,]), "ave")
dend <- as.dendrogram(hc)
heights_per_k.dendrogram(dend)
## 1 2 3 4
##86.47086 68.84745 45.98871 28.36531
As a sidenote, the dendextend package has a cutree.dendrogram
S3 method for dendrograms (which works very similarly to cutree for hclust objects).
回答3:
This doesn't answer the question, but might be useful for members
extraction if you decide to loop through the h
.
Stealing and modifying some code from here
# Unnest the list/dendogram structure
unnest <- function(x) {
if(is.null(names(x))) {
x
}
else {
c(list(all=unname(unlist(x))), do.call(c, lapply(x, unnest)))
}
}
# Extract the `members` attribute from each dendogram
lapply(X = unnest(cut(as.dendrogram(hc), h=3.8)), FUN = attr, which = "members")
Output:
# Please don't ask me why there are 2 dendograms stored
# in the `$upper` list while `print` displays one
$upper1
[1] 2
$upper2
[1] 2
$lower1
[1] 66
$lower2
[1] 11
$lower3
[1] 24
$lower4
[1] 49
来源:https://stackoverflow.com/questions/31124810/r-cut-dendrogram-into-groups-with-minimum-size