I have to perform a cluster analysis on a large amount of data. Since I have a lot of missing values, I made a correlation matrix.
corloads = cor(df1[,2:185],
I'm happy to learn about the kgs function. Another option is using the find_k function from the dendextend package (it uses the average silhouette width). But given the kgs function, I might just add it as another option to the package. Also note the dendextend::color_branches function, to color your dendrogram with the number of clusters you end up choosing (you can see more about this here: https://cran.r-project.org/web/packages/dendextend/vignettes/introduction.html#setting-a-dendrograms-branches )
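For illustration, here is a minimal sketch of the find_k / color_branches route mentioned above. It uses the built-in USArrests data as a stand-in (an assumption, not the asker's data), and find_k relies on the fpc package being installed.

```r
library(dendextend)  # find_k() and color_branches(); find_k() needs fpc installed

# Stand-in data: hierarchical clustering of the scaled USArrests data set
dend <- as.dendrogram(hclust(dist(scale(USArrests))))

# Estimate k by average silhouette width over a candidate range
best <- find_k(dend, krange = 2:10)
best$k

# Colour the branches according to the chosen number of clusters
dend_col <- color_branches(dend, k = best$k)
plot(dend_col)
```

The range 2:10 is arbitrary here; widen or narrow it to suit the data.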
Several methods are available to determine the "optimal number of clusters", although it is a controversial topic.
The kgs function is helpful to get the optimal number of clusters.
Following your code one would do:
library(maptree)  # provides kgs()
clus <- hclust(distance)
op_k <- kgs(clus, distance, maxclus = 20)
plot(names(op_k), op_k, xlab = "# clusters", ylab = "penalty")
So the optimal number of clusters according to the kgs function is the one with the minimum value of op_k, as you can see in the plot.
You can get the minimum penalty with
min(op_k)
Note that I set the maximum number of clusters allowed to 20. You can set this argument to NULL.
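Since the question starts from a correlation matrix with missing values, here is a self-contained sketch of how the pieces might fit together. The simulated matrix m, the pairwise-complete correlation, and the 1 - correlation distance are all assumptions made for illustration; the original code that builds distance is not shown in full.

```r
library(maptree)  # provides kgs()

set.seed(1)
# Hypothetical data: 50 observations of 12 variables, some values missing
m <- matrix(rnorm(50 * 12), nrow = 50)
m[sample(length(m), 30)] <- NA

# Pairwise-complete correlations tolerate the missing values
cors <- cor(m, use = "pairwise.complete.obs")

# Assumed conversion: strongly correlated variables count as "close"
distance <- as.dist(1 - cors)

clus <- hclust(distance)
op_k <- kgs(clus, distance, maxclus = 10)  # penalties named by cluster count
plot(names(op_k), op_k, xlab = "# clusters", ylab = "penalty")
```

Whether 1 - correlation or 1 - |correlation| is the better distance depends on whether negatively correlated variables should be treated as similar.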
Check this page for more methods.
Hope it helps you.
To find the optimal number of clusters, you can do
op_k[which(op_k == min(op_k))]
Also see this post for the perfect graphical answer from @Ben.
op_k[which(op_k == min(op_k))] still gives the penalty. To find the optimal number of clusters, use
as.integer(names(op_k[which(op_k == min(op_k))]))
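A slightly shorter variant uses which.min, which keeps the names and returns the first minimum if there are ties. The sketch below uses a hypothetical named vector standing in for op_k, and USArrests as stand-in data, to show how the result can be passed to cutree.

```r
# Hypothetical penalties named by cluster count, mimicking the shape of op_k
op_k <- c("2" = 5.1, "3" = 3.4, "4" = 2.9, "5" = 3.7, "6" = 4.8)

# which.min() returns the position of the smallest penalty, with its name
best_k <- as.integer(names(which.min(op_k)))
best_k

# Cut a tree into best_k groups (stand-in data, not the asker's)
clus <- hclust(dist(scale(USArrests)))
groups <- cutree(clus, k = best_k)
table(groups)
```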