R cluster analysis and dendrogram with correlation matrix

慢半拍i 2021-01-15 06:17

I have to perform a cluster analysis on a large amount of data. Since I have a lot of missing values, I computed a correlation matrix.

corloads = cor(df1[,2:185],          


        
2 Answers
  • 2021-01-15 06:54

    I'm happy to learn about the kgs function. Another option is the find_k function from the dendextend package (it uses the average silhouette width). Given the kgs function, I might just add it as another option to the package. Also note the dendextend::color_branches function, which colors the branches of your dendrogram according to the number of clusters you end up choosing (you can see more about this here: https://cran.r-project.org/web/packages/dendextend/vignettes/introduction.html#setting-a-dendrograms-branches )
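
    As a rough sketch of how find_k and color_branches could fit together: the mtcars data and the 1 - |r| distance below are illustrative assumptions, not the asker's data, and find_k additionally needs the fpc package to be installed.

    ```r
    library(dendextend)  # find_k() relies on the fpc package being installed

    # Illustrative data only: correlation-based distance between mtcars variables
    d <- as.dist(1 - abs(cor(mtcars)))
    dend <- as.dendrogram(hclust(d))

    # Estimate the number of clusters by average silhouette width
    k_est <- find_k(dend, krange = 2:6)
    k_est$k

    # Color the dendrogram branches by the estimated number of clusters
    dend_col <- color_branches(dend, k = k_est$k)
    plot(dend_col)
    ```

    The krange argument limits the candidate cluster counts; 2:6 is an arbitrary choice for this small example.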

  • 2021-01-15 07:01

    Several methods are available for determining the "optimal number of clusters", although the topic is controversial.

    The kgs function (from the maptree package) is helpful for finding the optimal number of clusters.

    Following your code, one would do:

    library(maptree)  # provides kgs()
    clus <- hclust(distance)
    op_k <- kgs(clus, distance, maxclus = 20)
    plot(names(op_k), op_k, xlab = "# clusters", ylab = "penalty")
    

    According to the kgs function, the optimal number of clusters is the one with the minimum penalty in op_k, as you can see in the plot. You can get the minimum penalty with

    min(op_k)
    

    Note that I set the maximum number of clusters allowed to 20. You can set this argument to NULL.

    Check this page for more methods.

    Hope it helps you.

    Edit

    To find which entry of op_k has the minimum penalty, you can do

    op_k[which(op_k == min(op_k))]
    

    Plus

    Also see this post for an excellent graphical answer from @Ben.

    Edit

    op_k[which(op_k == min(op_k))]
    

    still returns the penalty value. To get the optimal number of clusters itself, use

    as.integer(names(op_k[which(op_k == min(op_k))]))
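
    Putting the steps together, a self-contained sketch might look like the following. The simulated matrix stands in for df1, and both use = "pairwise.complete.obs" and the 1 - |r| distance are assumptions on my part, since the original cor() call and the definition of distance are not shown in full.

    ```r
    library(maptree)  # provides kgs()

    # Hypothetical stand-in for df1: 100 rows, 30 numeric columns, 5% missing
    set.seed(1)
    sim <- matrix(rnorm(100 * 30), nrow = 100,
                  dimnames = list(NULL, paste0("v", 1:30)))
    sim[sample(length(sim), 150)] <- NA

    # Pairwise-complete correlations tolerate the missing values (assumed choice)
    corloads <- cor(sim, use = "pairwise.complete.obs")

    # Assumed distance: 1 - |r|, so strongly correlated variables are close
    distance <- as.dist(1 - abs(corloads))

    clus <- hclust(distance)
    op_k <- kgs(clus, distance, maxclus = 10)

    # which.min() gives the position of the smallest penalty;
    # that element's name is the corresponding number of clusters
    best_k <- as.integer(names(op_k)[which.min(op_k)])

    # Cut the tree into best_k groups of variables
    groups <- cutree(clus, k = best_k)
    table(groups)
    ```

    Using which.min returns only the first minimum if several penalties tie, whereas which(op_k == min(op_k)) returns all of them.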
    