hierarchical-clustering

Color dendrogram branches based on external labels, up towards the root until the label matches

Submitted by 喜欢而已 on 2019-12-07 14:42:31
Question: From the question Color branches of dendrogram using an existing column, I can color the branches near the leaves of the dendrogram. The code:

    x <- 1:100
    dim(x) <- c(10, 10)
    set.seed(1)
    groups <- c("red", "red", "red", "red", "blue", "blue", "blue", "blue", "red", "blue")
    x.clust <- as.dendrogram(hclust(dist(x)))
    x.clust.dend <- x.clust
    labels_colors(x.clust.dend) <- groups
    x.clust.dend <- assign_values_to_leaves_edgePar(x.clust.dend, value = groups, edgePar = "col") # add the colors.
    x.clust.dend <- assign…
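
For readers working in Python rather than R/dendextend, a rough scipy analogue of the same idea is sketched below (not part of the original question): dendrogram's link_color_func receives each internal cluster id, and a link is colored red or blue only when every leaf below it carries that label, falling back to grey otherwise. The data and leaf labels here are made up for illustration.

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import linkage, dendrogram, to_tree

    # Made-up data and leaf labels, standing in for the 10 rows of the R example.
    rng = np.random.default_rng(1)
    X = rng.normal(size=(10, 10))
    leaf_colors = ["red", "red", "red", "red", "blue",
                   "blue", "blue", "blue", "red", "blue"]

    Z = linkage(X, method="complete")
    root, nodes = to_tree(Z, rd=True)   # nodes[k] is the cluster node with id k

    def link_color(cluster_id):
        # Color a link only if all leaves under it share the same label.
        labels = {leaf_colors[leaf] for leaf in nodes[cluster_id].pre_order()}
        return labels.pop() if len(labels) == 1 else "grey"

    dendrogram(Z, link_color_func=link_color,
               labels=[f"leaf{i}" for i in range(10)])
    plt.show()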

Hierarchical Agglomerative clustering in Spark

Submitted by 佐手、 on 2019-12-06 14:16:55
I am working on a clustering problem and it has to be scalable for a lot of data. I would like to try hierarchical clustering in Spark and compare my results with other methods. I have done some research on the web about using hierarchical clustering with Spark but haven't found any promising information. If anyone has some insight about it, I would be very grateful. Thank you.

Answer (Gabe Church): The bisecting k-means approach seems to do a decent job and runs quite fast in terms of performance. Here is a sample code I wrote for using the bisecting k-means algorithm in Spark (Scala) to get cluster…
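
A minimal PySpark sketch of the same idea (the original answer is in Scala; the toy data and column names below are only illustrative). Bisecting k-means is a divisive, top-down hierarchical method, which is why it scales better than classical agglomerative clustering.

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.clustering import BisectingKMeans

    spark = SparkSession.builder.appName("bisecting-kmeans-demo").getOrCreate()

    # Toy data with two numeric features.
    df = spark.createDataFrame(
        [(0.0, 0.1), (0.2, 0.0), (8.0, 8.1), (8.2, 7.9), (15.0, 15.2)],
        ["x", "y"],
    )

    # Assemble the feature columns into the single vector column Spark ML expects.
    assembler = VectorAssembler(inputCols=["x", "y"], outputCol="features")
    features_df = assembler.transform(df)

    # Divisive (top-down) hierarchical clustering via repeated bisection.
    bkm = BisectingKMeans(k=3, seed=1)
    model = bkm.fit(features_df)

    model.transform(features_df).select("x", "y", "prediction").show()
    spark.stop()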

How to hierarchically cluster a data matrix in R?

Submitted by 点点圈 on 2019-12-06 12:02:52
Question: I am trying to cluster a data matrix produced from scientific data. I know how I want the clustering done, but am not sure how to accomplish this feat in R. Here is what the data looks like:

                A1 A2 A3 B1 B2 B3 C1 C2 C3
    sample1      1  9 10  2  1 29  2  5 44
    sample2      8  1 82  2  8  2  8  2 28
    sample3      9  9 19  2  8  1  7  2 27

Please consider A1, A2, and A3 to be three replicates of a single treatment, and likewise with B and C. The samples are the different tested variables. So, I want to hierarchically cluster this matrix in…
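
For comparison, a rough Python/scipy equivalent of clustering the rows of such a matrix (the question itself is about R; this is only an illustrative sketch using the same numbers):

    import pandas as pd
    from scipy.cluster.hierarchy import linkage, fcluster

    data = pd.DataFrame(
        [[1, 9, 10, 2, 1, 29, 2, 5, 44],
         [8, 1, 82, 2, 8, 2, 8, 2, 28],
         [9, 9, 19, 2, 8, 1, 7, 2, 27]],
        index=["sample1", "sample2", "sample3"],
        columns=["A1", "A2", "A3", "B1", "B2", "B3", "C1", "C2", "C3"],
    )

    # Cluster the samples (rows): average linkage on Euclidean distances.
    Z = linkage(data.values, method="average", metric="euclidean")

    # Flat cluster labels for, e.g., 2 clusters.
    labels = fcluster(Z, t=2, criterion="maxclust")
    print(dict(zip(data.index, labels)))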

Pruning dendrogram at levels in Scipy Hierarchical Clustering

Submitted by 烂漫一生 on 2019-12-06 09:23:20
I have a lot of data points which are clustered in the following way using SciPy hierarchical clustering. Let's say I want to prune the dendrogram at level '1500'. How do I do that? (I've tried using the 'p' parameter and that is not what I'm expecting.)

    Z = dendrogram(linkage_matrix, truncate_mode='lastp',
                   color_threshold=1, labels=df.session.tolist(),
                   distance_sort='ascending')
    plt.title("Hierarchical Clustering")
    plt.show()

As specified in the scipy documentation, if a cluster node is under color_threshold, then all of its descendants will be the same color (not blue). The links connecting nodes…
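
One way to read the question, sketched below with stand-in data rather than the asker's sessions: cut the tree at the desired height with fcluster(criterion='distance') and pass the same height to the dendrogram's color_threshold; the variable names and cut height here are assumptions, not the asker's code.

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

    # Stand-in data; in the question, linkage_matrix is built from real sessions.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 4))
    linkage_matrix = linkage(X, method="ward")

    cut_height = 1500  # the height at which the question wants to prune

    # Flat cluster labels: every merge above cut_height is "pruned" away.
    labels = fcluster(linkage_matrix, t=cut_height, criterion="distance")

    # Full dendrogram, colored below the cut, with a reference line at the cut.
    dendrogram(linkage_matrix, color_threshold=cut_height)
    plt.axhline(y=cut_height, color="grey", linestyle="--")
    plt.title("Hierarchical Clustering")
    plt.show()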

Extract the hierarchical structure of the nodes in a dendrogram or cluster

Submitted by 强颜欢笑 on 2019-12-06 06:00:44
Question: I would like to extract the hierarchical structure of the nodes of a dendrogram or cluster. For example, in the next example:

    library(dendextend)
    dend15 <- c(1:5) %>% dist %>% hclust(method = "average") %>% as.dendrogram
    dend15 %>% plot

The nodes are classified according to their position in the dendrogram (see figure below; the figure is taken from the dendextend package's tutorial). I would like to get all the nodes for each final leaf as the next output (the labels are ordered from left to right…
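
A rough scipy counterpart (Python rather than R/dendextend) of walking the tree and listing, for every leaf, the internal nodes above it; the traversal below is only an illustrative sketch on the same five points.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, to_tree

    X = np.arange(1, 6).reshape(-1, 1)      # the points 1..5, as in the R example
    Z = linkage(X, method="average")

    root, nodes = to_tree(Z, rd=True)       # nodes[k] is the cluster node with id k

    paths = {leaf: [] for leaf in range(len(X))}

    def walk(node, ancestors):
        # Record the ancestor (internal-node) ids for every leaf under `node`.
        if node.is_leaf():
            paths[node.id] = list(ancestors)
            return
        walk(node.get_left(), ancestors + [node.id])
        walk(node.get_right(), ancestors + [node.id])

    walk(root, [])
    for leaf, ancestry in sorted(paths.items()):
        print(f"leaf {leaf}: internal nodes {ancestry}")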

Interpreting the output of SciPy's hierarchical clustering dendrogram? (maybe found a bug…)

Submitted by 末鹿安然 on 2019-12-06 04:10:18
I am trying to figure out how the output of scipy.cluster.hierarchy.dendrogram works... I thought I knew how it worked, and I was able to use the output to reconstruct the dendrogram, but it seems as if I am not understanding it anymore, or there is a bug in Python 3's version of this module. The answer to "how do I get the subtrees of dendrogram made by scipy.cluster.hierarchy" implies that the dendrogram output dictionary gives dict_keys(['icoord', 'ivl', 'color_list', 'leaves', 'dcoord']), all of the same size, so you can zip them and plt.plot them to reconstruct the dendrogram. Seems simple…
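
The reconstruction idea the asker refers to can be sketched as follows, on made-up data: each entry of icoord/dcoord describes one merge (one inverted U), so plotting those pairs redraws the tree. Note that 'ivl' and 'leaves' have one entry per leaf rather than per merge, which is worth keeping in mind when zipping the lists.

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import linkage, dendrogram

    rng = np.random.default_rng(0)
    X = rng.normal(size=(10, 3))
    Z = linkage(X, method="ward")

    # Compute the layout without drawing it.
    ddata = dendrogram(Z, no_plot=True)

    # One (xs, ys) pair per merge: plot them to redraw the tree.
    for xs, ys, color in zip(ddata["icoord"], ddata["dcoord"], ddata["color_list"]):
        plt.plot(xs, ys, color=color)

    # Leaf labels sit at x = 5, 15, 25, ... in the order given by ddata['ivl'].
    plt.xticks(np.arange(5, 10 * len(ddata["ivl"]), 10), ddata["ivl"])
    plt.show()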

sklearn agglomerative clustering with distance linkage criterion

Submitted by 痴心易碎 on 2019-12-06 01:39:49
I usually use scipy.cluster.hierarchy's linkage and fcluster functions to get cluster labels. However, sklearn.cluster.AgglomerativeClustering can also take structural information into account via a connectivity matrix, for example a knn_graph input, which makes it interesting for my current application. However, I usually assign labels in fcluster by either a 'distance' or 'inconsistent' criterion, and AFAIK the AgglomerativeClustering function in sklearn only has the option to define the number of desired clusters (i.e. criterion='maxclust' in the scipy library). I am…
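
One relevant detail, sketched below: scikit-learn releases from 0.21 onward accept n_clusters=None together with distance_threshold, which plays the role of scipy's criterion='distance' while still allowing a connectivity constraint. The data and threshold in this sketch are illustrative only.

    import numpy as np
    from sklearn.cluster import AgglomerativeClustering
    from sklearn.neighbors import kneighbors_graph

    # Two made-up blobs.
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])

    # Optional structural information, as in the question.
    connectivity = kneighbors_graph(X, n_neighbors=5, include_self=False)

    model = AgglomerativeClustering(
        n_clusters=None,          # let the threshold decide the number of clusters
        distance_threshold=2.0,   # merges above this distance are not performed
        linkage="ward",
        connectivity=connectivity,
    )
    labels = model.fit_predict(X)
    print(labels)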

Memory Efficient Agglomerative Clustering with Linkage in Python

Submitted by 陌路散爱 on 2019-12-05 21:48:28
I want to cluster 2D points (latitude/longitude) on a map. The number of points is 400K, so the input matrix would be 400k x 2. When I run scikit-learn's AgglomerativeClustering I run out of memory, and my machine has about 500GB of memory.

    class sklearn.cluster.AgglomerativeClustering(n_clusters=2, affinity='euclidean', memory=Memory(cachedir=None), connectivity=None, n_components=None, compute_full_tree='auto', linkage='ward', pooling_func=<function mean at 0x2b8085912398>)

I also tried the memory=Memory(cachedir) option with no success. Does anybody have a suggestion (another library or a change…
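
A common memory workaround, shown here as a sketch on a small stand-in array rather than the asker's 400K points: pass a sparse k-nearest-neighbour connectivity graph so the algorithm only considers local merges instead of all pairs (note that plain Euclidean distance on latitude/longitude is itself an approximation).

    import numpy as np
    from sklearn.cluster import AgglomerativeClustering
    from sklearn.neighbors import kneighbors_graph

    # Stand-in coordinates in a small lat/lon box.
    rng = np.random.default_rng(0)
    coords = rng.uniform(low=[40.0, -75.0], high=[41.0, -73.0], size=(10_000, 2))

    # Sparse graph: each point may only merge with clusters near its 30 neighbours.
    connectivity = kneighbors_graph(coords, n_neighbors=30, include_self=False)

    model = AgglomerativeClustering(
        n_clusters=50,
        linkage="ward",
        connectivity=connectivity,
    )
    labels = model.fit_predict(coords)
    print(np.bincount(labels)[:10])   # sizes of the first few clusters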

How to know about group information in cluster analysis (hierarchical)?

Submitted by 南笙酒味 on 2019-12-05 05:47:50
Question: I have a problem about groups in cluster analysis (hierarchical clustering). As an example, this is the dendrogram of complete linkage of the iris data set. After I use

    table(cutree(hc, 3), iris$Species)

this is the output:

        setosa versicolor virginica
      1     50          0         0
      2      0         23        49
      3      0         27         1

I have read on one statistics website that object 1 in the data always belongs to group/cluster 1. From the output above, we know that setosa is in group 1. Then, how am I going to know about the other two species? How do…
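
For comparison, an analogous cross-tabulation in Python (the question itself uses R's cutree and table); this sketch assumes scikit-learn's bundled iris data and complete linkage, so the counts may differ slightly from the asker's output.

    import pandas as pd
    from sklearn.datasets import load_iris
    from scipy.cluster.hierarchy import linkage, fcluster

    iris = load_iris(as_frame=True)

    # Complete-linkage tree over the four measurements, cut into 3 flat groups.
    Z = linkage(iris.data.values, method="complete")
    groups = fcluster(Z, t=3, criterion="maxclust")

    # Cross-tabulate cluster labels against the known species.
    species = iris.target_names[iris.target]
    print(pd.crosstab(groups, species, rownames=["cluster"], colnames=["species"]))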