hierarchical-clustering

Dendextend: Regarding how to color a dendrogram’s labels according to defined groups

六月ゝ 毕业季﹏ submitted on 2019-12-10 18:22:50
Question: I'm trying to use an awesome R package named dendextend to plot a dendrogram and color its branches and labels according to a set of previously defined groups. I've read your answers on Stack Overflow and the FAQs in the dendextend vignette, but I'm still not sure how to achieve my goal. Let's imagine I have a dataframe whose first column holds the names of the individuals to use for the clustering, followed by several columns with the factors to be analyzed, and a last column with the group…
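The excerpt is cut off before any code appears, so here is a minimal sketch of the same idea, coloring each leaf label by a predefined group column, written in Python with SciPy and matplotlib rather than dendextend (a deliberately swapped-in stack; the data, names, groups, and palette below are made up for illustration):

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import linkage, dendrogram

    # hypothetical stand-ins for the asker's dataframe columns
    rng = np.random.default_rng(0)
    X = rng.normal(size=(10, 3))                 # the factor columns
    names = ["ind%d" % i for i in range(10)]     # the first column
    group = {n: ("A" if i < 5 else "B") for i, n in enumerate(names)}
    palette = {"A": "tab:red", "B": "tab:blue"}

    Z = linkage(X, method="ward")
    dendrogram(Z, labels=names)

    # color each leaf label according to its predefined group
    for lbl in plt.gca().get_xmajorticklabels():
        lbl.set_color(palette[group[lbl.get_text()]])
    plt.show()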

Alternative to scipy.cluster.hierarchy.cut_tree()

六月ゝ 毕业季﹏ submitted on 2019-12-10 17:12:12
Question: I was doing an agglomerative hierarchical clustering experiment in Python 3 and found that scipy.cluster.hierarchy.cut_tree() does not return the requested number of clusters for some input linkage matrices. So, by now I know there is a bug in the cut_tree() function (as described here). However, I need to be able to get a flat clustering with an assignment of k different labels to my datapoints. Do you know an algorithm to get a flat clustering with k labels from an arbitrary input linkage…
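The excerpt cuts off before any answer, but a common workaround (my suggestion, not necessarily the thread's accepted answer) is scipy.cluster.hierarchy.fcluster with criterion='maxclust', which searches for a cut height producing at most k flat clusters:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 4))        # hypothetical datapoints
    Z = linkage(X, method="average")    # any valid linkage matrix works

    k = 5
    # pick the smallest cut height that yields no more than k clusters
    labels = fcluster(Z, t=k, criterion="maxclust")
    print(np.unique(labels))            # up to k distinct labels

Note that maxclust can return fewer than k clusters when several merges happen at identical heights, so it is worth checking np.unique(labels).size.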

Cutting Dendrogram/Clustering Tree from SciPy at distance height

吃可爱长大的小学妹 submitted on 2019-12-10 16:25:31
Question: I'm trying to learn how to use dendrograms in Python with SciPy. I want to get clusters and be able to visualize them; I heard hierarchical clustering and dendrograms are the best way. How can I "cut" the tree at a specific distance? In this example, I just want to cut it at distance 1.6. I looked up a tutorial at https://joernhees.de/blog/2015/08/26/scipy-hierarchical-clustering-and-dendrogram-tutorial/#Inconsistency-Method but the author wrote a really confusing wrapper function using *…
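The cut itself needs no wrapper: fcluster with criterion='distance' returns flat cluster labels for a cut at a given height, and the same threshold can be drawn on the dendrogram. A minimal sketch with made-up data:

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

    rng = np.random.default_rng(0)
    Z = linkage(rng.normal(size=(30, 2)), method="ward")

    labels = fcluster(Z, t=1.6, criterion="distance")  # cut at height 1.6
    dendrogram(Z, color_threshold=1.6)   # branches below 1.6 share a color
    plt.axhline(y=1.6, color="grey", linestyle="--")   # visualize the cut
    plt.show()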

Memory Efficient Agglomerative Clustering with Linkage in Python

落爺英雄遲暮 submitted on 2019-12-10 10:18:05
Question: I want to cluster 2D points (latitude/longitude) on a map. The number of points is 400K, so the input matrix would be 400K x 2. When I run scikit-learn's AgglomerativeClustering I run out of memory, and my machine has about 500GB of RAM.

    class sklearn.cluster.AgglomerativeClustering(n_clusters=2, affinity='euclidean', memory=Memory(cachedir=None), connectivity=None, n_components=None, compute_full_tree='auto', linkage='ward', pooling_func=<function mean at 0x2b8085912398>)[source]

I also tried the…
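One memory-saving option (my suggestion, not necessarily the thread's answer) is to pass a sparse k-nearest-neighbor connectivity graph, so the algorithm only considers merges between nearby points instead of materializing a dense pairwise structure over all 400K points:

    import numpy as np
    from sklearn.cluster import AgglomerativeClustering
    from sklearn.neighbors import kneighbors_graph

    rng = np.random.default_rng(0)
    X = rng.uniform(size=(10_000, 2))   # stand-in for the 400K lat/lon points

    # sparse connectivity keeps the merge search local and the memory bounded
    conn = kneighbors_graph(X, n_neighbors=30, include_self=False)
    model = AgglomerativeClustering(n_clusters=50, linkage="ward",
                                    connectivity=conn)
    labels = model.fit_predict(X)

For true latitude/longitude, note that Euclidean distance on raw degrees is only an approximation; projecting the coordinates first is safer.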

How can I cluster thousands of documents using the R tm package?

…衆ロ難τιáo~ submitted on 2019-12-09 23:14:27
Question: I have about 25000 documents which need to be clustered, and I was hoping to be able to use the R tm package. Unfortunately I am running out of memory with about 20000 documents. The following function shows what I am trying to do using dummy data. I run out of memory when I call the function with n = 20 on a Windows machine with 16GB of RAM. Are there any optimizations I could make? Thank you for any help.

    make_clusters <- function(n) {
      require(tm)
      require(slam)
      docs <- unlist(lapply(letters…
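The usual culprit in this setting is densifying the sparse document-term matrix before computing distances: a full 25000 x 25000 distance matrix is itself several gigabytes. As an illustration of the sparse-first idea in Python (a plainly swapped-in stack: scikit-learn rather than tm, and mini-batch k-means rather than hierarchical clustering, since the latter needs that full distance matrix):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import MiniBatchKMeans

    # hypothetical dummy corpus standing in for the 25000 documents
    docs = ["word%d word%d word%d" % (i % 100, i % 7, i % 13)
            for i in range(25000)]

    X = TfidfVectorizer().fit_transform(docs)   # stays sparse end to end
    labels = MiniBatchKMeans(n_clusters=50, n_init=3,
                             random_state=0).fit_predict(X)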

Why does the WSS plot line (for optimizing the cluster analysis) fluctuate so much?

↘锁芯ラ submitted on 2019-12-08 12:35:04
Question: I have a cluster plot in R, and I want to optimize the clustering with the "elbow criterion", so I drew a WSS plot for my clusters. But it looks really strange, and I cannot tell where the elbow is or how many clusters I should choose; could anyone help me? Here is my data:

    Friendly <- c(0.533,0.854,0.9585,0.925,0.9125,0.9815,0.9645,0.981,0.9935,0.9585,0.996,0.956,0.9415)
    Polite <- c(0,0.45,0.977,0.9915,0.929,0.981,0.9895,0.9875,1,0.96,0.996,0.873,0.9125)
    Praising <- c(0,0,0.437,0.9585,0.9415,0.9605,0.998,0.998…
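A fluctuating WSS curve usually means each k was fitted from a single random initialization; keeping the best of many restarts per k makes the curve non-increasing and the elbow easier to read. A sketch of that fix in Python (scikit-learn in place of R's kmeans, with random stand-in data; in R the analogous knob is nstart):

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    X = rng.normal(size=(13, 3))   # stand-in for the 13 rated items

    # inertia_ is the within-cluster sum of squares; n_init=25 keeps the
    # best of 25 random starts per k, which smooths the curve
    ks = range(1, 10)
    wss = [KMeans(n_clusters=k, n_init=25, random_state=0).fit(X).inertia_
           for k in ks]
    plt.plot(list(ks), wss, marker="o")
    plt.xlabel("number of clusters k")
    plt.ylabel("within-cluster sum of squares")
    plt.show()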

Antipole Clustering

强颜欢笑 submitted on 2019-12-08 09:39:47
Question: I made a photo mosaic script (PHP). The script takes one picture and rebuilds it out of many little pictures: from a distance it looks like the original, and when you move closer you see it is made up of little pictures. I take a square of a fixed number of pixels and determine the average color of that square. Then I compare this with my database, which contains the average color of a couple thousand pictures. I determine the color distance with all available images. But to run this…
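The title suggests the asker is considering an Antipole tree to avoid a linear scan over all tile colors. A more widely available alternative with the same effect is any metric index over the 3D color vectors, for example a k-d tree; a sketch in Python (SciPy rather than PHP, with random stand-in colors):

    import numpy as np
    from scipy.spatial import cKDTree

    # hypothetical database: average RGB color of a few thousand tile images
    rng = np.random.default_rng(0)
    tile_colors = rng.integers(0, 256, size=(5000, 3)).astype(float)

    tree = cKDTree(tile_colors)   # build once, reuse for every square

    def best_tile(avg_rgb):
        # nearest neighbor in color space, O(log n) instead of a full scan
        _, idx = tree.query(avg_rgb)
        return int(idx)

    print(best_tile([120.0, 64.0, 200.0]))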

HDBSCAN Python choose number of clusters

十年热恋 submitted on 2019-12-08 03:56:23
Question: Is it possible to select the number of clusters in the HDBSCAN algorithm in Python? Or is the only way to play around with input parameters such as alpha and min_cluster_size? Thanks.

UPDATE: here is the code to use fcluster and hdbscan:

    import hdbscan
    from scipy.cluster.hierarchy import fcluster

    clusterer = hdbscan.HDBSCAN()
    clusterer.fit(X)
    Z = clusterer.single_linkage_tree_.to_numpy()
    labels = fcluster(Z, 2, criterion='maxclust')

Answer 1: If you explicitly need to get a fixed number of clusters…
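For reference, hdbscan's single-linkage tree also exposes a cut directly; if I recall the API correctly (treat the call below as an assumption to verify against the hdbscan docs), it looks like this:

    import numpy as np
    import hdbscan
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=200, centers=4, random_state=0)
    clusterer = hdbscan.HDBSCAN().fit(X)

    # cut the single-linkage tree at a fixed distance; points in clusters
    # smaller than min_cluster_size come back labeled -1 (noise)
    labels = clusterer.single_linkage_tree_.get_clusters(
        cut_distance=0.5, min_cluster_size=5)
    print(np.unique(labels))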

Hierarchical Agglomerative clustering in Spark

﹥>﹥吖頭↗ submitted on 2019-12-07 20:24:23
Question: I am working on a clustering problem, and it has to scale to a lot of data. I would like to try hierarchical clustering in Spark and compare my results with other methods. I have done some research on the web about using hierarchical clustering with Spark but haven't found any promising information. If anyone has some insight about it, I would be very grateful. Thank you.

Answer 1: The bisecting k-means approach seems to do a decent job, and runs quite fast in terms of performance. Here is a…
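The answer is cut off before its code, so here is a minimal PySpark sketch of bisecting k-means, Spark ML's built-in divisive (top-down) hierarchical algorithm; the toy data, column names, and k=2 are my own stand-ins:

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.clustering import BisectingKMeans

    spark = SparkSession.builder.appName("bisecting-kmeans").getOrCreate()
    df = spark.createDataFrame(
        [(0.0, 0.1), (0.2, 0.1), (9.0, 9.1), (9.2, 9.0)], ["x", "y"])

    # assemble raw columns into the single vector column Spark ML expects
    features = VectorAssembler(inputCols=["x", "y"],
                               outputCol="features").transform(df)
    model = BisectingKMeans(k=2, seed=1).fit(features)
    model.transform(features).select("x", "y", "prediction").show()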

Pruning dendrogram at levels in Scipy Hierarchical Clustering

家住魔仙堡 submitted on 2019-12-07 18:27:59
Question: I have a lot of data points which are clustered in the following way using SciPy hierarchical clustering. Let's say I want to prune the dendrogram at level 1500. How do I do that? (I've tried using the 'p' parameter, and that is not what I'm expecting.)

    Z = dendrogram(linkage_matrix, truncate_mode='lastp',
                   color_threshold=1, labels=df.session.tolist(),
                   distance_sort='ascending')
    plt.title("Hierarchical Clustering")
    plt.show()

Answer 1: As specified in the SciPy documentation, if a cluster node is under…
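The p parameter truncates by leaf count, not by height, which is likely why it did not behave as expected. One workaround (my sketch, not the thread's answer, with synthetic data standing in for the asker's linkage_matrix) is to count the clusters that survive a cut at the desired height and pass that count as p:

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

    rng = np.random.default_rng(0)
    linkage_matrix = linkage(rng.normal(size=(80, 4)), method="ward")

    height = 10.0   # the asker would use 1500 for their data
    # number of clusters remaining after a cut at that height
    p = int(fcluster(linkage_matrix, t=height, criterion="distance").max())

    dendrogram(linkage_matrix, truncate_mode="lastp", p=p,
               color_threshold=height)
    plt.title("Hierarchical Clustering (pruned at height %.0f)" % height)
    plt.show()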