hierarchical-clustering

How to calculate clustering entropy? A working example or software code [closed]

六眼飞鱼酱① submitted on 2019-12-18 12:02:41

Question: I would like to calculate the entropy of this example scheme: http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html Can anybody please explain step by step with real values? I know there are an unlimited number of formulas, but I am really bad at understanding formulas :) For example, in the…
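A minimal sketch of the computation, assuming the class counts from Figure 16.4 of the linked IR-book example (cluster 1: five x's and one o; cluster 2: one x, four o's, one diamond; cluster 3: two x's and three diamonds). The entropy of a clustering is the size-weighted average of each cluster's class-distribution entropy:

```python
import math

# Per-cluster class counts, read off Figure 16.4 of the IR book:
# cluster 1 holds 5 x's and 1 o; cluster 2 holds 1 x, 4 o's, 1 diamond;
# cluster 3 holds 2 x's and 3 diamonds.
clusters = [
    {"x": 5, "o": 1},
    {"x": 1, "o": 4, "d": 1},
    {"x": 2, "d": 3},
]

def cluster_entropy(counts):
    """Entropy of one cluster's class distribution (log base 2)."""
    n = sum(counts.values())
    return -sum((c / n) * math.log2(c / n) for c in counts.values() if c > 0)

def total_entropy(clusters):
    """Average of per-cluster entropies, weighted by cluster size."""
    n_total = sum(sum(c.values()) for c in clusters)
    return sum(sum(c.values()) / n_total * cluster_entropy(c) for c in clusters)

for i, c in enumerate(clusters, 1):
    print(f"cluster {i}: entropy = {cluster_entropy(c):.3f}")
print(f"total entropy = {total_entropy(clusters):.3f}")
```

A perfectly pure cluster (one class only) has entropy 0; the total here comes out around 0.96 bits, reflecting how mixed the three clusters are.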

Text clustering with Levenshtein distances

我们两清 submitted on 2019-12-17 10:17:22

Question: I have a set (2k-4k) of small strings (3-6 characters) and I want to cluster them. Since I use strings, previous answers on "How does clustering (especially String clustering) work?" informed me that Levenshtein distance is a good distance function for strings. Also, since I do not know the number of clusters in advance, hierarchical clustering is the way to go, not k-means. Although I understand the problem in its abstract form, I do not know the easiest way to actually do…
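A sketch of the first step under these assumptions: a pure-Python Levenshtein function, and the condensed pairwise-distance vector in the order scipy's linkage() expects (the example strings are placeholders):

```python
from itertools import combinations

def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert/delete/substitute),
    keeping only two rows of the DP table at a time."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

strings = ["cat", "cart", "card", "dog", "dot"]  # stand-in data

# Condensed pairwise-distance vector in the order scipy's linkage() expects:
# d(0,1), d(0,2), ..., d(0,n-1), d(1,2), ...
condensed = [levenshtein(s, t) for s, t in combinations(strings, 2)]
print(condensed)
```

The condensed vector can then go to scipy.cluster.hierarchy.linkage(condensed, method="average"), followed by fcluster() with a distance threshold, which sidesteps having to choose a cluster count up front.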

Use Distance Matrix in scipy.cluster.hierarchy.linkage()?

风流意气都作罢 submitted on 2019-12-17 06:34:25

Question: I have an n*n distance matrix M where M_ij is the distance between object_i and object_j. As expected, it takes the following form:

    /  0     M_01  M_02  ...  M_0n \
    |  M_10  0     M_12  ...  M_1n |
    |  M_20  M_21  0     ...  M_2n |
    |  ...                         |
    \  M_n0  M_n1  M_n2  ...  0    /

Now I wish to cluster these n objects with hierarchical clustering. Python has an implementation of this called scipy.cluster.hierarchy.linkage(y, method='single', metric='euclidean'). Its documentation says: y must be a {n \choose 2} sized vector where…
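A minimal illustration of the conversion linkage() needs: flatten the upper triangle of the square matrix (diagonal excluded, row by row) into the {n choose 2}-length condensed vector. scipy.spatial.distance.squareform does exactly this; a pure-Python sketch makes the ordering explicit:

```python
def square_to_condensed(M):
    """Flatten a symmetric n*n distance matrix into the {n choose 2}-length
    condensed vector that scipy.cluster.hierarchy.linkage() expects:
    the upper triangle read row by row, diagonal excluded."""
    n = len(M)
    return [M[i][j] for i in range(n) for j in range(i + 1, n)]

M = [[0, 2, 4],
     [2, 0, 6],
     [4, 6, 0]]
print(square_to_condensed(M))  # [2, 4, 6]
```

The equivalent one-liner is scipy.spatial.distance.squareform(M), and the result can be passed straight to linkage(condensed, method='single').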

Parallel construction of a distance matrix

坚强是说给别人听的谎言 submitted on 2019-12-14 04:28:05

Question: I work on hierarchical agglomerative clustering of large numbers of multidimensional vectors, and I noticed that the biggest bottleneck is the construction of the distance matrix. A naive implementation of this task is the following (here in Python):

    '''v = an array (N,d), where rows are the observations and columns the dimensions'''
    def create_dist_matrix(v):
        N = v.shape[0]
        D = np.zeros((N,N))
        for i in range(N):
            for j in range(i+1):
                D[i,j] = cosine(v[i,:], v[j,:])  # scipy.spatial.distance…
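One way to split the work is by rows of the triangle, sketched below with a thread pool for portability. Note the caveat: for pure-Python arithmetic the GIL prevents a real speedup, so in practice scipy.spatial.distance.pdist (vectorized C) or a process pool is the better fix; this only illustrates the decomposition. The vectors are made-up stand-in data:

```python
import math
from concurrent.futures import ThreadPoolExecutor

def cosine_distance(u, v):
    """1 - cosine similarity (assumes non-zero vectors)."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return 1.0 - dot / (nu * nv)

def distance_row(args):
    """All distances d(i, j) for a fixed i and j < i (lower triangle)."""
    i, v = args
    return i, [cosine_distance(v[i], v[j]) for j in range(i)]

def create_dist_matrix(v):
    n = len(v)
    D = [[0.0] * n for _ in range(n)]
    # Rows of the triangle are independent, so they can be computed concurrently.
    with ThreadPoolExecutor() as pool:
        for i, row in pool.map(distance_row, ((i, v) for i in range(n))):
            for j, d in enumerate(row):
                D[i][j] = D[j][i] = d  # distance is symmetric
    return D

vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # stand-in observations
D = create_dist_matrix(vectors)
```

With real multidimensional data, replacing the whole function with scipy.spatial.distance.pdist(v, metric='cosine') is usually both simpler and far faster, since it moves the double loop into compiled code.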

Obtain the Clustered Documents of DBSCAN

青春壹個敷衍的年華 submitted on 2019-12-13 11:24:11

Question: I attempted to use DBSCAN (from scikit-learn) to cluster text documents. I use TF-IDF (TfidfVectorizer in sklearn) to create the features of each document. However, I have not found a way to obtain (print) the documents that are clustered by DBSCAN. The DBSCAN in sklearn provides an attribute called labels_, which gives the cluster group labels (e.g. 1, 2, 3, -1 for noise). But I want to get the documents that are clustered by DBSCAN, not the cluster group labels. To…
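A sketch of the grouping step, assuming labels_ has already been obtained from a fitted DBSCAN (the documents and label values below are made up; in real code they come from the corpus and db.labels_):

```python
from collections import defaultdict

documents = ["apple pie recipe", "apple tart recipe",
             "python dbscan tutorial", "python sklearn guide",
             "random unrelated note"]
# Hypothetical labels as a fitted DBSCAN would produce; -1 marks noise points.
labels = [0, 0, 1, 1, -1]

# zip() pairs each document with its label, since labels_ is index-aligned
# with the rows of the TF-IDF matrix the documents were vectorized into.
clusters = defaultdict(list)
for doc, label in zip(documents, labels):
    clusters[label].append(doc)

for label, docs in sorted(clusters.items()):
    name = "noise" if label == -1 else f"cluster {label}"
    print(f"{name}: {docs}")
```

The key point is that labels_ is index-aligned with the input rows, so grouping indices (or documents) by label value recovers the cluster contents.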

Label Ordering in Scipy Dendrogram

人盡茶涼 submitted on 2019-12-13 03:17:22

Question: In Python, I have an N by N distance matrix dmat, where dmat[i,j] encodes the distance from entity i to entity j. I'd like to view a dendrogram. I did:

    from scipy.cluster.hierarchy import dendrogram, linkage
    import matplotlib.pylab as plt
    labels = [name of entity 1, 2, 3, ...]
    Z = linkage(dmat)
    dn = dendrogram(Z, labels=labels)
    plt.show()

But the label ordering looks wrong. There are entities that are very close according to dmat, but that is not reflected in the dendrogram. What's going on?

Answer 1: The first…
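A likely culprit: linkage() interprets a square 2-D array as N observation vectors, not as distances, so the dendrogram reflects euclidean distances between the matrix's rows rather than the intended dmat entries. Converting with squareform() first fixes this; a toy sketch (no_plot=True is used only to inspect the leaf order without drawing):

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform

# Toy symmetric distance matrix: entities "a" and "b" are close, "c" is far.
dmat = np.array([[0.0, 1.0, 9.0],
                 [1.0, 0.0, 8.0],
                 [9.0, 8.0, 0.0]])
labels = ["a", "b", "c"]

# squareform() turns the square matrix into the condensed form linkage()
# expects. Passing dmat directly would silently treat each ROW of dmat as
# a coordinate vector, which is the usual cause of "wrong" label ordering.
Z = linkage(squareform(dmat), method="average")
dn = dendrogram(Z, labels=labels, no_plot=True)
print(dn["ivl"])  # leaf labels in dendrogram order
```

With the condensed input, "a" and "b" merge first and appear adjacent in the leaf ordering, as the distances say they should.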

How to change the label size of an R plot

六月ゝ 毕业季﹏ submitted on 2019-12-13 01:26:12

Question: I'm making a cluster plot from my data. I have the entire plot finished, but my label text is too large to be able to properly read the plot. Does anyone have any idea how to make the labels smaller? I am using the package "sparcl", and my call is:

    ColorDendrogram(fit, y = col.int, main = "Clusters from 216 samples",
                    branchlength = 0.20, labels = fit$labels, xlab = NULL,
                    sub = NULL, ylab = "", cex.main = NULL)

As you can see, the branch text is too big and the labels fall over each other. I want the text of the…

OpenCV 2.4.5: FLANN and hierarchicalClustering

一曲冷凌霜 submitted on 2019-12-12 20:54:34

Question: I have recently started porting an application to a new platform which runs OpenCV 2.4.5. The part of my code that uses OpenCV's implementation of FLANN to do hierarchical clustering no longer compiles. The code is as follows:

    cv::Mat mergedFeatures = cvCreateMat(descriptorTotal, descriptorDims, CV_32F);
    int counter = 0;
    for (uint j = 0; j < ImageFeatures.size(); j++) {
        cv::Mat features = ImageFeatures[j];
        for (int k = 0; k < features.rows; k++) {
            cv::Mat roi = mergedFeatures.row(counter);…

Plotting hierarchical clustering dendrograms for large data sets

守給你的承諾、 submitted on 2019-12-12 06:56:33

Question: I have a huge data set of time series data. In order to visualise the clustering in Python, I want to plot time series graphs alongside the dendrogram, as shown below. I tried to do it using the subplot2grid() function in matplotlib, creating two subplots side by side. I filled the first with the series graphs and the second with the dendrogram, but once the number of time series increased, the plot became too blurry to read. Can someone suggest a nice way to plot this type of dendrogram? I have around 50,000 time series to…
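One practical route for large collections: scipy's truncate_mode collapses the tree so that only the last p merges appear as leaves, keeping the figure readable regardless of how many series go in. A sketch on random stand-in data (no_plot=True is used here only to show the resulting structure; dropping it draws the figure via matplotlib):

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

rng = np.random.default_rng(0)
series = rng.standard_normal((500, 20))  # 500 short time series (stand-in data)

Z = linkage(series, method="ward")

# truncate_mode='lastp' keeps only the last p merged clusters as leaves;
# each leaf is drawn as "(k)", the number of original series it contains.
dn = dendrogram(Z, truncate_mode="lastp", p=30, no_plot=True)
print(len(dn["ivl"]))  # 30 leaves instead of 500
```

For 50,000 series, truncating this way (or clustering a representative sample first) is usually the only way to get a legible dendrogram; pairing it with a few example series per truncated leaf gives much of the effect of plotting everything.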

Generating a co-occurrence matrix in R on a LARGE dataset

妖精的绣舞 submitted on 2019-12-12 06:13:12

Question: I'm trying to create a co-occurrence matrix in R from a very large dataset (26M lines) that looks basically like this:

    ID     Observation
    11000  ficus
    11112  cherry
    11112  ficus
    12223  juniper
    12223  olive
    12223  juniper
    12223  ficus
    12334  olive
    12334  cherry
    12334  olive
    ...    ...

And on for a long time. I want to consolidate the observations by ID and generate a co-occurrence matrix of observations that share an ID. I managed this on a subset of the data, but some of the stuff I did "manually" that it…
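The consolidate-then-count-pairs idea, sketched in Python for compactness (the same approach maps to R, e.g. split() per ID followed by pair counting); the rows are the sample from the question. Because it streams over the data one ID group at a time and stores only observed pairs, it scales to many millions of lines without materialising a full matrix:

```python
from collections import Counter, defaultdict
from itertools import combinations

rows = [
    (11000, "ficus"), (11112, "cherry"), (11112, "ficus"),
    (12223, "juniper"), (12223, "olive"), (12223, "juniper"),
    (12223, "ficus"), (12334, "olive"), (12334, "cherry"),
    (12334, "olive"),
]

# 1. Consolidate: the distinct set of observations seen under each ID
#    (using a set also deduplicates repeats like the two junipers of 12223).
by_id = defaultdict(set)
for obs_id, obs in rows:
    by_id[obs_id].add(obs)

# 2. Count unordered pairs within each ID's set; sorting makes (a, b)
#    and (b, a) land on the same key.
cooc = Counter()
for obs_set in by_id.values():
    for a, b in combinations(sorted(obs_set), 2):
        cooc[(a, b)] += 1

print(cooc)
```

The sparse Counter can be pivoted into a dense matrix at the end if one is needed; for 26M input lines the pair counts are typically far smaller than the raw data.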