hierarchical-clustering

Drawbacks of K-Medoid (PAM) Algorithm

Submitted by 社会主义新天地 on 2019-12-12 03:55:53
Question: From what I have read, the K-medoid algorithm (PAM) is a partition-based clustering algorithm and a variant of K-means. It addresses some of the problems of K-means, such as producing empty clusters and sensitivity to outliers/noise. However, the time complexity of K-medoid is O(n^2), unlike K-means (Lloyd's algorithm), which has a time complexity of O(n). I would like to ask whether there are other drawbacks of the K-medoid algorithm aside from its time complexity. Answer 1: The main disadvantage of K-Medoid
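Beyond the truncated answer above, the O(n^2) cost is easiest to see in code: PAM-style medoid updates need distances between all pairs of points. A minimal sketch, assuming numpy/scipy and toy data (update_medoid is an illustrative helper name, not a library function):

    import numpy as np
    from scipy.spatial.distance import pdist, squareform

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 4))            # 500 toy points, 4 features
    D = squareform(pdist(X))                  # full n x n distance matrix: the O(n^2) part

    def update_medoid(D, members):
        # pick the member minimising total distance to the other members of its cluster
        sub = D[np.ix_(members, members)]
        return members[np.argmin(sub.sum(axis=1))]

    print(update_medoid(D, np.arange(100)))   # medoid index of the first 100 points

The distance matrix alone is n^2 values, so memory as well as time becomes a limiting drawback on large datasets; sampling variants such as CLARA and CLARANS exist largely to work around this.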

How to calculate clustering entropy - example and my solution given but is it correct? [closed]

Submitted by 只谈情不闲聊 on 2019-12-12 03:35:13
Question: [Closed as off-topic; closed 3 years ago.] I would like to calculate the entropy of the example scheme at http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html [the entropy equation is shown as an image in the original post]. So the entropy for this scheme is: for the first cluster, -( (5/6)*Log(5/6) + (1/6)*Log(1/6) ); for the second cluster, -(
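A small sketch of the computation in Python. Cluster 1 matches the question's (5/6, 1/6) split; the counts used for the other two clusters are assumptions standing in for the figure in the linked chapter:

    import math

    def cluster_entropy(class_counts):
        # H(cluster) = -sum_j p_j * log2(p_j), over the class proportions within one cluster
        n = sum(class_counts)
        return -sum((c / n) * math.log2(c / n) for c in class_counts if c > 0)

    clusters = [[5, 1], [1, 4, 1], [2, 3]]    # class counts per cluster (last two assumed)
    total = sum(sum(c) for c in clusters)

    # total entropy = size-weighted average of the per-cluster entropies
    H = sum((sum(c) / total) * cluster_entropy(c) for c in clusters)
    print([round(cluster_entropy(c), 3) for c in clusters], round(H, 3))

Whether to weight by cluster size and which log base to use are the two details that most often make hand calculations disagree.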

How to cluster syllable types with Python?

Submitted by 痞子三分冷 on 2019-12-12 02:47:28
Question: This is my second question on Stack Overflow. I don't have too much experience with Python, but I had excellent results with my first question and was able to implement the code from the answer, so I will try again with this new problem: I am trying to classify syllable types from a canary song, in order to use each type as a template to find and classify large sets of data with similar behaviour. I use the envelope of the singing. My data is a sampled array, with time and amplitude (a plot of
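The question is cut off here, but the usual pipeline for this kind of problem is: segment the envelope into syllables, summarise each syllable with a few numeric features, then cluster the feature vectors. A rough sketch only, with made-up names and thresholds (segment_syllables and syllable_features are illustrative helpers, not library functions):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    def segment_syllables(amp, threshold):
        # (start, end) index pairs where the envelope rises above, then falls below, the threshold
        above = amp > threshold
        edges = np.flatnonzero(np.diff(above.astype(int)))
        return list(zip(edges[::2] + 1, edges[1::2] + 1))

    def syllable_features(t, amp, segments):
        # toy feature set: duration and peak amplitude of each syllable
        return np.array([[t[e - 1] - t[s], amp[s:e].max()] for s, e in segments])

    # t, amp = your time and amplitude arrays
    # feats = syllable_features(t, amp, segment_syllables(amp, threshold=0.1))
    # labels = fcluster(linkage(feats, method="ward"), t=4, criterion="maxclust")

Richer features (spectral shape, pitch contour) generally separate syllable types better than duration and amplitude alone, but the clustering step stays the same.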

Cutting dendrogram at highest level of purity

Submitted by 坚强是说给别人听的谎言 on 2019-12-11 23:34:03
Question: I am trying to create a program that clusters documents using hierarchical agglomerative clustering, and the output of the program depends on cutting the dendrogram at the level that gives maximum purity. The following is the algorithm I am working on right now:

    create dendrogram for the documents in the dataset
    purity = 0
    final_clusters
    for all the levels, lvl, in the dendrogram
        clusters = cut dendrogram at lvl
        new_purity = calculate_purity_of(clusters)
        if new_purity > purity
            purity = new
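A sketch of that loop with scipy, assuming the true document labels are available to score purity against (the function names here are illustrative, not from a library):

    from collections import Counter
    from scipy.cluster.hierarchy import linkage, fcluster

    def purity(labels_true, labels_pred):
        # fraction of documents that fall in their cluster's majority class
        hits = 0
        for c in set(labels_pred):
            members = [t for t, p in zip(labels_true, labels_pred) if p == c]
            hits += Counter(members).most_common(1)[0][1]
        return hits / len(labels_true)

    def best_cut(X, labels_true, method="average"):
        Z = linkage(X, method=method)
        best_purity, best_clusters = 0.0, None
        for k in range(1, len(X) + 1):                  # one flat clustering per level
            pred = fcluster(Z, t=k, criterion="maxclust")
            p = purity(labels_true, pred)
            if p > best_purity:
                best_purity, best_clusters = p, pred
        return best_purity, best_clusters

One caveat: purity only increases as clusters get smaller and reaches 1.0 for singleton clusters, so a raw maximum favours the bottom of the dendrogram unless you cap the number of clusters or penalise it.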

R how to select several rows to make a new dataframe

Submitted by ♀尐吖头ヾ on 2019-12-11 16:41:31
Question: I have a dataframe of more than 5000 observations. In my attempt at analysing my data using hierarchical clustering, I have 8 clusters, where some clusters contain a few thousand or a few hundred observations.

    # Cut tree into 8 groups
    cutree_hclust <- cutree(hclust.unsupervised, k = 8)
    # Number of members in each cluster
    table(cutree_hclust)
    cutree_hclust
      1   2   3   4   5   6   7   8
    486  61  14   3  15   2   9   5

To get a view of what variable combination there is for each observation in the different clusters, I thought
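The question breaks off here, but the usual move is to attach the cluster labels as a new column and then subset by label. A hedged analogue in Python/pandas with toy stand-in data (the same idea works in R by binding the cutree output onto the data frame and splitting on it):

    import pandas as pd

    # toy stand-ins: df is the original data frame, labels the cutree output (one per row)
    df = pd.DataFrame({"mpg": [21.0, 22.8, 18.7, 14.3], "hp": [110, 93, 175, 245]})
    labels = [1, 1, 2, 3]

    df = df.assign(cluster=labels)
    print(df["cluster"].value_counts().sort_index())          # analogue of table(cutree_hclust)
    cluster_1 = df[df["cluster"] == 1]                         # new data frame, cluster 1 rows only
    per_cluster = {k: g for k, g in df.groupby("cluster")}     # one data frame per cluster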

Python alternate way to find dendrogram

Submitted by 让人想犯罪 __ on 2019-12-11 14:11:20
Question: I have data of dimension 8000x100. I need to cluster these 8000 items; I am more interested in the ordering of the items. I could get the desired result from this code for small data, but for higher dimensions I keep getting the runtime error "RuntimeError: maximum recursion depth exceeded while getting the str of an object". Is there an alternate way to get the reordered columns from "Z"?

    from hcluster import pdist, linkage, dendrogram
    import numpy
    from numpy.random import rand
    x = rand
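The recursion comes from dendrogram's recursive tree traversal on a deep tree. Two hedged workarounds, written against scipy (whose pdist/linkage/dendrogram mirror the hcluster functions used above): raise the recursion limit, or skip the dendrogram call entirely and ask for the leaf order directly.

    import sys
    import numpy as np
    from scipy.cluster.hierarchy import linkage, dendrogram, leaves_list
    from scipy.spatial.distance import pdist

    x = np.random.rand(8000, 100)
    Z = linkage(pdist(x), method="average")

    order = leaves_list(Z)                  # reordered item indices, no plotting, no deep recursion
    # or, if you specifically want the dendrogram structure without drawing it:
    # sys.setrecursionlimit(20000)
    # order = dendrogram(Z, no_plot=True)["leaves"]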

Basic clustering with R

Submitted by 若如初见. on 2019-12-11 11:16:36
Question: I'm new to R and data analysis. I'm trying to create a simple custom recommendation system for a web site. As input information I have user/session-id, item-id and item-price, which users clicked on:

    c165c2ee-81cf-48cf-ba3f-83b70204c00c 161785  124.0
    a886fdd5-7cee-4152-b1b7-77a2702687b0 643339   42.0
    5e5fd670-b104-445b-a36d-b3798cd43279 131332   38.0
    888d736f-99bc-49ca-969d-057e7d4bb8d1 1032763  39.0

I would like to apply cluster analysis to that data. If I try to apply k-means clustering to my data:
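The snippet stops before the k-means call, but the step that usually trips people up is that k-means needs one numeric row per user, not one row per click. A sketch of that reshaping in Python/pandas terms (an analogue of the same reshaping you would do in R), with the ids shortened as toy data:

    import pandas as pd
    from sklearn.cluster import KMeans

    clicks = pd.DataFrame({
        "session": ["c165c2ee", "a886fdd5", "5e5fd670", "888d736f"],   # truncated toy ids
        "item":    [161785, 643339, 131332, 1032763],
        "price":   [124.0, 42.0, 38.0, 39.0],
    })

    # one row per user/session, one column per item, price (or a click count) as the value
    user_item = clicks.pivot_table(index="session", columns="item",
                                   values="price", fill_value=0)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(user_item)

Feeding the raw id/price triples into k-means directly would treat the ids as numeric quantities, which is why the pivot step matters.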

Cut-off point in k-means clustering in SAS

Submitted by 好久不见. on 2019-12-11 08:57:51
Question: I want to classify my data into clusters with a cut-off point in SAS. The method I use is k-means clustering (I don't mind about the method, as long as it gives me 3 groups). My code for clustering:

    proc fastclus data=maindat outseed=seeds1 maxcluster=3 maxiter=0;
      var value resid;
    run;

I have a problem with the output. I want the cut-off point for Value to be included in the output file (I don't want the cut-off point for Resid). Is there any way to do this in SAS? Edit: As
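One way to get a cut-off on Value is to derive it from the cluster assignments after clustering. A hedged sketch of that derivation in Python (the same arithmetic could be done in a SAS data step): order the clusters by mean Value and take the midpoint between one cluster's maximum and the next cluster's minimum.

    import numpy as np

    def value_cutoffs(values, labels):
        values, labels = np.asarray(values), np.asarray(labels)
        # order the clusters by their mean Value so that "adjacent" is well defined
        clusters = sorted(set(labels.tolist()), key=lambda c: values[labels == c].mean())
        return [(values[labels == lo].max() + values[labels == hi].min()) / 2
                for lo, hi in zip(clusters, clusters[1:])]

    # e.g. 3 clusters along a single variable
    print(value_cutoffs([1, 2, 3, 10, 11, 12, 50, 55], [1, 1, 1, 2, 2, 2, 3, 3]))   # [6.5, 31.0]

Note this only yields a clean cut-off when the clusters do not overlap along Value; since the clustering here also uses Resid, the clusters may interleave on Value alone.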

how to cluster users based on tags

Submitted by 穿精又带淫゛_ on 2019-12-11 04:11:51
Question: I'd like to cluster users based on the categories or tags of the shows they watch. What's the easiest/best algorithm to do this? Assuming I have around 20,000 tags and several million watch events I can use as signals, is there an algorithm I can implement using, say, Pig/Hadoop/Mortar, or perhaps on Neo4j? In terms of data I have users, the programs they've watched, and the tags that a program has (usually around 10 tags per program). At the end I would expect k clusters (maybe a
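The question trails off, but one common route at this scale (a sketch, not the only answer, and not tied to Pig/Mortar): encode each user as a sparse vector of tag counts across everything they watched, then run a scalable k-means variant. Assuming scipy and scikit-learn, with a toy event list:

    import numpy as np
    from scipy.sparse import csr_matrix
    from sklearn.cluster import MiniBatchKMeans

    # toy event list: (user_index, tag_index, count); the real one would have millions of rows
    events = [(0, 3, 1.0), (0, 7, 2.0), (1, 3, 1.0), (2, 5, 4.0)]
    rows, cols, vals = zip(*events)
    X = csr_matrix((vals, (rows, cols)), shape=(3, 20000))   # n_users x n_tags, stays sparse

    km = MiniBatchKMeans(n_clusters=2, batch_size=1024, n_init=3, random_state=0)
    labels = km.fit_predict(X)
    print(labels)

The same user x tag matrix is also what a distributed k-means job (e.g. Mahout or Spark MLlib on Hadoop) would consume; in Neo4j the problem is more naturally phrased as community detection over a user-tag graph.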

R - Phylogram labels to vector

Submitted by 南笙酒味 on 2019-12-11 03:37:17
Question: If we plot a phylogram from hierarchical clustering using the ape package

    phy <- hclust(dist(mtcars))
    plot(as.phylo(phy), direction="downwards")

is there a way to extract the labels into a vector in the same order they appear in the phylogram (read from left to right)? If I try phy$labels I can get the labels out, but they appear to be in a different order. Answer 1: Using the additional order component, you can get them in the proper ordering:

    with(phy, labels[order])
    # [1] "Maserati Bora" "Chrysler