hierarchical-clustering

Cutting dendrogram into n trees with minimum cluster size in R

Question: I'm trying to use hierarchical clustering (specifically hclust) to cluster a data set into 10 groups with sizes of 100 members or fewer, and with no group having more than 40% of the total population. The only method I currently know is to repeatedly use cut() and select continually lower levels of h until I'm happy with the dispersion of the cuts. However, this forces me to then go back and re-cluster the groups I pruned to aggregate them into 100-member groups, which can be very time
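One way to avoid repeated manual cuts is to walk the tree from the root and stop descending as soon as a subtree is small enough. The following is only a rough sketch of that idea, written in Python with SciPy rather than R's hclust; the function name `cut_by_max_size` and the toy data are assumptions made for illustration. It enforces only the maximum cluster size, not the target of exactly 10 groups or the 40% cap.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, to_tree


def cut_by_max_size(Z, max_size):
    """Split a linkage tree top-down until every cluster has <= max_size leaves.

    Hypothetical helper: Z is a SciPy linkage matrix and max_size must be >= 1.
    Returns a list of clusters, each a sorted list of original observation ids.
    """
    clusters = []
    stack = [to_tree(Z)]  # start at the root node of the dendrogram
    while stack:
        node = stack.pop()
        if node.get_count() <= max_size:
            # Small enough: keep the whole subtree as one cluster.
            clusters.append(sorted(node.pre_order()))
        else:
            # Too large: descend into both children.
            stack.append(node.get_left())
            stack.append(node.get_right())
    return clusters


# Toy data (an assumption): 12 points on a line, clustered with average linkage.
points = np.arange(12, dtype=float).reshape(-1, 1)
Z = linkage(points, method="average")
clusters = cut_by_max_size(Z, max_size=4)
print([len(c) for c in clusters])
```

Because each split follows the existing merge order, every resulting group is a genuine subtree of the dendrogram. Very small groups may still need to be merged afterwards.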

Time series distance metric

Question: In order to cluster a set of time series I'm looking for a smart distance metric. I've tried several well-known metrics, but none fits my case. Example: let's assume that my clustering algorithm extracts these three centroids [s1, s2, s3]: I want to put this new example [sx] in the most similar cluster: The most similar centroid is the second one, so I need to find a distance function d that gives me d(sx, s2) < d(sx, s1) and d(sx, s2) < d(sx, s3). Edit: here are the results with metrics [cosine,
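The requirement on d can be stated as a nearest-centroid rule: sx belongs to the centroid that minimizes d. The sketch below only shows that rule with a pluggable distance function. The series values are made up, and Euclidean distance is just a placeholder assumption, not a recommendation for the data in the question.

```python
import math


def euclidean(a, b):
    """Placeholder distance (an assumption); swap in any d(a, b) to compare metrics."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_centroid(sx, centroids, d=euclidean):
    """Return the index of the centroid with the smallest distance to sx."""
    return min(range(len(centroids)), key=lambda i: d(sx, centroids[i]))


# Made-up series standing in for the centroids and the new example.
s1 = [0.0, 0.0, 0.0, 0.0]
s2 = [0.0, 1.0, 2.0, 1.0]
s3 = [5.0, 5.0, 5.0, 5.0]
sx = [0.0, 1.0, 2.0, 2.0]

print(nearest_centroid(sx, [s1, s2, s3]))  # index 1, i.e. s2
```

Passing a different function as `d` makes it easy to check, metric by metric, whether the inequalities d(sx, s2) < d(sx, s1) and d(sx, s2) < d(sx, s3) hold.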

How can I cluster thousands of documents using the R tm package?

I have about 25000 documents which need to be clustered and I was hoping to be able to use the R tm package. Unfortunately I am running out of memory with about 20000 documents. The following function shows what I am trying to do using dummy data. I run out of memory when I call the function with n = 20 on a Windows machine with 16GB of RAM. Are there any optimizations I could make? Thank you for any help. make_clusters <- function(n) { require(tm) require(slam) docs <- unlist(lapply(letters[1:n],function(x) rep(x,1000))) tdf <- TermDocumentMatrix(Corpus(VectorSource(docs)),control=list

How to get centroids from SciPy's hierarchical agglomerative clustering?

Question: I am using SciPy's hierarchical agglomerative clustering methods to cluster an m x n matrix of features, but after the clustering is complete, I can't seem to figure out how to get the centroid from the resulting clusters. Below follows my code: Y = distance.pdist(features) Z = hierarchy.linkage(Y, method = "average", metric = "euclidean") T = hierarchy.fcluster(Z, 100, criterion = "maxclust") I am taking my matrix of features, computing the euclidean distance between them, and then passing
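Since fcluster returns one flat label per row, one possible approach is to average the rows that share a label. This is a minimal sketch under that assumption: the helper name `cluster_centroids` and the four-point `features` array are invented for illustration, and the mean of the members is only one reasonable definition of a centroid for average linkage.

```python
import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial import distance


def cluster_centroids(features, labels):
    """Map each flat cluster label to the mean of its member rows."""
    return {int(lab): features[labels == lab].mean(axis=0)
            for lab in np.unique(labels)}


# Toy feature matrix (an assumption): two well-separated pairs of points.
features = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]])

Y = distance.pdist(features)                      # condensed Euclidean distances
Z = hierarchy.linkage(Y, method="average")        # average-linkage tree
T = hierarchy.fcluster(Z, 2, criterion="maxclust")  # at most 2 flat clusters

centroids = cluster_centroids(features, T)
for label, center in centroids.items():
    print(label, center)
```

The labels in T line up with the rows of `features`, so the boolean mask `T == lab` selects exactly the members of each cluster.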

How to hierarchically cluster a data matrix in R?

I am trying to cluster a data matrix produced from scientific data. I know how I want the clustering done, but am not sure how to accomplish this feat in R. Here is what the data looks like:

         A1  A2  A3  B1  B2  B3  C1  C2  C3
sample1   1   9  10   2   1  29   2   5  44
sample2   8   1  82   2   8   2   8   2  28
sample3   9   9  19   2   8   1   7   2  27

Please consider A1, A2, A3 to be three replicates of a single treatment, and likewise with B and C. The samples are different tested variables. So, I want to hierarchically cluster this matrix in order to see the overall differences between the columns, specifically I will be making a dendrogram (tree

Hierarchical clusterization heuristics

Question: I want to explore relations between data items in a large array. Every data item is represented by a multidimensional vector. First of all, I've decided to use clustering. I'm interested in finding hierarchical relations between clusters (groups of data vectors). I'm able to calculate the distance between my vectors. So as the first step I'm finding the minimum spanning tree. After that I need to group data vectors according to the links in my spanning tree. But at this step I'm stuck - how to combine

spatial clustering in R (simple example)

Question: I have this simple data.frame lat<-c(1,2,3,10,11,12,20,21,22,23) lon<-c(5,6,7,30,31,32,50,51,52,53) data=data.frame(lat,lon) The idea is to find the spatial clusters based on the distance. First, I plot the map (lon,lat): plot(data$lon,data$lat) so clearly I have three clusters based on the distance between the positions of the points. For this aim, I've tried this code in R: d= as.matrix(dist(cbind(data$lon,data$lat))) # Create distance matrix d=ifelse(d<5,d,0) # keep only distance < 5 d=as.dist(d)
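The "keep only distance < 5" idea can be read as linking every pair of points closer than 5 and taking the connected groups, which is essentially what cutting a single-linkage tree at that height gives. Below is a rough pure-Python sketch of that reading, not the R approach from the question; the function name `threshold_clusters` is an assumption, while the coordinates and the threshold of 5 come from the question.

```python
import math

lat = [1, 2, 3, 10, 11, 12, 20, 21, 22, 23]
lon = [5, 6, 7, 30, 31, 32, 50, 51, 52, 53]


def threshold_clusters(points, max_dist):
    """Group points into connected components of the 'distance < max_dist' graph."""
    parent = list(range(len(points)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Link every pair of points that lies closer than the threshold.
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) < max_dist:
                parent[find(j)] = find(i)

    # Relabel roots as 0, 1, 2, ... in order of first appearance.
    seen = {}
    return [seen.setdefault(find(i), len(seen)) for i in range(len(points))]


labels = threshold_clusters(list(zip(lon, lat)), 5)
print(labels)  # [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
```

With this data the three visually obvious groups come out with sizes 3, 3 and 4. The pairwise loop is quadratic, so it is only meant for small examples like this one.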

Extract the hierarchical structure of the nodes in a dendrogram or cluster

I would like to extract the hierarchical structure of the nodes of a dendrogram or cluster. For example: library(dendextend) dend15 <- c(1:5) %>% dist %>% hclust(method = "average") %>% as.dendrogram dend15 %>% plot The nodes are classified according to their position in the dendrogram (see figure below) (Figure extracted from the dendextend package's tutorial) I would like to get all the nodes for each final leaf, as in the following output: (the labels are ordered from left to right and from bottom to top) hierarchical structure leaf_1: 3-2-1 leaf_2: 4-2-1 leaf_3: 6-5-1 leaf_4: 8-7-5

Using iGraph in python for community detection and writing community number for each node to CSV

Question: I have a network that I would like to analyze using the edge_betweenness community detection algorithm in iGraph. I'm familiar with NetworkX, but am trying to learn iGraph because of its additional community detection methods over NetworkX. My ultimate goal is to run edge_betweenness community detection and find the optimal number of communities and write a CSV with community membership for each node in the graph. Below is my code as it currently stands. Any help figuring out community

Plot the cluster member in r

I use the DTW package in R and have finished the hierarchical clustering, but I want to plot each time-series cluster separately, as in the picture below. sc <- read.table("D:/handling data/confirm.csv", header=T, sep="," ) rownames(sc) <- sc$STDR_YM_CD sc$STDR_YM_CD <- NULL col_n <- colnames(sc) hc <- hclust(dist(sc), method="average") plot(hc, main="") How can I do it? My data is at http://blogattach.naver.com/e772fb415a6c6ddafd1370417f96e494346a9725/20170207_141_blogfile/khm2963_1486442387926_THgZRt_csv/confirm.csv?type=attachment You can try this: sc <- read.table("confirm.csv", header=T, sep="," )