Question
I have a distance/dissimilarity matrix (30K rows × 30K columns) that is calculated in a loop and stored on disk.
I would like to do clustering over the matrix. I import and cluster it as below:
Mydata <- read.csv("Mydata.csv")
Mydata <- as.dist(Mydata)
Results <- hclust(Mydata)
But when I convert the matrix to a dist object, I get a RAM limitation error. How can I handle this? Can I run the hclust algorithm in a loop, on chunks? I mean, can I divide the distance matrix into chunks and run them in a loop?
Answer 1:
You may try the following:
Mydata <- read.csv("Mydata.csv")  # read.csv() returns a data frame
Mydata <- as.matrix(Mydata)       # coerce to a numeric matrix before building the dist object
Mydata <- as.dist(Mydata)         # as.dist() keeps only the lower triangle
Results <- hclust(Mydata)         # hierarchical clustering on the dist object
Read the following to track what is happening with memory in your R session: http://adv-r.had.co.nz/memory.html
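As a rough back-of-the-envelope check (the arithmetic here is mine, not from the original answer): a 30,000 × 30,000 matrix of doubles needs 30000^2 × 8 bytes ≈ 6.7 GiB, and the dist object (lower triangle only) still needs about half of that, before counting the temporary copies that as.matrix()/as.dist() make. Base R's object.size() lets you verify this on a small example:
m <- matrix(runif(1000 * 1000), nrow = 1000)  # small stand-in; as.dist() uses the lower triangle
d <- as.dist(m)                               # lower triangle only, roughly half the memory
format(object.size(m), units = "MB")          # ~7.6 MB for the full 1000 x 1000 matrix
format(object.size(d), units = "MB")          # ~3.8 MB for the dist object
30000^2 * 8 / 1024^3                          # ~6.7 GiB for the full-size matrix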
This might be helpful in general: https://cran.r-project.org/web/packages/fastcluster/. See also this related question: hclust() in R on large datasets.
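A minimal sketch of the fastcluster route, assuming the package is installed; fastcluster::hclust is a drop-in replacement for stats::hclust, and hclust.vector() can skip the distance matrix entirely if you still have the raw observations (the matrix X below is an assumption, not something from the question):
library(fastcluster)  # install.packages("fastcluster") first; loading it masks stats::hclust

Results <- fastcluster::hclust(Mydata, method = "complete")  # same interface as stats::hclust

# If the raw observations are available as a numeric matrix X (hypothetical),
# hclust.vector() computes distances on the fly and never stores all 30K x 30K of them
# (supported methods: "single", "ward", "centroid", "median"):
# Results <- fastcluster::hclust.vector(X, method = "ward")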
It also depends on your OS, but you may be able to raise the RAM limit (or simply run this code on someone else's computer with more RAM, store the result with saveRDS, and then read it back on your own computer with readRDS).
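A minimal sketch of that save-and-transfer idea (file names are placeholders):
# On the machine with enough RAM:
saveRDS(Results, "Results.rds")    # serialize the fitted hclust object to disk

# Back on your own machine:
Results <- readRDS("Results.rds")  # the tree itself is tiny compared to the dist matrix
plot(Results)                      # plotting/cutting the dendrogram needs little RAM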
Source: https://stackoverflow.com/questions/53032431/is-it-possible-to-run-a-clustering-algorithm-with-chunked-distance-matrices