R, issue with a Hierarchical clustering after a Multiple correspondence analysis

Posted by 魔方 西西 on 2019-12-21 15:45:35

Question


I want to cluster a dataset (600000 observations), and for each cluster I want to get the principal components. Each observation consists of one email and 30 qualitative variables. Each qualitative variable has 4 classes: 0, 1, 2 and 3.

First, I load the FactoMineR library and my data:

library(FactoMineR)
mydata = read.csv("/home/tom/Desktop/ACM/acm.csv")

Then I convert my variables to factors so they are treated as qualitative (the 'email' column is converted too, but I drop it in the next step):

for(n in 1:length(mydata)){mydata[[n]] <- factor(mydata[[n]])}

I remove the emails by keeping only columns 2 to 31:

mydata2 = mydata[2:31]

And I run an MCA on this new dataset:

mca.res <- MCA(mydata2)

I now want to cluster my dataset using the HCPC function:

res.hcpc <- HCPC(mca.res)

But I got the following error message:

Error: cannot allocate vector of size 1296.0 Gb

What do you think I should do? Is my dataset too large? Am I using the HCPC function correctly?


Answer 1:


Since it uses hierarchical clustering, HCPC needs to compute the lower triangle of a 600000 x 600000 distance matrix (~ 180 billion elements). You simply don't have the RAM to store this object and even if you did, the computation would likely take hours if not days to complete.
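The size of that object can be checked with a little arithmetic. The sketch below assumes the distances are stored as 8-byte doubles; that is how R stores numeric vectors, but the exact allocation HCPC requests may differ slightly, so the result only needs to be of the same order as the figure in the error message.

```r
# Number of pairwise distances in the lower triangle of an n x n matrix
n <- 600000
n_pairs <- n * (n - 1) / 2

# Memory needed if each distance is an 8-byte double (assumption)
bytes <- n_pairs * 8
gib <- bytes / 1024^3

cat("pairs:", format(n_pairs, big.mark = ","), "\n")
cat("approx. size in GiB:", round(gib), "\n")
```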

There have been various discussions on Stack Overflow/Cross Validated on clustering large datasets; some with solutions in R include:

k-means clustering in R on very large, sparse matrix? (bigkmeans)

Cluster Big Data in R and Is Sampling Relevant? (clara)

If you want to use one of these alternative clustering approaches, you would apply it to mca.res$ind$coord in your example.
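As a rough illustration of applying such an alternative, here is a minimal sketch using clara from the cluster package. The random matrix is only a stand-in for mca.res$ind$coord, and the number of clusters (k = 4) and the sampling settings are arbitrary assumptions that would need tuning on real data.

```r
library(cluster)

set.seed(42)
# Stand-in for mca.res$ind$coord: 10000 individuals on 5 MCA dimensions
coords <- matrix(rnorm(10000 * 5), ncol = 5)

# clara runs PAM on repeated subsamples, so it never builds the
# full n x n distance matrix
cl <- clara(coords, k = 4, samples = 50, sampsize = 200, rngR = TRUE)

# Cluster sizes; cl$clustering holds one label per individual
table(cl$clustering)
```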

Another idea, suggested in response to the problem clustering very large dataset in R, is to first use k means to find a certain number of cluster centres and then use hierarchical clustering to build the tree from there. This method is actually implemented via the kk argument of HCPC.

For example, using the tea data set from FactoMineR:

library(FactoMineR)
data(tea)
## run MCA as in ?MCA
res.mca <- MCA(tea, quanti.sup = 19, quali.sup = c(20:36), graph = FALSE)
## run HCPC for all 300 individuals
hc <- HCPC(res.mca, kk = Inf, consol = FALSE)
## run HCPC from 30 k means centres
res.consol <- NULL ## bug work-around
hc2 <- HCPC(res.mca, kk = 30, consol = FALSE)

The consol argument offers the option to consolidate the clusters from the hierarchical clustering using k-means; this option is not available when kk is set to a finite number, hence consol is set to FALSE here. The object res.consol is set to NULL to work around a minor bug in FactoMineR 1.27.

The following plot shows the clusters based on the 300 individuals (kk = Inf) and based on the 30 k-means centres (kk = 30) for the data plotted on the first two MCA axes:

It can be seen that the results are very similar. You should easily be able to apply this to your data with 600 or 1000 k means centres, perhaps up to 6000 with 8GB RAM. If you wanted to use a larger number, you'd probably want to code a more efficient version using bigkmeans, SpatialTools::dist1 and fastcluster::hclust.
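The two-stage idea (k-means centres first, then a tree built on those centres) can be sketched by hand in base R. This is not the code HCPC runs internally, and it uses stats::kmeans, dist and hclust rather than the faster bigkmeans, SpatialTools::dist1 and fastcluster::hclust mentioned above. The random matrix is a stand-in for mca.res$ind$coord, and the 50 centres and 3 final clusters are arbitrary assumptions.

```r
set.seed(1)
# Stand-in for mca.res$ind$coord
coords <- matrix(rnorm(2000 * 5), ncol = 5)

# Stage 1: reduce the individuals to a manageable number of centres
km <- kmeans(coords, centers = 50, iter.max = 50, nstart = 5)

# Stage 2: hierarchical clustering on the centres only;
# 'members' tells hclust how many individuals each centre represents
tree <- hclust(dist(km$centers), method = "ward.D2", members = km$size)

# Cut the tree, then map each individual to the cluster of its centre
centre_cluster <- cutree(tree, k = 3)
obs_cluster <- unname(centre_cluster[km$cluster])

table(obs_cluster)
```

Only the 50 x 50 distance matrix between centres is ever stored, which is what keeps the memory footprint small.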




Answer 2:


That error message usually indicates that R does not have enough RAM at its disposal to complete the command. I guess you are running this in 32-bit R, possibly under Windows? If so, killing other processes and deleting unused R variables might help: for example, you might try to delete mydata and mydata2 with

rm(mydata, mydata2) 

(as well as all other unnecessary R variables) before executing the command that generates the error. However, the ultimate solution in general is to switch to 64-bit R, preferably under 64-bit Linux and with a decent amount of RAM. See also:

R memory management / cannot allocate vector of size n Mb

R Memory Allocation "Error: cannot allocate vector of size 75.1 Mb"

http://r.789695.n4.nabble.com/Error-cannot-allocate-vector-of-size-td3629384.html
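To check which build of R is running and how much memory the session currently uses, something like the following minimal sketch can help. It relies only on base R; the variable names is_64bit and mem are illustrative.

```r
# A pointer size of 8 bytes means a 64-bit build of R
is_64bit <- .Machine$sizeof.pointer == 8
cat("R architecture:", R.version$arch, "| 64-bit build:", is_64bit, "\n")

# gc() triggers a garbage collection and reports memory in use
mem <- gc()
print(mem)
```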



Source: https://stackoverflow.com/questions/27269555/r-issue-with-a-hierarchical-clustering-after-a-multiple-correspondence-analysis
