K-means with really large matrix

Posted by 落爺英雄遲暮 on 2019-12-18 15:48:33

Question


I have to run k-means clustering on a really huge matrix (about 300,000 x 100,000 values, which is more than 100 GB). I want to know whether I can do this with R or Weka. My computer is a multiprocessor machine with 8 GB of RAM and hundreds of GB of free disk space.

I have enough disk space for the calculations, but loading such a matrix seems to be a problem in R. I don't think the bigmemory package would help: a big matrix automatically uses all my RAM, and then my swap file if that is not enough.

So my question is: what software should I use (possibly in combination with other packages or custom settings)?

Thanks for helping me.

Note: I use Linux.


Answer 1:


Does it have to be k-means? Another possible approach is to transform your data into a network first, then apply graph clustering. I am the author of MCL, an algorithm used quite often in bioinformatics. The implementation linked to should easily scale up to networks with millions of nodes; your example would have 300K nodes, assuming that you have 100K attributes. With this approach, the data will be naturally pruned in the data transformation step, and that step will quite likely become the bottleneck. How do you compute the distance between two vectors? In the applications that I have dealt with I used the Pearson or Spearman correlation, and MCL is shipped with software to perform this computation efficiently on large-scale data (it can utilise multiple CPUs and multiple machines).

There is still an issue with the data size, as most clustering algorithms require all pairwise comparisons to be performed at least once. Is your data really stored as a giant matrix? Do you have many zeros in the input? Alternatively, do you have a way of discarding smaller elements? Do you have access to more than one machine in order to distribute these computations?
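
To make the transformation step more concrete, here is a minimal sketch in Python with NumPy of turning a matrix into a pruned network. It is not the software shipped with MCL; the function names `standardize_rows` and `correlation_edges`, the threshold of 0.7 and the block size are assumptions chosen for illustration. It computes Pearson correlations one tile at a time, so the full 300K x 300K correlation matrix is never held in memory, and it keeps only the pairs above the threshold.

```python
import numpy as np

def standardize_rows(X):
    """Center each row and scale it to unit length, so that the dot
    product of two rows equals their Pearson correlation."""
    X = X - X.mean(axis=1, keepdims=True)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # constant rows end up with correlation 0
    return X / norms

def correlation_edges(X, threshold=0.7, block=1000):
    """Yield (i, j, r) for every row pair i < j with Pearson r >= threshold.

    Only `block` rows of X and one block x block tile of correlations
    are materialised at a time, so X may be a memory-mapped array.
    """
    n = X.shape[0]
    for a in range(0, n, block):
        Za = standardize_rows(np.asarray(X[a:a + block], dtype=np.float64))
        for b in range(a, n, block):
            if b == a:
                Zb = Za
            else:
                Zb = standardize_rows(np.asarray(X[b:b + block], dtype=np.float64))
            tile = Za @ Zb.T  # correlations between the two row blocks
            rows, cols = np.nonzero(tile >= threshold)
            for r, c in zip(rows, cols):
                i, j = a + int(r), b + int(c)
                if i < j:  # skip self-pairs and mirrored duplicates
                    yield i, j, float(tile[r, c])
```

The resulting edges could be written out one per line as a three-column edge list (node, node, weight) and passed to a graph clustering tool. The number of tiles still grows quadratically with the number of rows, which is why this step is the likely bottleneck and why distributing tiles over several CPUs or machines helps.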




Answer 2:


I am keeping the link (it may be useful to this specific user), but I agree with Gavin's comment! To perform k-means clustering on big data you can use the rxKmeans function from Revolution R Enterprise, a proprietary implementation of R (I know this can be a problem); this function seems to be capable of handling that kind of data.
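
For readers without access to a proprietary tool, the general idea of k-means on data that does not fit in RAM can be sketched as follows. This is not how rxKmeans is implemented; it is plain Lloyd's k-means in Python with NumPy, reading the matrix a chunk of rows at a time. The function name `kmeans_out_of_core` and its parameters are hypothetical, chosen for illustration.

```python
import numpy as np

def kmeans_out_of_core(X, k, n_iter=10, chunk=10_000, seed=0):
    """Lloyd's k-means that reads X `chunk` rows at a time.

    X only needs to support row slicing, so it can be a np.memmap
    backed by a file larger than RAM. Returns the (k, d) centroids.
    """
    n, d = X.shape
    rng = np.random.default_rng(seed)
    # Initialise the centroids from k distinct, randomly chosen rows.
    idx = np.sort(rng.choice(n, size=k, replace=False))
    centroids = np.asarray(X[idx], dtype=np.float64)
    for _ in range(n_iter):
        sums = np.zeros((k, d))
        counts = np.zeros(k, dtype=np.int64)
        for start in range(0, n, chunk):
            block = np.asarray(X[start:start + chunk], dtype=np.float64)
            # Squared distance up to a per-row constant:
            # ||c||^2 - 2 x.c (the ||x||^2 term does not affect argmin).
            d2 = (centroids ** 2).sum(axis=1) - 2.0 * (block @ centroids.T)
            labels = d2.argmin(axis=1)
            # Accumulate per-cluster sums and counts for this chunk.
            for c in range(k):
                members = block[labels == c]
                sums[c] += members.sum(axis=0)
                counts[c] += members.shape[0]
        nonempty = counts > 0  # empty clusters keep their old centroid
        centroids[nonempty] = sums[nonempty] / counts[nonempty, None]
    return centroids
```

Assuming the matrix has been written to disk as raw 32-bit floats in a file such as `matrix.f32` (a hypothetical name), it could be opened with `np.memmap("matrix.f32", dtype=np.float32, mode="r", shape=(300_000, 100_000))` and passed in as `X`. Each iteration is then one sequential pass over the file, so disk throughput rather than RAM becomes the limiting factor.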




Answer 3:


Since we know nothing at all about the data, nor the questioner's goals for it, just a couple of general links:
I. Guyon's video lectures — many papers and books too.
feature selection on stats.stackexchange




Answer 4:


Check out Mahout; it will do k-means on a large data set:

http://mahout.apache.org/



Source: https://stackoverflow.com/questions/6372397/k-means-with-really-large-matrix
