Out of memory error while using clusterdata in MATLAB

As mentioned by others, hierarchical clustering needs to calculate the pairwise distance matrix, which in your case is too big to fit in memory.

Try using the K-Means algorithm instead:

numClusters = 4;
T = kmeans(X, numClusters);
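
kmeans picks its initial centroids at random, so a single run can converge to a poor local minimum. Here is a minimal sketch of restarting it a few times and keeping the best run (it assumes the standard 'Replicates' and 'EmptyAction' options of kmeans; the values chosen are only illustrative):

%# run k-means several times from different random initializations and
%# keep the solution with the lowest total within-cluster distance
[T, centers] = kmeans(X, numClusters, ...
    'Replicates',5, ...             %# number of random restarts
    'EmptyAction','singleton');     %# re-seed any cluster that becomes empty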

Alternatively, you can select a random subset of your data and use it as input to the clustering algorithm. Next, you compute the cluster centers as the mean/median of each cluster group. Finally, for each instance that was not selected in the subset, you simply compute its distance to each of the centroids and assign it to the closest one.

Here's some sample code to illustrate the idea above:

%# random data
X = rand(25000, 2);

%# pick a subset
SUBSET_SIZE = 1000;            %# subset size
ind = randperm(size(X,1));
data = X(ind(1:SUBSET_SIZE), :);

%# cluster the subset data
D = pdist(data, 'euclidean');
T = linkage(D, 'ward');
CUTOFF = 0.6*max(T(:,3));      %# CUTOFF = 5;
C = cluster(T, 'criterion','distance', 'cutoff',CUTOFF);
K = length( unique(C) );       %# number of clusters found

%# visualize the hierarchy of clusters
figure(1)
h = dendrogram(T, 0, 'colorthreshold',CUTOFF);
set(h, 'LineWidth',2)
set(gca, 'XTickLabel',[], 'XTick',[])

%# plot the subset data colored by clusters
figure(2)
subplot(121), gscatter(data(:,1), data(:,2), C), axis tight

%# compute cluster centers
centers = zeros(K, size(data,2));
for i=1:size(data,2)
    centers(:,i) = accumarray(C, data(:,i), [], @mean);
end

%# squared Euclidean distance of each instance to all cluster centers
%# (the square root is not needed to find the nearest center)
D = zeros(size(X,1), K);
for k=1:K
    D(:,k) = sum( bsxfun(@minus, X, centers(k,:)).^2, 2);
end
%# assign each instance to the closest cluster
[~,clustIDX] = min(D, [], 2);

%# optionally keep the hierarchical cluster assignments for the subset points
%#clustIDX( ind(1:SUBSET_SIZE) ) = C;

%# plot the entire data colored by clusters
subplot(122), gscatter(X(:,1), X(:,2), clustIDX), axis tight
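
If the Statistics Toolbox function pdist2 is available, the explicit distance loop above can be replaced by a single call (a sketch; note that pdist2 returns actual Euclidean distances rather than the squared distances computed in the loop, but the nearest centroid comes out the same either way):

%# distance from every instance to every cluster center in one call
D2 = pdist2(X, centers);            %# N-by-K matrix of Euclidean distances
[~, clustIDX2] = min(D2, [], 2);    %# assign each instance to nearest center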

X is too big to cluster directly on a 32-bit machine. pdist is trying to build a 201,131,596-element row vector of doubles (clusterdata calls pdist internally), which takes about 1609 MB (a double is 8 bytes). Even if you run it under Windows with the /3GB switch, you're limited to a maximum matrix size of 1536 MB (see here).

You're going to need to divide up the data some way instead of clustering all of it in one go.

PDIST calculates the distance between every possible pair of rows. If your data contains N = 20057 rows, then the number of pairs is N*(N-1)/2, which is 201,131,596 in your case. That is likely too much for your machine to hold in memory.
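
To see why, a quick back-of-the-envelope check of the memory pdist would need (using the numbers from your data; the variable names are just for illustration):

N = 20057;                     %# number of observations in X
numPairs = N*(N-1)/2;          %# = 201,131,596 pairwise distances
memMB = numPairs * 8 / 1e6;    %# ~1609 MB for a vector of doubles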
