I'm using this script to cluster a set of 3D points using the kmeans MATLAB function, but I always get this error: "Empty cluster created at iteration 1". The script I'm usin
It is simply telling you that during the assign-recompute iterations, a cluster became empty (lost all assigned points). This is usually caused by an inadequate cluster initialization, or by the data having fewer inherent clusters than you specified.
Try changing the initialization method using the 'start' option. kmeans provides four possible techniques to initialize clusters: 'sample', 'uniform', 'cluster', or an explicit matrix of starting centroids.
You can also try the different values of the 'emptyaction' option, which tells MATLAB what to do when a cluster becomes empty.
Ultimately, I think you need to reduce the number of clusters, e.g. try K=2 clusters.
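As a rough sketch of the two options mentioned above, the snippet below runs kmeans once per built-in 'start' method with 'emptyaction' set to 'singleton' and prints the resulting cluster sizes. The data X is synthetic and the variable names are made up for illustration; it assumes the Statistics Toolbox kmeans with the option names used in this answer.

```matlab
% Synthetic stand-in data (assumption): two well-separated blobs in 3D.
rng(0);                                  % reproducible random numbers
X = [randn(200,3); randn(200,3) + 10];

K = 2;
starts = {'sample', 'uniform', 'cluster'};
for s = 1:numel(starts)
    % 'singleton' re-seeds an empty cluster instead of raising an error.
    IDX = kmeans(X, K, 'start', starts{s}, 'emptyaction', 'singleton');
    counts = accumarray(IDX, 1, [K 1]);  % number of points per cluster
    fprintf('%-8s -> cluster sizes: %s\n', starts{s}, mat2str(counts'));
end
```

Comparing the sizes across start methods gives a quick hint as to whether the chosen K is reasonable for the data.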
I tried to visualize your data to get a feel for it:
load matlab_X.mat
figure('renderer','zbuffer')
line(XX(:,1), XX(:,2), XX(:,3), ...
'LineStyle','none', 'Marker','.', 'MarkerSize',1)
axis vis3d; view(3); grid on
After some manual zooming/panning, it looks like a silhouette of a person:
You can see that the data of 307200 points is really dense and compact, which confirms what I suspected: the data doesn't have that many clusters.
Here is the code I tried:
>> [IDX,C] = kmeans(XX, 3, 'start','uniform', 'emptyaction','singleton');
>> tabulate(IDX)
Value Count Percent
1 18023 5.87%
2 264690 86.16%
3 24487 7.97%
What's more, all the points in cluster 2 are duplicates of the same point ([0 0 0]):
>> unique(XX(IDX==2,:),'rows')
ans =
0 0 0
The other two clusters look like:
clr = lines(max(IDX));
for i=1:max(IDX)
line(XX(IDX==i,1), XX(IDX==i,2), XX(IDX==i,3), ...
'Color',clr(i,:), 'LineStyle','none', 'Marker','.', 'MarkerSize',1)
end
So you might get better clusters if you remove duplicate points first...
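To check how heavily a data set is dominated by repeated points before clustering, something along these lines may help. It is only a sketch on a toy matrix P (a made-up stand-in for XX), using base MATLAB functions.

```matlab
% Toy stand-in for XX (assumption): the all-zero row is repeated three times.
P = [0 0 0; 1 2 3; 0 0 0; 4 5 6; 0 0 0];

% U holds the distinct rows; ic maps each original row to its row in U.
[U, ~, ic] = unique(P, 'rows');
counts = accumarray(ic(:), 1);       % how often each distinct row occurs

[maxCount, k] = max(counts);
fprintf('Most repeated point: [%s], %d copies\n', num2str(U(k,:)), maxCount);
```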
In addition, you have a few outliers that might affect the result of clustering. Visually, I narrowed down the range of the data to the following intervals, which encompass most of the data:
>> xlim([-500 100])
>> ylim([-500 100])
>> zlim([900 1500])
Here is the result after removing duplicate points (over 250K points) and outliers (around 250 data points), and clustering with K=3 (best out of 5 runs with the 'replicates' option):
XX = unique(XX,'rows');
XX(XX(:,1) < -500 | XX(:,1) > 100, :) = [];
XX(XX(:,2) < -500 | XX(:,2) > 100, :) = [];
XX(XX(:,3) < 900 | XX(:,3) > 1500, :) = [];
[IDX,C] = kmeans(XX, 3, 'replicates',5);
with almost an equal split across the three clusters:
>> tabulate(IDX)
Value Count Percent
1 15605 36.92%
2 15048 35.60%
3 11613 27.48%
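The three range filters applied before clustering can also be folded into a single logical mask, which keeps the bounds in one place. This is just an equivalent sketch on a toy matrix P (a made-up stand-in for XX); lo and hi are hypothetical names holding the axis limits given earlier.

```matlab
% Toy stand-in for XX (assumption); rows 2 and 4 fall outside the box.
P = [-100 -200 1000; 500 0 1000; 0 0 1200; -50 -50 2000];

lo = [-500 -500  900];   % lower bounds for x, y, z
hi = [ 100  100 1500];   % upper bounds for x, y, z

% A row is kept only if every coordinate lies inside [lo, hi].
inside = all(bsxfun(@ge, P, lo) & bsxfun(@le, P, hi), 2);
P = P(inside, :);
```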
Recall that the default distance function is the squared Euclidean distance, which explains the shape of the formed clusters.
Amro described the reason clearly:
It is simply telling you that during the assign-recompute iterations, a cluster became empty (lost all assigned points). This is usually caused by an inadequate cluster initialization, or by the data having fewer inherent clusters than you specified.
Another option that could help solve this problem is 'emptyaction':
Action to take if a cluster loses all its member observations.
error: Treat an empty cluster as an error (default).
drop: Remove any clusters that become empty. kmeans sets the corresponding return values in C and D to NaN. (For information about C and D, see the kmeans documentation page.)
singleton: Create a new cluster consisting of the one point furthest from its centroid.
An example:
Let’s run a simple example to see how this option changes the behavior and results of kmeans. This sample tries to partition 3 observations into 3 clusters while 2 of them are located at the same point:
clc;
X = [1 2; 1 2; 2 3];
[I, C] = kmeans(X, 3, 'emptyaction', 'singleton');
[I, C] = kmeans(X, 3, 'emptyaction', 'drop');
[I, C] = kmeans(X, 3, 'emptyaction', 'error')
The first call, with the singleton option, displays a warning and returns:
I =     C =
  3       2   3
  2       1   2
  1       1   2
As you can see, two cluster centroids are created at the same location ([1 2]), and the first two rows of X are assigned to these clusters.
The second call, with the drop option, displays the same warning message but returns different results:
I =     C =
  1       1     2
  1       NaN   NaN
  3       2     3
It returns just two cluster centers and assigns the first two rows of X to the same cluster. I think this option is the most useful most of the time. When observations are too close together and we want as many cluster centers as possible, we can let MATLAB decide on the number. You can remove the NaN rows from C like this:
C(any(isnan(C), 2), :) = [];
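One thing to watch for: once the NaN rows are deleted, the labels in I still use the old cluster numbers (1 and 3 in the output above), so they no longer match the rows of C. A possible way to renumber them, sketched here with hypothetical helper variables keep and newLabel:

```matlab
% Values taken from the 'drop' output above.
I = [1; 1; 3];
C = [1 2; NaN NaN; 2 3];

keep = ~any(isnan(C), 2);      % true for clusters that survived
newLabel = cumsum(keep);       % old cluster number -> new row in C
newLabel(~keep) = NaN;         % dropped clusters have no new label

C = C(keep, :);                % same effect as deleting the NaN rows
I = newLabel(I);               % relabel so I indexes into the trimmed C
```

After this, C(I(n), :) is the centroid assigned to observation n.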
And finally, the third call generates an exception and halts the program, as expected:
Empty cluster created at iteration 1.
If you are confident in your choice of "k=3", here is the code I wrote to avoid getting an empty cluster:
[IDX,C] = kmeans(XX,3,'distance','cosine','start','sample', 'emptyaction','singleton');
while length(unique(IDX))<3 || histc(histc(IDX,[1 2 3]),1)~=0
% i.e. while one of the clusters is empty -- or -- we have one or more clusters with only one member
[IDX,C] = kmeans(XX,3,'distance','cosine','start','sample', 'emptyaction','singleton');
end
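Note that the while loop above never terminates if no run ever satisfies the condition. A variant with a cap on the number of attempts might look like the sketch below. It runs on synthetic data, leaves the distance at its default rather than 'cosine', and maxTries is a hypothetical limit chosen only for illustration.

```matlab
% Synthetic stand-in data (assumption); replace with your own XX.
rng(1);
XX = [randn(100,3); randn(100,3) + 8; randn(100,3) - 8];

K = 3;
maxTries = 20;                 % hypothetical cap so the loop always ends
ok = false;
for t = 1:maxTries
    [IDX, C] = kmeans(XX, K, 'start', 'sample', 'emptyaction', 'singleton');
    counts = accumarray(IDX, 1, [K 1]);   % members per cluster
    if all(counts > 1)         % no empty and no single-member clusters
        ok = true;
        break;
    end
end
if ~ok
    warning('No acceptable clustering after %d tries.', maxTries);
end
```

If the warning fires repeatedly, that is usually a sign that K is too large for the data rather than a matter of bad luck with the initialization.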