I\'m trying to understand GMM by reading the sources available online. I have achieved clustering using K-Means and was seeing how GMM would compare to K-means.
Here
Covariance tells you how the data varies in the space, if a distribution has large covariance, that means data is more spread and vice versa. When you have the PDF of a gaussian distribution (mean and covariance params), you can check the membership confidence of a test point under that distribution.
However GMM also suffers from the weakness of K-Means, that you have to pick the parameter K which is the number of clusters. This requires a good understanding of your data's multimodality.
I think it would help if you first look at what a GMM model represents. I'll be using functions from the Statistics Toolbox, but you should be able to do the same using VLFeat.
Let's start with the case of a mixture of two 1-dimensional normal distributions. Each Gaussian is represented by a pair of mean and variance. The mixture assign a weight to each component (prior).
For example, lets mix two normal distributions with equal weights (p = [0.5; 0.5]
), the first centered at 0 and the second at 5 (mu = [0; 5]
), and the variances equal 1 and 2 respectively for the first and second distributions (sigma = cat(3, 1, 2)
).
As you can see below, the mean effectively shifts the distribution, while the variance determines how wide/narrow and flat/pointy it is. The prior sets the mixing proportions to get the final combined model.
% create GMM
mu = [0; 5];
sigma = cat(3, 1, 2);
p = [0.5; 0.5];
gmm = gmdistribution(mu, sigma, p);
% view PDF
ezplot(@(x) pdf(gmm,x));
The idea of EM clustering is that each distribution represents a cluster. So in the example above with one dimensional data, if you were given an instance x = 0.5
, we would assign it as belonging to the first cluster/mode with 99.5% probability
>> x = 0.5;
>> posterior(gmm, x)
ans =
0.9950 0.0050 % probability x came from each component
you can see how the instance falls well under the first bell-curve. Whereas if you take a point in the middle, the answer would be more ambiguous (point assigned to class=2 but with much less certainty):
>> x = 2.2
>> posterior(gmm, 2.2)
ans =
0.4717 0.5283
The same concepts extend to higher dimension with multivariate normal distributions. In more than one dimension, the covariance matrix is a generalization of variance, in order to account for inter-dependencies between features.
Here is an example again with a mixture of two MVN distributions in 2-dimensions:
% first distribution is centered at (0,0), second at (-1,3)
mu = [0 0; 3 3];
% covariance of first is identity matrix, second diagonal
sigma = cat(3, eye(2), [5 0; 0 1]);
% again I'm using equal priors
p = [0.5; 0.5];
% build GMM
gmm = gmdistribution(mu, sigma, p);
% 2D projection
ezcontourf(@(x,y) pdf(gmm,[x y]));
% view PDF surface
ezsurfc(@(x,y) pdf(gmm,[x y]));
There is some intuition behind how the the covariance matrix affects the shape of the joint density function. For instance in 2D, if the matrix is diagonal it implies that the two dimensions don't co-vary. In that case the PDF would look like an axis-aligned ellipse stretched out either horizontally or vertically according to which dimension has the bigger variance. If they are equal, then the shape is a perfect circle (distribution spread out in both dimensions at an equal rate). Finally if the covariance matrix is arbitrary (non-diagonal but still symmetric by definition), then it will probably look like a stretched ellipse rotated at some angle.
So in the previous figure, you should be able to tell the two "bumps" apart and what individual distribution each represent. When you go 3D and higher dimensions, think of the it as representing (hyper-)ellipsoids in N-dims.
Now when you're performing clustering using GMM, the goal is to find the model parameters (mean and covariance of each distribution as well as the priors) so that the resulting model best fits the data. The best-fit estimation translates into maximizing the likelihood of the data given the GMM model (meaning you choose model that maximizes Pr(data|model)
).
As other have explained, this is solved iteratively using the EM algorithm; EM starts with an initial estimate or guess of the parameters of the mixture model. It iteratively re-scores the data instances against the mixture density produced by the parameters. The re-scored instances are then used to update the parameter estimates. This is repeated until the algorithm converges.
Unfortunately the EM algorithm is very sensitive to the initialization of the model, so it might take a long time to converge if you set poor initial values, or even get stuck in local optima. A better way to initial the GMM parameters is to use K-means as a first step (like you've shown in your code), and using the mean/cov of those clusters to initialize EM.
As with other cluster analysis techniques, we first need to decide on the number of clusters to use. Cross-validation is a robust way to find a good estimate of the number of clusters.
EM clustering suffers from the fact that there a lot parameters to fit, and usually requires lots of data and many iterations to get good results. An unconstrained model with M-mixtures and D-dimensional data involves fitting D*D*M + D*M + M
parameters (M covariance matrices each of size DxD, plus M mean vectors of length D, plus a vector of priors of length M). That could be a problem for datasets with large number of dimensions. So it is customary to impose restrictions and assumption to simplify the problem (a sort of regularization to avoid overfitting problems). For instance you could fix the covariance matrix to be only diagonal or even have the covariance matrices shared across all Gaussians.
Finally once you've fitted the mixture model, you can explore the clusters by computing the posterior probability of data instances using each mixture component (like I've showed with the 1D example). GMM assigns each instance to a cluster according to this "membership" likelihood.
Here is a more complete example of clustering data using Gaussian mixture models:
% load Fisher Iris dataset
load fisheriris
% project it down to 2 dimensions for the sake of visualization
[~,data] = pca(meas,'NumComponents',2);
mn = min(data); mx = max(data);
D = size(data,2); % data dimension
% inital kmeans step used to initialize EM
K = 3; % number of mixtures/clusters
cInd = kmeans(data, K, 'EmptyAction','singleton');
% fit a GMM model
gmm = fitgmdist(data, K, 'Options',statset('MaxIter',1000), ...
'CovType','full', 'SharedCov',false, 'Regularize',0.01, 'Start',cInd);
% means, covariances, and mixing-weights
mu = gmm.mu;
sigma = gmm.Sigma;
p = gmm.PComponents;
% cluster and posterior probablity of each instance
% note that: [~,clustIdx] = max(p,[],2)
[clustInd,~,p] = cluster(gmm, data);
tabulate(clustInd)
% plot data, clustering of the entire domain, and the GMM contours
clrLite = [1 0.6 0.6 ; 0.6 1 0.6 ; 0.6 0.6 1];
clrDark = [0.7 0 0 ; 0 0.7 0 ; 0 0 0.7];
[X,Y] = meshgrid(linspace(mn(1),mx(1),50), linspace(mn(2),mx(2),50));
C = cluster(gmm, [X(:) Y(:)]);
image(X(:), Y(:), reshape(C,size(X))), hold on
gscatter(data(:,1), data(:,2), species, clrDark)
h = ezcontour(@(x,y)pdf(gmm,[x y]), [mn(1) mx(1) mn(2) mx(2)]);
set(h, 'LineColor','k', 'LineStyle',':')
hold off, axis xy, colormap(clrLite)
title('2D data and fitted GMM'), xlabel('PC1'), ylabel('PC2')
You are right, there is the same insight behind clustering with K-Means or GMM. But as you mentionned Gaussian Mixtures take data covariances into account. To find the maximum likelihood parameters (or maximum a posteriori MAP) of the GMM statistical model, you need to use an iterative process called the EM algorithm. Each iteration is composed of a E-step (Expectation) and a M-step (Maximization) and repeat until convergence. After convergence you can easily estimate the membership probabilities of each data vectors for each cluster model.