I\'ve been self-studying the Expectation Maximization lately, and grabbed myself some simple examples in the process:
http://cs.dartmouth.edu/~cs104/CS104_11.04.22.pdf T
As luck would have it, I have been struggling with this material recently as well. Here is how I have come to think of it:
Consider a related, but distinct algorithm called the classify-maximize algorithm, which we might use as a solution technique for a mixture model problem. A mixture model problem is one where we have a sequence of data that may be produced by any of N different processes, of which we know the general form (e.g., Gaussian) but we do not know the parameters of the processes (e.g., the means and/or variances) and may not even know the relative likelihood of the processes. (Typically we do at least know the number of the processes. Without that, we are into so-called "non-parametric" territory.) In a sense, the process which generates each data is the "missing" or "hidden" data of the problem.
Now, what this related classify-maximize algorithm does is start with some arbitrary guesses at the process parameters. Each data point is evaluated according to each one of those parameter processes, and a set of probabilities is generated-- the probability that the data point was generated by the first process, the second process, etc, up to the final Nth process. Then each data point is classified according to the most likely process.
At this point, we have our data separated into N different classes. So, for each class of data, we can, with some relatively simple calculus, optimize the parameters of that cluster with a maximum likelihood technique. (If we tried to do this on the whole data set prior to classifying, it is usually analytically intractable.)
Then we update our parameter guesses, re-classify, update our parameters, re-classify, etc, until convergence.
What the expectation-maximization algorithm does is similar, but more general: Instead of a hard classification of data points into class 1, class 2, ... through class N, we are now using a soft classification, where each data point belongs to each process with some probability. (Obviously, the probabilities for each point need to sum to one, so there is some normalization going on.) I think we might also think of this as each process/guess having a certain amount of "explanatory power" for each of the data points.
So now, instead of optimizing the guesses with respect to points that absolutely belong to each class (ignoring the points that absolutely do not), we re-optimize the guesses in the context of those soft classifications, or those explanatory powers. And it so happens that, if you write the expressions in the correct way, what you're maximizing is a function that is an expectation in its form.
With that said, there are some caveats:
1) This sounds easy. It is not, at least to me. The literature is littered with a hodge-podge of special tricks and techniques-- using likelihood expressions instead of probability expressions, transforming to log-likelihoods, using indicator variables, putting them in basis vector form and putting them in the exponents, etc.
These are probably more helpful once you have the general idea, but they can also obfuscate the core ideas.
2) Whatever constraints you have on the problem can be tricky to incorporate into the framework. In particular, if you know the probabilities of each of the processes, you're probably in good shape. If not, you're also estimating those, and the sum of the probabilities of the processes must be one; they must live on a probability simplex. It is not always obvious how to keep those constraints intact.
3) This is a sufficiently general technique that I don't know how I would go about writing code that is general. The applications go far beyond simple clustering and extend to many situations where you are actually missing data, or where the assumption of missing data may help you. There is a fiendish ingenuity at work here, for many applications.
4) This technique is proven to converge, but the convergence is not necessarily to the global maximum; be wary.
I found the following link helpful in coming up with the interpretation above: Statistical learning slides
And the following write-up goes into great detail of some painful mathematical details: Michael Collins' write-up