I'm sure there is a post on this, but I couldn't find one asking this exact question. Consider the following:
This is the problem of language modeling. For a baseline approach, the only thing you need is a hash table mapping fixed-length chains of words, say of length k, to the most probable following word.(*)
At training time, you break the input into (k+1)-grams using a sliding window. So if you encounter
The wrath sing, goddess, of Peleus' son, Achilles
you generate, for k=2,
START START the
START the wrath
the wrath sing
wrath sing goddess
sing goddess of
goddess of peleus
of peleus son
peleus son achilles
This can be done in linear time. For each 3-gram, tally (in a hash table) how often the third word follows the first two.
Finally, loop through the hash table and for each key (2-gram) keep only the most commonly occurring third word. Linear time.
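Here is a minimal Python sketch of the training steps above (sliding window, tally, keep the most common follower). The names `tokenize`, `train`, and `START` are my own, and the tokenizer is a deliberate simplification that lowercases the text and keeps only runs of letters; treat it as an illustration rather than a reference implementation.

```python
import re
from collections import Counter, defaultdict

START = "START"  # padding token, as in the example above


def tokenize(text):
    # Simplifying assumption: lowercase and keep only runs of ASCII letters.
    return re.findall(r"[a-z]+", text.lower())


def train(text, k=2):
    words = [START] * k + tokenize(text)
    counts = defaultdict(Counter)
    # Slide a window of k+1 words: the first k form the key,
    # the last one is the word that followed them.
    for i in range(len(words) - k):
        context = tuple(words[i:i + k])
        counts[context][words[i + k]] += 1
    # Keep only the most frequent follower for each context.
    return {
        context: followers.most_common(1)[0][0]
        for context, followers in counts.items()
    }


model = train("The wrath sing, goddess, of Peleus' son, Achilles")
print(model[("START", "START")])  # the
print(model[("wrath", "sing")])   # goddess
```

Both passes touch each window or each table entry once, which is where the linear running time comes from.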
At prediction time, look only at the last k words (here, the last 2) and predict the next word. This takes only constant time since it's just a hash table lookup.
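The lookup itself can be sketched like this in Python. The table below is a small hand-written stand-in for a trained model, and `predict_next` is a hypothetical helper name; returning `None` for an unseen context is just one possible choice, and a real system would probably back off to a shorter context instead.

```python
START = "START"

# Hypothetical trained table for k=2:
# context (tuple of 2 words) -> most frequent next word.
model = {
    ("START", "START"): "the",
    ("START", "the"): "wrath",
    ("the", "wrath"): "sing",
    ("wrath", "sing"): "goddess",
}


def predict_next(model, history, k=2):
    # Pad with START so that short histories still yield a key of k words.
    padded = [START] * k + [word.lower() for word in history]
    # Only the last k words matter; everything earlier is ignored.
    return model.get(tuple(padded[-k:]))  # None if the context was never seen


print(predict_next(model, ["The", "wrath"]))     # sing
print(predict_next(model, []))                   # the
print(predict_next(model, ["unseen", "words"]))  # None
```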
If you're wondering why you should keep only short subchains instead of full chains, then look into the theory of Markov windows. If your model were to remember all the chains of words that it has seen in its input, then it would badly overfit its training data and only reproduce its input at prediction time. How badly depends on the training set (more data is better), but for k>4 you'd really need smoothing in your model.
(*) Or to a probability distribution, but this is not needed for your simple example use case.
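If you do want the distribution rather than a single word, one way to sketch it in Python is to keep all the counts and normalize them per context. The function name `train_distribution` and the toy input are assumptions made for illustration, and no smoothing is applied here.

```python
from collections import Counter, defaultdict


def train_distribution(words, k=2, start="START"):
    padded = [start] * k + list(words)
    counts = defaultdict(Counter)
    for i in range(len(padded) - k):
        counts[tuple(padded[i:i + k])][padded[i + k]] += 1
    # Turn the raw counts for each context into relative frequencies.
    return {
        context: {
            word: count / sum(followers.values())
            for word, count in followers.items()
        }
        for context, followers in counts.items()
    }


dist = train_distribution("a b c a b d a b c".split())
# After ("a", "b") the toy input has "c" twice and "d" once,
# so "c" gets probability 2/3 and "d" gets 1/3.
print(dist[("a", "b")])
```

Taking the highest-probability entry of each inner dictionary recovers the single-word table described above.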
Yee Whye Teh also has some interesting recent work that addresses this problem. The "Sequence Memoizer" extends the traditional prediction-by-partial-matching scheme to take into account arbitrarily long histories.
Here is a link to the original paper: http://www.stats.ox.ac.uk/~teh/research/compling/WooGasArc2011a.pdf
It is also worth reading some of the background work, which can be found in the paper "A Bayesian Interpretation of Interpolated Kneser-Ney".