I want to implement some applications with n-grams (preferably in PHP).
Which type of n-gram is better suited to most purposes: word-level or character-level?
You are correct about the definition of n-grams.
Word-level n-grams suit search-type applications. Character-level n-grams are more useful for analyzing the text itself. For example, to identify the language of a text, I would compare its letter frequencies against the established letter frequencies of each candidate language. The text should roughly match the frequency profile of the language it is written in.
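As a rough illustration of the letter-frequency idea, here is a minimal sketch. The function names (letter_frequencies, profile_distance, guess_language) are made up for this example, and the reference profiles are built from two toy sentences rather than from real language statistics, so treat it as a demonstration of the comparison only. It also only counts the ASCII letters a-z.

```php
<?php
// Hypothetical helper: relative frequency of each letter a-z in $text.
function letter_frequencies(string $text): array
{
    $counts = array_fill_keys(range('a', 'z'), 0);
    $total = 0;
    foreach (str_split(strtolower($text)) as $ch) {
        if (isset($counts[$ch])) {   // ignore anything that is not a-z
            $counts[$ch]++;
            $total++;
        }
    }
    if ($total === 0) {
        return $counts;              // no letters at all: all-zero profile
    }
    foreach ($counts as $ch => $c) {
        $counts[$ch] = $c / $total;
    }
    return $counts;
}

// Sum of squared differences between two a-z profiles (smaller = more similar).
function profile_distance(array $a, array $b): float
{
    $sum = 0.0;
    foreach ($a as $ch => $freq) {
        $sum += ($freq - $b[$ch]) ** 2;
    }
    return $sum;
}

// Returns the key of the reference sample whose letter profile is closest.
// $samples maps a language name to reference text (an assumption of this
// sketch; a real system would use profiles built from large corpora).
function guess_language(string $text, array $samples): string
{
    $profile = letter_frequencies($text);
    $best = '';
    $bestDist = INF;
    foreach ($samples as $language => $sampleText) {
        $d = profile_distance($profile, letter_frequencies($sampleText));
        if ($d < $bestDist) {
            $bestDist = $d;
            $best = $language;
        }
    }
    return $best;
}

// Toy reference texts, far too small for real use.
$samples = [
    'english' => 'the quick brown fox jumps over the lazy dog and then runs through the quiet forest',
    'german'  => 'der schnelle braune fuchs springt ueber den faulen hund und rennt durch den stillen wald',
];
echo guess_language('the dog runs through the forest', $samples), "\n";
```

With samples this small the guess is unreliable. The point is only the shape of the comparison: build a profile, measure the distance to each reference profile, pick the closest.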
In PHP, you can tokenize text into words with strtok:
http://us2.php.net/manual/en/function.strtok.php
For characters, use str_split:
http://us2.php.net/manual/en/function.str-split.php
Then slide a window of n consecutive elements over the resulting array to build n-grams of whatever size you need.
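Putting those two functions together, a minimal sketch could look like the following. The helper names (tokenize_words, sliding_windows, word_ngrams, char_ngrams) and the delimiter list are my own choices for illustration, not anything built into PHP. Note that str_split works on bytes, so for multibyte text such as UTF-8 you would probably want mb_str_split (PHP 7.4+) instead.

```php
<?php
// Split text into lowercase word tokens with strtok.
// The delimiter set is an assumption; adjust it for your data.
function tokenize_words(string $text): array
{
    $delims = " \t\r\n.,;:!?\"()";
    $tokens = [];
    $tok = strtok(strtolower($text), $delims);
    while ($tok !== false) {
        $tokens[] = $tok;
        $tok = strtok($delims);   // continue tokenizing the same string
    }
    return $tokens;
}

// Slide a window of $n consecutive items over $items.
function sliding_windows(array $items, int $n): array
{
    $windows = [];
    if ($n < 1) {
        return $windows;
    }
    for ($i = 0; $i + $n <= count($items); $i++) {
        $windows[] = array_slice($items, $i, $n);
    }
    return $windows;
}

// Word-level n-grams, each joined with a single space.
function word_ngrams(string $text, int $n): array
{
    return array_map(function ($w) {
        return implode(' ', $w);
    }, sliding_windows(tokenize_words($text), $n));
}

// Character-level n-grams (byte-based, see the note about mb_str_split).
function char_ngrams(string $text, int $n): array
{
    if ($text === '') {
        return [];
    }
    return array_map(function ($w) {
        return implode('', $w);
    }, sliding_windows(str_split($text), $n));
}

print_r(word_ngrams('The quick brown fox', 2)); // "the quick", "quick brown", "brown fox"
print_r(char_ngrams('hello', 3));               // "hel", "ell", "llo"
```

If the text is shorter than n, both functions return an empty array.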
Bayesian filters, which are commonly used for spam filtering, can be combined with n-grams, but they need to be trained first. You have to give the filter plenty of input for it to learn.
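To make the combination concrete, here is a small sketch of a naive Bayes classifier that uses word bigrams as features. Everything in it is an assumption for illustration: the function names (bigram_features, nb_train, nb_classify), the add-one smoothing, and the four-message training set, which is far too small to be useful. It needs PHP 7.1+ for the array destructuring in foreach.

```php
<?php
// Hypothetical feature extractor: lowercase word bigrams joined by a space.
function bigram_features(string $text): array
{
    $words = preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    $features = [];
    for ($i = 0; $i + 1 < count($words); $i++) {
        $features[] = $words[$i] . ' ' . $words[$i + 1];
    }
    return $features;
}

// Train on a list of [label, text] pairs by counting bigrams per label.
function nb_train(array $examples): array
{
    $model = ['docs' => [], 'counts' => [], 'totals' => [], 'vocab' => []];
    foreach ($examples as [$label, $text]) {
        $model['docs'][$label] = ($model['docs'][$label] ?? 0) + 1;
        foreach (bigram_features($text) as $f) {
            $model['counts'][$label][$f] = ($model['counts'][$label][$f] ?? 0) + 1;
            $model['totals'][$label] = ($model['totals'][$label] ?? 0) + 1;
            $model['vocab'][$f] = true;
        }
    }
    return $model;
}

// Pick the label with the highest log-probability, using add-one smoothing
// so that bigrams never seen for a label do not zero out the score.
function nb_classify(array $model, string $text): string
{
    $numDocs = array_sum($model['docs']);
    $vocabSize = max(1, count($model['vocab']));
    $features = bigram_features($text);
    $best = '';
    $bestScore = -INF;
    foreach ($model['docs'] as $label => $docCount) {
        $score = log($docCount / $numDocs);           // prior for this label
        $total = $model['totals'][$label] ?? 0;
        foreach ($features as $f) {
            $count = $model['counts'][$label][$f] ?? 0;
            $score += log(($count + 1) / ($total + $vocabSize));
        }
        if ($score > $bestScore) {
            $bestScore = $score;
            $best = $label;
        }
    }
    return $best;
}

// Toy training data, made up for this example.
$model = nb_train([
    ['spam', 'buy cheap pills now'],
    ['spam', 'cheap pills online buy now'],
    ['ham',  'meeting notes for monday'],
    ['ham',  'see you at the meeting monday'],
]);
echo nb_classify($model, 'buy cheap pills'), "\n"; // spam
```

A real filter would need thousands of labeled messages before the bigram counts mean much, which is the "plenty of input" caveat in practice.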
Your last approach sounds decent for learning the context of a page. That is still fairly difficult to do, but n-grams sound like a good starting point.