N-grams: Explanation + 2 applications

醉酒成梦 2021-01-30 18:44

I want to implement some applications with n-grams (preferably in PHP).


Which type of n-gram is more adequate for most purposes: word level or character level?

2 answers
  •  囚心锁ツ
    2021-01-30 18:50

    You are correct about the definition of n-grams.

    You can use word-level n-grams for search-type applications. Character-level n-grams are better suited to analyzing the text itself. For example, to identify the language of a text, I would compare its letter frequencies with the established letter frequencies of each candidate language. That is, the text should roughly match the frequency of occurrence of letters in the language it is written in.
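    As a rough illustration of the letter-frequency idea, here is a minimal sketch. The function names (`letterFrequencies`, `cosineSimilarity`, `guessLanguage`) are made up for this example, cosine similarity is just one possible way to compare profiles, and the code assumes ASCII letters only. The reference profiles are built from sample texts you supply yourself rather than from any published frequency table.

```php
<?php
// Relative frequency of each letter a-z in $text.
// Assumption: ASCII letters only; everything else is discarded.
function letterFrequencies(string $text): array
{
    $letters = preg_replace('/[^a-z]/', '', strtolower($text));
    $total = strlen($letters);
    $freq = array_fill_keys(range('a', 'z'), 0.0);
    if ($total === 0) {
        return $freq;
    }
    // count_chars(..., 1) returns byte value => count for bytes that occur.
    foreach (count_chars($letters, 1) as $byte => $count) {
        $freq[chr($byte)] = $count / $total;
    }
    return $freq;
}

// Cosine similarity between two profiles that share the same keys.
function cosineSimilarity(array $a, array $b): float
{
    $dot = 0.0;
    $normA = 0.0;
    $normB = 0.0;
    foreach ($a as $key => $value) {
        $dot += $value * $b[$key];
        $normA += $value * $value;
        $normB += $b[$key] * $b[$key];
    }
    if ($normA == 0.0 || $normB == 0.0) {
        return 0.0;
    }
    return $dot / (sqrt($normA) * sqrt($normB));
}

// Return the label of the reference sample whose profile is closest to $text.
// $referenceTexts maps a label (e.g. 'en') to a sample text in that language.
function guessLanguage(string $text, array $referenceTexts): ?string
{
    $profile = letterFrequencies($text);
    $best = null;
    $bestScore = -1.0;
    foreach ($referenceTexts as $label => $sample) {
        $score = cosineSimilarity($profile, letterFrequencies($sample));
        if ($score > $bestScore) {
            $best = $label;
            $bestScore = $score;
        }
    }
    return $best;
}
```

    With single letters the profiles of related languages can be close, so in practice you would want long reference samples, and probably character bigrams or trigrams instead of single letters.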

    An n-gram tokenizer for words in PHP can be done using strtok:

    http://us2.php.net/manual/en/function.strtok.php

    For characters use str_split (it splits by byte, so for multibyte text such as UTF-8, mb_str_split is the safer choice):

    http://us2.php.net/manual/en/function.str-split.php

    Then you can slide a window of size n over the resulting array to produce n-grams of any length you like.
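    Something along these lines should work as a starting point. This is a minimal sketch, not a polished tokenizer: the helper names (`tokenizeWords`, `ngrams`, `wordNgrams`, `charNgrams`) and the delimiter set are my own choices for illustration, and `charNgrams` works on bytes because it relies on `str_split`.

```php
<?php
// Split text into word tokens with strtok().
// Assumption: whitespace and basic punctuation are the only delimiters.
function tokenizeWords(string $text): array
{
    $delimiters = " \t\r\n.,;:!?";
    $tokens = [];
    $token = strtok($text, $delimiters);
    while ($token !== false) {
        $tokens[] = $token;
        $token = strtok($delimiters); // continue with the same string
    }
    return $tokens;
}

// Slide a window of size $n over $items; each window is returned as an array.
function ngrams(array $items, int $n): array
{
    if ($n < 1) {
        throw new InvalidArgumentException('n must be at least 1');
    }
    $result = [];
    $count = count($items);
    for ($i = 0; $i + $n <= $count; $i++) {
        $result[] = array_slice($items, $i, $n);
    }
    return $result;
}

// Word-level n-grams, each joined with a single space.
function wordNgrams(string $text, int $n): array
{
    return array_map(
        fn(array $gram): string => implode(' ', $gram),
        ngrams(tokenizeWords($text), $n)
    );
}

// Character-level n-grams (byte-based, because str_split is byte-based).
function charNgrams(string $text, int $n): array
{
    return array_map(
        fn(array $gram): string => implode('', $gram),
        ngrams(str_split($text), $n)
    );
}

print_r(wordNgrams('the quick brown fox', 2));
print_r(charNgrams('gram', 2));
```

    For the inputs above, the word bigrams are "the quick", "quick brown", "brown fox" and the character bigrams are "gr", "ra", "am".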

    Bayesian filters, which can be used in combination with n-grams, need to be trained before they work as spam filters. You need to give the filter plenty of input in order for it to learn.
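    To make the combination concrete, here is a small sketch of a naive Bayes classifier that uses word bigrams as features. The class name `NgramBayes`, the choice of bigrams, and the add-one smoothing with one extra slot for unseen features are all assumptions made for this example. A real spam filter would need far more training data and better tokenization.

```php
<?php
// Minimal naive Bayes classifier over word n-gram features (illustrative only).
class NgramBayes
{
    private array $counts = []; // label => [feature => count]
    private array $totals = []; // label => total number of features seen
    private array $docs = [];   // label => number of training documents
    private array $vocab = [];  // set of all features seen in training

    // Lowercase the text, split on non-alphanumerics, and build word n-grams.
    public static function features(string $text, int $n = 2): array
    {
        $words = preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
        $features = [];
        $count = count($words);
        for ($i = 0; $i + $n <= $count; $i++) {
            $features[] = implode(' ', array_slice($words, $i, $n));
        }
        return $features;
    }

    // Record one labeled training document.
    public function train(string $label, string $text): void
    {
        $this->docs[$label] = ($this->docs[$label] ?? 0) + 1;
        foreach (self::features($text) as $feature) {
            $this->counts[$label][$feature] = ($this->counts[$label][$feature] ?? 0) + 1;
            $this->totals[$label] = ($this->totals[$label] ?? 0) + 1;
            $this->vocab[$feature] = true;
        }
    }

    // Return the label with the highest log-posterior, or null if untrained.
    public function classify(string $text): ?string
    {
        $best = null;
        $bestScore = -INF;
        $docTotal = array_sum($this->docs);
        // Add-one smoothing; the extra +1 reserves a slot for unseen features.
        $smoothing = count($this->vocab) + 1;
        $features = self::features($text);
        foreach ($this->docs as $label => $docCount) {
            $score = log($docCount / $docTotal); // log prior
            $denominator = ($this->totals[$label] ?? 0) + $smoothing;
            foreach ($features as $feature) {
                $count = $this->counts[$label][$feature] ?? 0;
                $score += log(($count + 1) / $denominator);
            }
            if ($score > $bestScore) {
                $best = $label;
                $bestScore = $score;
            }
        }
        return $best;
    }
}
```

    Usage would look like calling `train('spam', ...)` and `train('ham', ...)` on a set of labeled messages, then `classify($message)` on new ones.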

    Your last approach sounds decent as far as learning the context of a page goes. That is still fairly difficult to do, but n-grams sound like a good starting point.
