N-grams: Explanation + 2 applications

醉酒成梦 2021-01-30 18:44

I want to implement some applications with n-grams (preferably in PHP).

Which type of n-gram is more adequate for most purposes? A word level or a character level?

2 answers

囚心锁ツ (OP)

2021-01-30 18:50

You are correct about the definition of n-grams.

Word-level n-grams work well for search-type applications. Character-level n-grams are better suited to analysing the text itself. For example, to identify the language of a text, I would compare the frequencies of its letters against the established letter frequencies of each candidate language. The text's distribution should roughly match that of the language it is written in.
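To make the letter-frequency idea concrete, here is a minimal sketch. It assumes plain ASCII letters and uses cosine similarity to compare profiles, which is just one possible measure. The function names (`letterFrequencies`, `cosineSimilarity`, `guessLanguage`) are made up for illustration, and you have to supply the reference profiles yourself, for example by running `letterFrequencies` over a large sample of each language.

```php
<?php
// Relative frequency of each letter a-z in a text.
// Assumption: only ASCII letters are counted; everything else is ignored.
function letterFrequencies(string $text): array
{
    $letters = preg_replace('/[^a-z]/', '', strtolower($text));
    $total = strlen($letters);
    $freq = array_fill_keys(range('a', 'z'), 0.0);
    if ($total === 0) {
        return $freq;
    }
    // count_chars(..., 1) returns byte value => count for bytes that occur.
    foreach (count_chars($letters, 1) as $byte => $count) {
        $freq[chr($byte)] = $count / $total;
    }
    return $freq;
}

// Cosine similarity between two profiles that share the same keys.
function cosineSimilarity(array $a, array $b): float
{
    $dot = 0.0;
    $normA = 0.0;
    $normB = 0.0;
    foreach ($a as $key => $value) {
        $dot += $value * $b[$key];
        $normA += $value * $value;
        $normB += $b[$key] * $b[$key];
    }
    if ($normA == 0.0 || $normB == 0.0) {
        return 0.0;
    }
    return $dot / (sqrt($normA) * sqrt($normB));
}

// Returns the label of the reference profile closest to the text.
// $profiles maps a label (e.g. 'en') to a letterFrequencies() result.
function guessLanguage(string $text, array $profiles): string
{
    $target = letterFrequencies($text);
    $best = '';
    $bestScore = -1.0;
    foreach ($profiles as $label => $profile) {
        $score = cosineSimilarity($target, $profile);
        if ($score > $bestScore) {
            $bestScore = $score;
            $best = $label;
        }
    }
    return $best;
}
```

Short texts give noisy letter counts, so this will probably only be reliable on longer inputs and with profiles built from a decent amount of sample text.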

An n-gram tokenizer for words in PHP can be done using strtok:

http://us2.php.net/manual/en/function.strtok.php
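A rough sketch of what such a word tokenizer could look like with strtok. The delimiter list and the function names `tokenizeWords` and `wordNgrams` are my own choices for illustration, so adjust them to your text.

```php
<?php
// Split a text into lowercase word tokens with strtok.
// Assumption: whitespace and basic punctuation are the only delimiters.
function tokenizeWords(string $text): array
{
    $delimiters = " \t\r\n.,;:!?\"()";
    $tokens = [];
    $token = strtok($text, $delimiters);
    while ($token !== false) {
        $tokens[] = strtolower($token);
        // Later calls pass only the delimiters and continue on the same string.
        $token = strtok($delimiters);
    }
    return $tokens;
}

// Build word n-grams by sliding a window of $n tokens over the token list.
function wordNgrams(string $text, int $n): array
{
    if ($n < 1) {
        return [];
    }
    $tokens = tokenizeWords($text);
    $ngrams = [];
    for ($i = 0; $i + $n <= count($tokens); $i++) {
        $ngrams[] = implode(' ', array_slice($tokens, $i, $n));
    }
    return $ngrams;
}

print_r(wordNgrams('The quick brown fox.', 2));
```

Note that strtok keeps its position in internal state, so don't start tokenizing a second string before the first loop has finished.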

For characters use str_split:

http://us2.php.net/manual/en/function.str-split.php

Then you can slide a window of n consecutive elements over the resulting array to build n-grams of whatever size you like.
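For the character case, something along these lines should work. `charNgrams` is a hypothetical name. Keep in mind that str_split works on bytes, so for multibyte text (UTF-8) you would probably want mb_str_split instead, which is available from PHP 7.4.

```php
<?php
// Build character n-grams by sliding a window over the str_split result.
// Assumption: single-byte text; str_split splits on bytes, not characters.
function charNgrams(string $text, int $n): array
{
    if ($n < 1 || $text === '') {
        return [];
    }
    $chars = str_split($text);
    $ngrams = [];
    for ($i = 0; $i + $n <= count($chars); $i++) {
        $ngrams[] = implode('', array_slice($chars, $i, $n));
    }
    return $ngrams;
}

// Count how often each bigram occurs.
print_r(array_count_values(charNgrams('banana', 2)));
```

array_count_values gives you the n-gram frequencies directly, which is usually what you need for any kind of comparison or scoring.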

Bayesian filters can be combined with n-grams for spam filtering, but they have to be trained first, and they need plenty of labelled input in order to learn.
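As an illustration of how the two fit together, here is a small naive Bayes sketch that uses word n-grams as features. It is a toy under stated assumptions, not a production filter: the class name `NgramBayesFilter`, the helper `ngramFeatures`, the add-one smoothing and the labels are all my own choices, and it assumes PHP 7.4 or later for typed properties.

```php
<?php
// Hypothetical helper: lowercase word n-grams of a text.
function ngramFeatures(string $text, int $n): array
{
    $words = preg_split('/\W+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    $features = [];
    for ($i = 0; $i + $n <= count($words); $i++) {
        $features[] = implode(' ', array_slice($words, $i, $n));
    }
    return $features;
}

class NgramBayesFilter
{
    private int $n;
    private array $counts = []; // label => [n-gram => count]
    private array $totals = []; // label => total n-grams seen
    private array $docs = [];   // label => number of training texts
    private array $vocab = [];  // every n-gram seen in training

    public function __construct(int $n = 2)
    {
        $this->n = $n;
    }

    // Record the n-grams of one labelled training text.
    public function train(string $label, string $text): void
    {
        $this->docs[$label] = ($this->docs[$label] ?? 0) + 1;
        foreach (ngramFeatures($text, $this->n) as $feature) {
            $this->counts[$label][$feature] = ($this->counts[$label][$feature] ?? 0) + 1;
            $this->totals[$label] = ($this->totals[$label] ?? 0) + 1;
            $this->vocab[$feature] = true;
        }
    }

    // Return the label with the highest log-probability for the text.
    public function classify(string $text): string
    {
        $totalDocs = array_sum($this->docs);
        $vocabSize = count($this->vocab);
        $features = ngramFeatures($text, $this->n);
        $best = '';
        $bestScore = -INF;
        foreach ($this->docs as $label => $docCount) {
            // Prior: share of training texts carrying this label.
            $score = log($docCount / $totalDocs);
            $total = $this->totals[$label] ?? 0;
            foreach ($features as $feature) {
                $count = $this->counts[$label][$feature] ?? 0;
                // Add-one smoothing; the extra +1 reserves a slot for unseen n-grams.
                $score += log(($count + 1) / ($total + $vocabSize + 1));
            }
            if ($score > $bestScore) {
                $bestScore = $score;
                $best = $label;
            }
        }
        return $best;
    }
}
```

You would call `train('spam', $text)` and `train('ham', $text)` for every labelled message and then `classify($text)` on new ones. With only a handful of training texts the result is mostly driven by smoothing, which is why plenty of input matters.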

Your last approach sounds decent as far as learning the context of a page... this is still fairly difficult to do, however, but n-grams sound like a good starting point.
