N-grams: Explanation + 2 applications

后端 未结 2 569
醉酒成梦
醉酒成梦 2021-01-30 18:44

I want to implement some applications with n-grams (preferably in PHP).


Which type of n-grams is more adequate for most purposes? A word level or a character leve

2条回答
  •  醉梦人生
    2021-01-30 19:14

    Word n-grams will generally be more useful for most text analysis applications you mention with the possible exception of language detection, where something like character trigrams might give better results. Effectively, you would create n-gram vector for a corpus of text in each language you are interested in detecting and then compare the frequencies of trigrams in each corpus to the trigrams in the document you are classifying. For example, the trigram the probably appears much more frequently in English than in German and would provide some level of statistical correlation. Once you have your documents in n-gram format, you have a choice of many algorithms for further analysis, Baysian Filters, N- Nearest Neighbor, Support Vector Machines, etc..

    Of the applications you mention, machine translation is probably the most farfetched, as n-grams alone will not bring you very far down the path. Converting an input file to an n-gram representation is just a way to put the data into a format for further feature analysis, but as you lose a lot of contextual information, it may not be useful for translation.

    One thing to watch out for, is that it isn't enough to create a vector [1,1,1,2,1] for one document and a vector [2,1,2,4] for another document, if the dimensions don't match. That is, the first entry in the vector can not be the in one document and is in another or the algorithms won't work. You will wind up with vectors like [0,0,0,0,1,1,0,0,2,0,0,1] as most documents will not contain most n-grams you are interested in. This 'lining up' of features is essential, and it requires you to decide 'in advance' what ngrams you will be including in your analysis. Often, this is implemented as a two pass algorithm, to first decide the statistical significance of various n-grams to decide what to keep. Google 'feature selection' for more information.

    Word based n-grams plus Support Vector Machines in an excellent way to perform topic spotting, but you need a large corpus of text pre classified into 'on topic' and 'off topic' to train the classifier. You will find a large number of research papers explaining various approaches to this problem on a site like citeseerx. I would not recommend the euclidean distance approach to this problem, as it does not weight individual n-grams based on statistical significance, so two documents that both include the, a, is, and of would be considered a better match than two documents that both included Baysian. Removing stop-words from your n-grams of interest would improve this somewhat.

提交回复
热议问题