How to interpret Python NLTK bigram likelihood ratios?

Submitted by 自古美人都是妖i on 2019-12-11 17:32:30

Question


I'm trying to figure out how to properly interpret nltk's "likelihood ratio" given the below code (taken from this question).

import nltk.collocations
import nltk.corpus
import collections

bgm = nltk.collocations.BigramAssocMeasures()
finder = nltk.collocations.BigramCollocationFinder.from_words(nltk.corpus.brown.words())
scored = finder.score_ngrams(bgm.likelihood_ratio)

# Group bigrams by first word in bigram.                                        
prefix_keys = collections.defaultdict(list)
for key, scores in scored:
    prefix_keys[key[0]].append((key[1], scores))

for key in prefix_keys:
    prefix_keys[key].sort(key = lambda x: -x[1])

print(prefix_keys['baseball'])

With the following output:

[('game', 32.11075451975229),
 ('cap', 27.81891372457088),
 ('park', 23.509042621473505),
 ('games', 23.10503351305401),
 ("player's", 16.22787286342467),
 ('rightfully', 16.22787286342467),
[...]

Looking at the docs, it looks like the likelihood ratio printed next to each bigram is from

"Scores ngrams using likelihood ratios as in Manning and Schutze 5.3.4."

Referring to this article, which states on pg. 22:

One advantage of likelihood ratios is that they have a clear intuitive interpretation. For example, the bigram powerful computers is e^(.5*82.96) = 1.3*10^18 times more likely under the hypothesis that computers is more likely to follow powerful than its base rate of occurrence would suggest. This number is easier to interpret than the scores of the t test or the χ² test which we have to look up in a table.

What I'm confused about is what would be the "base rate of occurence" in the event that I'm using the nltk code noted above with my own data. Would it be safe to say, for example, that "game" is 32 times more likely to appear next to "baseball" in the current dataset than in the average use of the standard English language? Or is it that "game" is more likely to appear next to "baseball" than other words appearing next to "baseball" within the same set of data?

Any help/guidance towards a clearer interpretation or example is much appreciated!


Answer 1:


nltk does not have a universal corpus of English language usage from which to model the probability of 'game' following 'baseball'.

The likelihood scores reflect how strongly, within this corpus, each of those second words is associated with being preceded by the word 'baseball'.

The "base rate of occurrence" is how often the word 'game' occurs in the corpus overall, irrespective of what precedes it; the likelihood ratio asks whether 'game' follows 'baseball' more often than that base rate alone would predict.

nltk.corpus.brown is a built-in corpus, i.e. a set of observations, and the predictive power of any probability-based model is entirely determined by the observations used to construct or train it.
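To make the "base rate" distinction concrete, here is a toy, pure-Python sketch (all counts are invented for illustration; this is not NLTK's internal code) contrasting the overall rate of 'game' with its rate immediately after 'baseball' within the same corpus:

```python
# Toy corpus: 200 tokens, constructed so that "game" is rare overall
# but very common right after "baseball".
tokens = (["baseball", "game"] * 10
          + ["baseball", "cap"] * 2
          + ["card", "game"] * 5
          + ["the", "weather"] * 83)

bigrams = list(zip(tokens, tokens[1:]))

n_total = len(tokens)
n_game = tokens.count("game")                 # all occurrences of "game"
n_baseball = tokens.count("baseball")         # all occurrences of "baseball"
n_pair = bigrams.count(("baseball", "game"))  # "baseball game" bigrams

base_rate = n_game / n_total     # P(game) anywhere in this corpus
cond_rate = n_pair / n_baseball  # P(game | previous word is "baseball")

print(base_rate, cond_rate)  # the conditional rate far exceeds the base rate
```

A high likelihood-ratio score for ('baseball', 'game') corresponds to exactly this situation: the conditional rate is much larger than the base rate would predict under independence.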

UPDATE in response to OP comment:

"As in 32% of 'game' occurrences are preceded by 'baseball'?" That reading is misleading: the likelihood score does not directly model a frequency distribution of the bigram.

nltk.collocations.BigramAssocMeasures().raw_freq

models raw frequency. Significance tests such as the t-test, by contrast, are not well suited to sparse data like bigram counts, which is why the likelihood ratio is provided.

The likelihood ratio as calculated by Manning and Schutze is not equivalent to frequency.

https://nlp.stanford.edu/fsnlp/promo/colloc.pdf

Section 5.3.4 describes the calculation in detail.

They take into account the frequency of word one in the corpus, the frequency of word two, and the frequency of the bigram itself, in a manner well suited to sparse data such as bigram counts.
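As a sketch of that calculation (the counts below are invented, and this is a direct transcription of the G² log-likelihood form of the statistic from a 2x2 contingency table, not NLTK's exact implementation):

```python
from math import log

def g2(n_ii, n_io, n_oi, n_oo):
    """Log-likelihood ratio G^2 = 2 * sum(O * ln(O / E)) over the 2x2
    contingency table for a bigram (w1, w2):
        n_ii: count(w1 w2)        n_io: count(w1, not-w2)
        n_oi: count(not-w1, w2)   n_oo: count(not-w1, not-w2)
    """
    n = n_ii + n_io + n_oi + n_oo
    row = (n_ii + n_io, n_oi + n_oo)  # marginal counts for w1 / not-w1
    col = (n_ii + n_oi, n_io + n_oo)  # marginal counts for w2 / not-w2
    obs = (n_ii, n_io, n_oi, n_oo)
    exp = (row[0] * col[0] / n, row[0] * col[1] / n,   # expected counts
           row[1] * col[0] / n, row[1] * col[1] / n)   # under independence
    return 2 * sum(o * log(o / e) for o, e in zip(obs, exp) if o > 0)

# e.g. 10 hypothetical "baseball game" bigrams out of 10,000 total, with
# "baseball" occurring 50 times and "game" 100 times overall:
score = g2(10, 40, 90, 9860)
print(round(score, 2))
```

Note that when the observed counts exactly match the counts expected under independence, the score is 0; it grows as the bigram becomes more surprising relative to the marginal word frequencies.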

If you are familiar with the TF-IDF vectorization method, this ratio aims for something similar as far as normalizing noisy features.

The score can grow arbitrarily large. The relative difference between scores reflects the inputs just described (the corpus frequencies of word 1, word 2, and the word1-word2 bigram).

The table in section 5.3.4 (shown as an image in the original answer) is the most intuitive piece of their explanation, unless you're a statistician: the likelihood score is calculated as its leftmost column.



Source: https://stackoverflow.com/questions/48715547/how-to-interpret-python-nltk-bigram-likelihood-ratios
