This is a Python and NLTK newbie question.
I want to find the frequency of bigrams which occur together more than 10 times and have the highest PMI.
For this,
The problem is with the way you are trying to use apply_freq_filter.
We are discussing word collocations. As you know, a word collocation is about dependency between words. The BigramCollocationFinder class inherits from a class named AbstractCollocationFinder, and the function apply_freq_filter belongs to that class. apply_freq_filter is not supposed to totally delete some word collocations, but to provide a filtered list of collocations when other functions try to access the list.
Now why is that? Imagine that filtering collocations simply deleted them: then many association measures, such as the likelihood ratio or PMI itself (which compute the probability of a word relative to other words in a corpus), would not function properly after deleting words from random positions in the given corpus. Deleting some collocations from the given list of words would disable many potential functions and computations. Also, computing all of these measures before the deletion would bring a massive computation overhead which the user might not need after all.
Now, the question is how to correctly use apply_freq_filter? There are a few ways. In the following I will show the problem and its solution.
Let's define a sample corpus and split it into a list of words, similar to what you have done:
tweet_phrases = "I love iphone . I am so in love with iphone . iphone is great . samsung is great . iphone sucks. I really really love iphone cases. samsung can never beat iphone . samsung is better than apple"
from nltk.collocations import *
import nltk
For the purpose of experimenting I set the window size to 3:
finder = BigramCollocationFinder.from_words(tweet_phrases.split(), window_size = 3)
finder1 = BigramCollocationFinder.from_words(tweet_phrases.split(), window_size = 3)
Notice that for the sake of comparison I only use the filter on finder1:
finder1.apply_freq_filter(2)
bigram_measures = nltk.collocations.BigramAssocMeasures()
Now if I write:
for k, v in finder.ngram_fd.items():
    print(k, v)
The output is:
(('.', 'is'), 3)
(('iphone', '.'), 3)
(('love', 'iphone'), 3)
(('.', 'iphone'), 2)
(('.', 'samsung'), 2)
(('great', '.'), 2)
(('iphone', 'I'), 2)
(('iphone', 'samsung'), 2)
(('is', '.'), 2)
(('is', 'great'), 2)
(('samsung', 'is'), 2)
(('.', 'I'), 1)
(('.', 'am'), 1)
(('.', 'sucks.'), 1)
(('I', 'am'), 1)
(('I', 'iphone'), 1)
(('I', 'love'), 1)
(('I', 'really'), 1)
(('I', 'so'), 1)
(('am', 'in'), 1)
(('am', 'so'), 1)
(('beat', '.'), 1)
(('beat', 'iphone'), 1)
(('better', 'apple'), 1)
(('better', 'than'), 1)
(('can', 'beat'), 1)
(('can', 'never'), 1)
(('cases.', 'can'), 1)
(('cases.', 'samsung'), 1)
(('great', 'iphone'), 1)
(('great', 'samsung'), 1)
(('in', 'love'), 1)
(('in', 'with'), 1)
(('iphone', 'cases.'), 1)
(('iphone', 'great'), 1)
(('iphone', 'is'), 1)
(('iphone', 'sucks.'), 1)
(('is', 'better'), 1)
(('is', 'than'), 1)
(('love', '.'), 1)
(('love', 'cases.'), 1)
(('love', 'with'), 1)
(('never', 'beat'), 1)
(('never', 'iphone'), 1)
(('really', 'iphone'), 1)
(('really', 'love'), 1)
(('samsung', 'better'), 1)
(('samsung', 'can'), 1)
(('samsung', 'great'), 1)
(('samsung', 'never'), 1)
(('so', 'in'), 1)
(('so', 'love'), 1)
(('sucks.', 'I'), 1)
(('sucks.', 'really'), 1)
(('than', 'apple'), 1)
(('with', '.'), 1)
(('with', 'iphone'), 1)
I will get the same result if I write the same for finder1. So, at first glance the filter doesn't work. However, see how it has worked: the trick is to use score_ngrams.
If I use score_ngrams on finder, it would be:
finder.score_ngrams(bigram_measures.pmi)
and the output is:
[(('am', 'in'), 5.285402218862249), (('am', 'so'), 5.285402218862249), (('better', 'apple'), 5.285402218862249), (('better', 'than'), 5.285402218862249), (('can', 'beat'), 5.285402218862249), (('can', 'never'), 5.285402218862249), (('cases.', 'can'), 5.285402218862249), (('in', 'with'), 5.285402218862249), (('never', 'beat'), 5.285402218862249), (('so', 'in'), 5.285402218862249), (('than', 'apple'), 5.285402218862249), (('sucks.', 'really'), 4.285402218862249), (('is', 'great'), 3.7004397181410926), (('I', 'am'), 3.7004397181410926), (('I', 'so'), 3.7004397181410926), (('cases.', 'samsung'), 3.7004397181410926), (('in', 'love'), 3.7004397181410926), (('is', 'better'), 3.7004397181410926), (('is', 'than'), 3.7004397181410926), (('love', 'cases.'), 3.7004397181410926), (('love', 'with'), 3.7004397181410926), (('samsung', 'better'), 3.7004397181410926), (('samsung', 'can'), 3.7004397181410926), (('samsung', 'never'), 3.7004397181410926), (('so', 'love'), 3.7004397181410926), (('sucks.', 'I'), 3.7004397181410926), (('samsung', 'is'), 3.115477217419936), (('.', 'am'), 2.9634741239748865), (('.', 'sucks.'), 2.9634741239748865), (('beat', '.'), 2.9634741239748865), (('with', '.'), 2.9634741239748865), (('.', 'is'), 2.963474123974886), (('great', '.'), 2.963474123974886), (('love', 'iphone'), 2.7004397181410926), (('I', 'really'), 2.7004397181410926), (('beat', 'iphone'), 2.7004397181410926), (('great', 'samsung'), 2.7004397181410926), (('iphone', 'cases.'), 2.7004397181410926), (('iphone', 'sucks.'), 2.7004397181410926), (('never', 'iphone'), 2.7004397181410926), (('really', 'love'), 2.7004397181410926), (('samsung', 'great'), 2.7004397181410926), (('with', 'iphone'), 2.7004397181410926), (('.', 'samsung'), 2.37851162325373), (('is', '.'), 2.37851162325373), (('iphone', 'I'), 2.1154772174199366), (('iphone', 'samsung'), 2.1154772174199366), (('I', 'love'), 2.115477217419936), (('iphone', '.'), 1.963474123974886), (('great', 'iphone'), 1.7004397181410922), (('iphone', 'great'), 1.7004397181410922), (('really', 'iphone'), 1.7004397181410922), (('.', 'iphone'), 1.37851162325373), (('.', 'I'), 1.37851162325373), (('love', '.'), 1.37851162325373), (('I', 'iphone'), 1.1154772174199366), (('iphone', 'is'), 1.1154772174199366)]
Now notice what happens when I compute the same for finder1, which was filtered to a frequency of 2:
finder1.score_ngrams(bigram_measures.pmi)
and the output:
[(('is', 'great'), 3.7004397181410926), (('samsung', 'is'), 3.115477217419936), (('.', 'is'), 2.963474123974886), (('great', '.'), 2.963474123974886), (('love', 'iphone'), 2.7004397181410926), (('.', 'samsung'), 2.37851162325373), (('is', '.'), 2.37851162325373), (('iphone', 'I'), 2.1154772174199366), (('iphone', 'samsung'), 2.1154772174199366), (('iphone', '.'), 1.963474123974886), (('.', 'iphone'), 1.37851162325373)]
Notice that all the collocations with a frequency of less than 2 are absent from this list; and it's exactly the result you were looking for. So the filter has worked. Also, the documentation gives only a minimal hint about this issue.
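Putting the pieces together for your original goal (bigrams that occur together more than 10 times, ranked by PMI), a minimal sketch could look like the following; tweet_words is just a placeholder for your own tokenized text, since the question does not show the actual input:
import nltk
from nltk.collocations import BigramCollocationFinder

bigram_measures = nltk.collocations.BigramAssocMeasures()
# tweet_words is a placeholder for your own tokenized text (a flat list of words)
tweet_finder = BigramCollocationFinder.from_words(tweet_words)
tweet_finder.apply_freq_filter(11)  # keep only bigrams that occur more than 10 times
for bigram in tweet_finder.nbest(bigram_measures.pmi, 20):  # the 20 highest-PMI survivors
    print(bigram, tweet_finder.ngram_fd[bigram])  # each bigram with its raw frequency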
I hope this has answered your question. Otherwise, please let me know.
Disclaimer: If you are primarily dealing with tweets, a window size of 13 is way too big. If you noticed, my sample tweets were so short that applying a window size of 13 could produce collocations that are irrelevant.
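For comparison, you can rebuild the finder on the same sample corpus with the default window size of 2 (only adjacent words form candidate bigrams) and see how the list of surviving collocations shrinks. This is just a sketch reusing the tweet_phrases and bigram_measures defined above:
finder2 = BigramCollocationFinder.from_words(tweet_phrases.split())  # default window_size=2
finder2.apply_freq_filter(2)
for bigram, score in finder2.score_ngrams(bigram_measures.pmi):
    print(bigram, score)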
Do go through the tutorial at http://nltk.googlecode.com/svn/trunk/doc/howto/collocations.html for more usage of the collocation functions in NLTK, and also the math in https://en.wikipedia.org/wiki/Pointwise_mutual_information. I hope the following script helps you, since your question didn't specify what the input is.
# This is just a fancy way to create document.
# I assume you have your texts in a continuous string format
# where each sentence ends with a fullstop.
>>> from itertools import chain
>>> docs = ["this is a sentence", "this is a foo bar", "you are a foo bar", "yes , i am"]
>>> texts = list(chain(*[(j+" .").split() for j in [i for i in docs]]))
# This is the NLTK part
>>> from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
>>> bigram_measures = BigramAssocMeasures()
>>> finder = BigramCollocationFinder.from_words(texts)
# This gets the top 20 bigrams according to PMI
>>> finder.nbest(bigram_measures.pmi,20)
[(',', 'i'), ('i', 'am'), ('yes', ','), ('you', 'are'), ('foo', 'bar'), ('this', 'is'), ('a', 'foo'), ('is', 'a'), ('a', 'sentence'), ('are', 'a'), ('bar', '.'), ('.', 'yes'), ('.', 'you'), ('am', '.'), ('sentence', '.'), ('.', 'this')]
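If you also want the PMI values rather than just the ranking, score_ngrams (used earlier in this answer) returns (bigram, score) pairs sorted from the highest score to the lowest:
>>> finder.score_ngrams(bigram_measures.pmi)[:5]  # the five highest-PMI bigrams with their scores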
PMI measures the association of two words by calculating log(p(x|y) / p(x)), so it's not only about the frequency of a word or of a set of words co-occurring together. To achieve a high PMI, you need both a high p(x|y) and a low p(x).
Here are some extreme PMI examples.
Let's say you have 100 words in the corpus, the frequency of a certain word X is 1, and it occurs with another word Y only once. Then:
p(x|y) = 1
p(x) = 1/100
PMI = log(1 / (1/100)) = log(100) = 2
Now let's say you have 100 words in the corpus, the frequency of a certain word X is 90, but it never occurs with another word Y. Then the PMI is:
p(x|y) = 0
p(x) = 90/100
PMI = log(0 / (90/100)) = log(0) = -infinity
So in that sense, the PMI between X and Y in the first scenario is much greater (>>>) than in the second, even though the frequency of the second word is very high.
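A minimal sketch of those two back-of-the-envelope calculations, using base-10 logs to match the numbers above and treating log(0) as negative infinity; the pmi helper here is just for illustration:
import math

def pmi(p_x_given_y, p_x):
    # PMI = log(p(x|y) / p(x)); log(0) is taken as negative infinity
    if p_x_given_y == 0:
        return float("-inf")
    return math.log10(p_x_given_y / p_x)

print(pmi(1, 1 / 100.0))    # scenario 1: log10(100) = 2
print(pmi(0, 90 / 100.0))   # scenario 2: -inf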