In this documentation, there is an example using nltk.collocations.BigramAssocMeasures(), BigramCollocationFinder, nltk.collocations.TrigramAssocMeas
The current NLTK ships hardcoded finders up to QuadgramCollocationFinder, but the reason why you cannot simply create an NgramCollocationFinder still stands: you would have to radically change the formulas in the from_words() function for each order of n-gram.
Short answer: no, you cannot simply create an AbstractCollocationFinder (ACF) and call its nbest() function if you want to find collocations beyond 2- and 3-grams.
That is because from_words() differs for each order of n-gram. Only the subclasses of ACF (i.e. BigramCF and TrigramCF) have a from_words() function.
    >>> import nltk
    >>> from nltk.collocations import AbstractCollocationFinder, BigramCollocationFinder
    >>> finder = BigramCollocationFinder.from_words(nltk.corpus.genesis.words('english-web.txt'))
    >>> finder = AbstractCollocationFinder.from_words(nltk.corpus.genesis.words('english-web.txt'), 5)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    AttributeError: type object 'AbstractCollocationFinder' has no attribute 'from_words'
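By contrast, the concrete finders do expose from_words() and nbest(). A minimal sketch, assuming a current NLTK 3.x install and a made-up in-memory token list (so no corpus download is needed); raw frequency is used as the measure purely for illustration:

```python
from nltk.collocations import (
    BigramAssocMeasures,
    BigramCollocationFinder,
    QuadgramAssocMeasures,
    QuadgramCollocationFinder,
)

# Toy token list (an assumption for this sketch, not a real corpus).
words = (
    "the quick brown fox saw the quick brown fox near the quick brown dog"
).split()

# Each concrete finder has its own from_words() that builds the counts it needs.
bigram_finder = BigramCollocationFinder.from_words(words)
top_bigrams = bigram_finder.nbest(BigramAssocMeasures.raw_freq, 2)

quadgram_finder = QuadgramCollocationFinder.from_words(words)
top_quadgrams = quadgram_finder.nbest(QuadgramAssocMeasures.raw_freq, 1)

print(top_bigrams)
print(top_quadgrams)
```

Each finder is paired with the measures class of the same order, because the scoring functions expect the marginal counts for that particular n.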
So given this from_words() in TrigramCF:
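    from nltk.probability import FreqDist
    from nltk.util import ngrams  # called ingrams() in older NLTK releases

    @classmethod
    def from_words(cls, words):
        # One separate FreqDist per table; (FreqDist(),)*4 would alias a single object.
        wfd, wildfd, bfd, tfd = (FreqDist() for _ in range(4))
        for w1, w2, w3 in ngrams(words, 3, pad_right=True):
            wfd[w1] += 1  # FreqDist.inc() in older NLTK releases
            if w2 is None:
                continue
            bfd[(w1, w2)] += 1
            if w3 is None:
                continue
            wildfd[(w1, w3)] += 1
            tfd[(w1, w2, w3)] += 1
        return cls(wfd, bfd, wildfd, tfd)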
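To see why these particular count tables matter, here is a stdlib-only sketch that mirrors the counting loop above and then feeds the counts into a score. It is an illustration, not NLTK's code: collections.Counter stands in for FreqDist, the names trigram_counts and trigram_pmi are hypothetical, and pointwise mutual information is just one possible measure.

```python
import math
from collections import Counter


def trigram_counts(words):
    """Collect the four count tables used for trigram scoring."""
    wfd, bfd, wildfd, tfd = Counter(), Counter(), Counter(), Counter()
    padded = list(words) + [None, None]  # right-pad so trailing words are counted
    for w1, w2, w3 in zip(padded, padded[1:], padded[2:]):
        wfd[w1] += 1
        if w2 is None:
            continue
        bfd[(w1, w2)] += 1
        if w3 is None:
            continue
        wildfd[(w1, w3)] += 1  # w1 _ w3, with any word in the middle
        tfd[(w1, w2, w3)] += 1
    return wfd, bfd, wildfd, tfd


def trigram_pmi(trigram, wfd, tfd):
    """PMI = log2( p(w1, w2, w3) / (p(w1) * p(w2) * p(w3)) )."""
    n_all = sum(wfd.values())
    w1, w2, w3 = trigram
    return math.log2(tfd[trigram] * n_all ** 2 / (wfd[w1] * wfd[w2] * wfd[w3]))


wfd, bfd, wildfd, tfd = trigram_counts("a b c a b c a b d".split())
print(round(trigram_pmi(("a", "b", "c"), wfd, tfd), 2))  # 3.17
```

The exponent on the total count depends on n, which is one small example of a formula that changes with the order of the n-gram.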
You could hack it and try to hardcode a 4-gram association finder like this:
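    from nltk.probability import FreqDist
    from nltk.util import ngrams  # called ingrams() in older NLTK releases

    @classmethod
    def from_words(cls, words):
        # Separate tables for words, wildcard pairs, bigrams, trigrams and 4-grams.
        wfd, wildfd, bfd, tfd, ffd = (FreqDist() for _ in range(5))
        for w1, w2, w3, w4 in ngrams(words, 4, pad_right=True):
            wfd[w1] += 1
            if w2 is None:
                continue
            bfd[(w1, w2)] += 1
            if w3 is None:
                continue
            wildfd[(w1, w3)] += 1
            tfd[(w1, w2, w3)] += 1
            if w4 is None:
                continue
            # Crude: every pair in the window is lumped into the one wildcard
            # table, and (w1, w3) is counted a second time here.
            wildfd[(w1, w4)] += 1
            wildfd[(w2, w4)] += 1
            wildfd[(w3, w4)] += 1
            wildfd[(w1, w3)] += 1
            wildfd[(w2, w3)] += 1
            wildfd[(w1, w2)] += 1
            ffd[(w1, w2, w3, w4)] += 1
        return cls(wfd, bfd, wildfd, tfd, ffd)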
Then you would also have to change every part of the code that uses the cls object returned from from_words() accordingly.
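One way to see how quickly this grows: an association measure generally wants a count for every non-empty subset of positions in the window, which is 3 patterns for bigrams, 7 for trigrams and 15 for 4-grams. The sketch below enumerates and counts those patterns generically with the standard library. The names position_patterns and pattern_counts are hypothetical, and this is not how NLTK organises its tables.

```python
from collections import Counter
from itertools import combinations


def position_patterns(n):
    """Every non-empty subset of positions inside an n-word window."""
    return [
        positions
        for size in range(1, n + 1)
        for positions in combinations(range(n), size)
    ]


def pattern_counts(words, n):
    """Count the words seen at each position pattern, over all n-word windows."""
    counts = Counter()
    patterns = position_patterns(n)
    for i in range(len(words) - n + 1):
        window = words[i:i + n]
        for positions in patterns:
            counts[(positions, tuple(window[p] for p in positions))] += 1
    return counts


for n in (2, 3, 4):
    print(n, len(position_patterns(n)))  # 2 3, then 3 7, then 4 15

counts = pattern_counts("a b a b".split(), 2)
print(counts[((0, 1), ("a", "b"))])  # 2
```

Counting the patterns is the easy part; the scoring formulas that consume them still have to be written per order, which is why a generic finder is not a drop-in change.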
So you have to ask what is the ultimate purpose of finding the collocations?
If you're looking at retrieving words within collocation windows larger than 2- or 3-grams, then you pretty much end up with a lot of noise in your word retrieval.
If you're going to build a model based on a collocation model using 2- or 3-gram windows, then you will also face sparsity problems.
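The sparsity point can be illustrated on a toy text: already at 2- and 3-grams a large share of the distinct n-grams occur only once, and that share grows with n. This is a minimal stdlib sketch; the sentence and the helper name singleton_rate are made up for illustration.

```python
from collections import Counter


def singleton_rate(words, n):
    """Fraction of distinct n-grams that occur exactly once."""
    grams = Counter(zip(*(words[i:] for i in range(n))))
    once = sum(1 for count in grams.values() if count == 1)
    return once / len(grams)


words = "the cat sat on the mat and the cat sat on the rug".split()
rates = {n: singleton_rate(words, n) for n in (2, 3, 4, 5)}
print(rates)  # {2: 0.5, 3: 0.625, 4: 0.75, 5: 0.875}
```

With counts of one, association scores are estimated from almost no evidence, which is where both the noise and the sparsity come from.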