How to get n-gram collocations and association in Python NLTK?

执念已碎 2021-02-04 16:47

In this documentation, there is an example using nltk.collocations.BigramAssocMeasures(), BigramCollocationFinder, nltk.collocations.TrigramAssocMeas

2 answers
  •  闹比i (original poster)
    2021-02-04 17:22

    Edited

    The current NLTK has a hardcoded finder for up to 4-grams (QuadgramCollocationFinder), but the reason why you cannot simply create a generic NgramCollocationFinder still stands: you would have to radically change the formulas in the from_words() function for each order of n-gram.
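
    For the orders that are hardcoded, usage looks the same across finders. A minimal sketch, assuming a recent NLTK 3.x where QuadgramCollocationFinder and QuadgramAssocMeasures can be imported from nltk.collocations; the toy token list is made up for illustration:

```python
from nltk.collocations import (
    BigramAssocMeasures,
    BigramCollocationFinder,
    QuadgramAssocMeasures,
    QuadgramCollocationFinder,
    TrigramAssocMeasures,
    TrigramCollocationFinder,
)

# Toy token list (an assumption for illustration; any iterable of tokens works).
tokens = "the quick brown fox jumps over the quick brown fox".split()

# Each order has its own hardcoded finder class and measures class.
finders = {
    2: (BigramCollocationFinder, BigramAssocMeasures),
    3: (TrigramCollocationFinder, TrigramAssocMeasures),
    4: (QuadgramCollocationFinder, QuadgramAssocMeasures),
}

top = {}
for order, (finder_cls, measures) in finders.items():
    finder = finder_cls.from_words(tokens)
    # raw_freq is the simplest measure; swap in pmi, likelihood_ratio, etc.
    top[order] = finder.nbest(measures.raw_freq, 3)
    print(order, top[order])
```

    Note that there is still one class per order; nothing here generalizes to 5-grams and beyond.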


    Short answer: no, you cannot simply create an AbstractCollocationFinder (ACF) and call the nbest() function if you want to find collocations beyond 2- and 3-grams.

    That's because from_words() differs for each order of n-gram. Only the subclasses of ACF (i.e. BigramCF and TrigramCF) have a from_words() function.

    >>> import nltk
    >>> from nltk.collocations import AbstractCollocationFinder, BigramCollocationFinder
    >>> finder = BigramCollocationFinder.from_words(nltk.corpus.genesis.words('english-web.txt'))
    >>> finder = AbstractCollocationFinder.from_words(nltk.corpus.genesis.words('english-web.txt'), 5)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    AttributeError: type object 'AbstractCollocationFinder' has no attribute 'from_words'
    

    So given this from_words() in TrigramCF:

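    from nltk.probability import FreqDist
    from nltk.util import ngrams  # called ingrams in older NLTK releases

    # Method of TrigramCollocationFinder
    @classmethod
    def from_words(cls, words):
        # One independent counter each; (FreqDist(),) * 4 would alias a single object.
        wfd, wildfd, bfd, tfd = (FreqDist() for _ in range(4))

        for w1, w2, w3 in ngrams(words, 3, pad_right=True):
            wfd[w1] += 1  # FreqDist.inc() no longer exists in NLTK 3

            if w2 is None:
                continue
            bfd[(w1, w2)] += 1

            if w3 is None:
                continue
            wildfd[(w1, w3)] += 1  # w1 and w3 with any word in between
            tfd[(w1, w2, w3)] += 1

        return cls(wfd, bfd, wildfd, tfd)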
    

    You could hack it and try to hardcode a 4-gram association finder like this:

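    # Method of a hand-written 4-gram finder class (a rough hack, not NLTK code)
    @classmethod
    def from_words(cls, words):
        # One independent counter each; (FreqDist(),) * n would alias a single object.
        wfd, wildfd, bfd, tfd, ffd = (FreqDist() for _ in range(5))

        # A 4-gram finder needs a window of four words.
        for w1, w2, w3, w4 in ngrams(words, 4, pad_right=True):
            wfd[w1] += 1

            if w2 is None:
                continue
            bfd[(w1, w2)] += 1

            if w3 is None:
                continue
            wildfd[(w1, w3)] += 1
            tfd[(w1, w2, w3)] += 1

            if w4 is None:
                continue
            # Remaining word pairs in the window, all lumped into one counter;
            # (w1, w3) was already counted above.
            wildfd[(w1, w4)] += 1
            wildfd[(w2, w4)] += 1
            wildfd[(w3, w4)] += 1
            wildfd[(w2, w3)] += 1
            wildfd[(w1, w2)] += 1
            ffd[(w1, w2, w3, w4)] += 1

        # The constructor must accept all five frequency distributions.
        return cls(wfd, bfd, wildfd, tfd, ffd)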
    

    You would then also have to change every part of the code that uses the object returned by from_words() (for example, the constructor, which now receives five frequency distributions).
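
    To get a feel for why the formulas change with the order, you can count how many marginal frequencies a contingency-table measure needs for an n-gram. A rough sketch in plain Python, assuming the measure uses every marginal of the full table (the helper name marginal_patterns and the i/x notation, where i marks a fixed position and x a wildcard, are my own choices for illustration):

```python
from itertools import combinations

def marginal_patterns(n):
    """List the position patterns whose counts are needed for an n-gram.

    Each pattern fixes some positions ('i') and leaves the rest open ('x').
    The full n-gram and the empty pattern (total count) are excluded.
    """
    patterns = []
    for size in range(1, n):
        for positions in combinations(range(n), size):
            patterns.append(
                "".join("i" if k in positions else "x" for k in range(n))
            )
    return patterns

for n in (2, 3, 4, 5):
    print(n, len(marginal_patterns(n)))
```

    The count is 2**n - 2, so every extra word roughly doubles the bookkeeping that from_words() and the scoring code have to do by hand.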

    So you have to ask: what is the ultimate purpose of finding the collocations?

    • If you're looking at retrieving words within collocations in windows larger than 2- or 3-grams, then you pretty much end up with a lot of noise in your word retrieval.

    • If you're going to build a model based on collocations using 2- or 3-gram windows, then you will also face sparsity problems.
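
    The sparsity point is easy to see even on a toy example. A minimal sketch in plain Python (the token list and the helper singleton_fraction are made up for illustration) that measures how many distinct n-grams occur only once as n grows:

```python
from collections import Counter

def singleton_fraction(tokens, n):
    """Fraction of distinct n-grams in `tokens` that occur exactly once."""
    counts = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(counts)

tokens = "the cat sat on the mat and the cat sat on the rug".split()
for n in range(2, 6):
    print(n, singleton_fraction(tokens, n))
# 2 0.5
# 3 0.625
# 4 0.75
# 5 0.875
```

    On a real corpus the numbers differ, but the trend is the same: the longer the n-gram, the more of them you only ever see once.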
