How to get n-gram collocations and association in Python NLTK?

执念已碎 2021-02-04 16:47

In the NLTK documentation there is an example using nltk.collocations.BigramAssocMeasures(), BigramCollocationFinder, and nltk.collocations.TrigramAssocMeasures. Is there a built-in way to get collocations and association scores for n-grams beyond trigrams?

2 Answers
  • If you want to find grams beyond 2- or 3-grams, you can use the scikit-learn package together with NLTK's FreqDist to count those grams. I tried doing this with nltk.collocations, but I don't think it can score more than 3-grams, so I went with raw gram counts instead. I hope this helps a little.

    Here is the code:

    from sklearn.feature_extraction.text import CountVectorizer
    import nltk

    query = "This document gives a very short introduction to machine learning problems"

    # extract every 1- to 4-gram from the query
    vect = CountVectorizer(ngram_range=(1, 4))
    analyzer = vect.build_analyzer()
    listNgramQuery = analyzer(query)
    listNgramQuery.reverse()
    print("listNgramQuery=", listNgramQuery)

    # count each n-gram
    NgramQueryWeights = nltk.FreqDist(listNgramQuery)
    print("\nNgramQueryWeights=", NgramQueryWeights)
    

    This gives output like:

    listNgramQuery= ['to machine learning problems', 'introduction to machine learning', 'short introduction to machine', 'very short introduction to', 'gives very short introduction', 'document gives very short', 'this document gives very', 'machine learning problems', 'to machine learning', 'introduction to machine', 'short introduction to', 'very short introduction', 'gives very short', 'document gives very', 'this document gives', 'learning problems', 'machine learning', 'to machine', 'introduction to', 'short introduction', 'very short', 'gives very', 'document gives', 'this document', 'problems', 'learning', 'machine', 'to', 'introduction', 'short', 'very', 'gives', 'document', 'this']
    
    NgramQueryWeights= <FreqDist with 34 samples and 34 outcomes>
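    If all you need is the most frequent n-grams rather than their association scores, FreqDist.most_common() gives a ranked list directly. A minimal sketch in the same spirit as the answer above (the sample text is made up):

```python
from sklearn.feature_extraction.text import CountVectorizer
from nltk.probability import FreqDist

text = ("machine learning problems are hard and machine learning problems "
        "need data so machine learning problems need lots of data")

# extract only 4-grams this time
vect = CountVectorizer(ngram_range=(4, 4))
analyzer = vect.build_analyzer()
fourgrams = analyzer(text)

# rank the 4-grams by raw count
counts = FreqDist(fourgrams)
for gram, n in counts.most_common(3):
    print(n, gram)  # "machine learning problems need" ranks first with count 2
```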
    
  • 2021-02-04 17:22

    Edited

    Current NLTK has a hard-coded QuadgramCollocationFinder for up to 4-grams, but the reasoning for why you cannot simply create a generic NgramCollocationFinder still stands: you would have to radically change the FreqDist bookkeeping in the from_words() function for each order of n-gram.
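    For reference, the built-in quadgram support can be used like this in a current NLTK (the sample sentence is made up; QuadgramAssocMeasures lives in nltk.metrics.association):

```python
from nltk.collocations import QuadgramCollocationFinder
from nltk.metrics.association import QuadgramAssocMeasures

tokens = ("the quick brown fox jumps over the lazy dog "
          "while the quick brown fox rests").split()

finder = QuadgramCollocationFinder.from_words(tokens)

# rank 4-grams by raw frequency; pmi, likelihood_ratio etc. also work
best = finder.nbest(QuadgramAssocMeasures.raw_freq, 2)
print(best[0])  # ('the', 'quick', 'brown', 'fox') occurs twice, so it ranks first
```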


    Short answer: no, you cannot simply use an AbstractCollocationFinder (ACF) and call its nbest() function to find collocations beyond 2- and 3-grams.

    It's because from_words() differs across n-gram orders; only the subclasses of ACF (i.e. BigramCollocationFinder and TrigramCollocationFinder) implement a from_words() function:

    >>> finder = BigramCollocationFinder.from_words(nltk.corpus.genesis.words('english-web.txt'))
    >>> finder = AbstractCollocationFinder.from_words(nltk.corpus.genesis.words('english-web.txt'))
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    AttributeError: type object 'AbstractCollocationFinder' has no attribute 'from_words'
    

    So, given this from_words() in TrigramCollocationFinder:

    from nltk.probability import FreqDist
    from nltk.util import ngrams  # ingrams() from NLTK 2 is now ngrams()

    @classmethod
    def from_words(cls, words):
        # four separate FreqDists; (FreqDist(),)*4 would alias a single object
        wfd, wildfd, bfd, tfd = FreqDist(), FreqDist(), FreqDist(), FreqDist()

        for w1, w2, w3 in ngrams(words, 3, pad_right=True):
            wfd[w1] += 1                # unigram counts

            if w2 is None:
                continue
            bfd[(w1, w2)] += 1          # contiguous bigram counts

            if w3 is None:
                continue
            wildfd[(w1, w3)] += 1       # "w1 _ w3" wildcard bigram counts
            tfd[(w1, w2, w3)] += 1      # trigram counts

        return cls(wfd, bfd, wildfd, tfd)
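    Under NLTK 3 you would not call this classmethod by hand; the equivalent public API looks like this (the example sentence is made up):

```python
from nltk.collocations import TrigramCollocationFinder
from nltk.metrics.association import TrigramAssocMeasures

tokens = ("she sells sea shells by the sea shore "
          "and she sells sea shells again").split()

finder = TrigramCollocationFinder.from_words(tokens)

# the two trigrams that occur twice rank ahead of all the one-off trigrams
top = finder.nbest(TrigramAssocMeasures.raw_freq, 2)
print(top)
```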
    

    You could hack it and hard-code a 4-gram association finder as such:

    @classmethod
    def from_words(cls, words):
        # five separate FreqDists; (FreqDist(),)*n would alias a single object
        wfd, bfd, wildfd, tfd, ffd = (FreqDist() for _ in range(5))

        for w1, w2, w3, w4 in ngrams(words, 4, pad_right=True):
            wfd[w1] += 1

            if w2 is None:
                continue
            bfd[(w1, w2)] += 1

            if w3 is None:
                continue
            wildfd[(w1, w3)] += 1       # "w1 _ w3"
            tfd[(w1, w2, w3)] += 1

            if w4 is None:
                continue
            wildfd[(w1, w4)] += 1       # "w1 _ _ w4"
            wildfd[(w2, w4)] += 1       # "w2 _ w4"
            ffd[(w1, w2, w3, w4)] += 1  # 4-gram counts (the original mixed up ffd/fofd)

        return cls(wfd, bfd, wildfd, tfd, ffd)
    

    Then you would also have to change whichever parts of the code use the FreqDists returned from from_words().

    So you have to ask yourself: what is the ultimate purpose of finding the collocations?

    • If you're retrieving words within collocation windows larger than 2- or 3-grams, you will end up with a lot of noise in your retrieval.

    • If you're building a model based on collocations with 2- or 3-gram windows, you will also face sparsity problems.
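    One mitigation for the window point: BigramCollocationFinder.from_words() accepts a window_size argument, so you can capture non-contiguous pairs without building a higher-order finder. A small sketch (the sample tokens are made up):

```python
from nltk.collocations import BigramCollocationFinder

tokens = ("big data needs big distributed data pipelines "
          "and big fast data stores").split()

# window_size=3 also counts word pairs separated by one intervening word,
# e.g. ('big', 'data') inside "big distributed data"
finder = BigramCollocationFinder.from_words(tokens, window_size=3)
print(finder.ngram_fd[("big", "data")])  # counted 3 times across the windows
```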
