How to get n-gram collocations and associations in Python NLTK?

执念已碎 2021-02-04 16:47
In this documentation, there is an example using nltk.collocations.BigramAssocMeasures(), BigramCollocationFinder, nltk.collocations.TrigramAssocMeas

      
      
        
2 Answers

闹比i (original poster) 2021-02-04 17:22

Edited

The current NLTK has a hardcoded finder for up to 4-grams (QuadgramCollocationFinder), but the reason you cannot simply create an NgramCollocationFinder still stands: you would have to radically change the formulas in the from_words() function for each order of n-gram.
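
For reference, here is a minimal sketch of the hardcoded 4-gram finder, assuming NLTK 3.x is installed. The token list is made up for illustration, and raw_freq is just one of the available scoring functions, chosen here because it is the simplest.

```python
from nltk.collocations import QuadgramCollocationFinder
from nltk.metrics.association import QuadgramAssocMeasures

# Toy token list (an assumption for this sketch); any sequence of words works.
tokens = "the quick brown fox saw the quick brown fox jump".split()

# Build the 4-gram frequency tables from the token sequence.
quad_finder = QuadgramCollocationFinder.from_words(tokens)

# Rank 4-grams by relative frequency and keep the best one.
top_quads = quad_finder.nbest(QuadgramAssocMeasures.raw_freq, 1)
print(top_quads)
```

The same pattern works with BigramCollocationFinder and TrigramCollocationFinder together with their matching measure classes.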



Short answer: no, you cannot simply create an AbstractCollocationFinder (ACF) and call the nbest() function if you want to find collocations beyond 2- and 3-grams.

That is because from_words() differs for each n-gram order. Only the subclasses of ACF (i.e. BigramCF and TrigramCF) have the from_words() function.

>>> import nltk
>>> from nltk.collocations import AbstractCollocationFinder, BigramCollocationFinder
>>> finder = BigramCollocationFinder.from_words(nltk.corpus.genesis.words('english-web.txt'))
>>> finder = AbstractCollocationFinder.from_words(nltk.corpus.genesis.words('english-web.txt'), 5)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: type object 'AbstractCollocationFinder' has no attribute 'from_words'


So given this from_words() in TrigramCF:

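from nltk.probability import FreqDist
from nltk.util import ngrams  # called ingrams in NLTK 2

@classmethod
def from_words(cls, words):
    # One separate FreqDist per table; (FreqDist(),)*4 would alias a single object.
    wfd, wildfd, bfd, tfd = (FreqDist() for _ in range(4))

    for w1, w2, w3 in ngrams(words, 3, pad_right=True):
        wfd[w1] += 1  # unigram count (FreqDist.inc() in NLTK 2)

        if w2 is None:
            continue
        bfd[(w1, w2)] += 1  # bigram count

        if w3 is None:
            continue
        wildfd[(w1, w3)] += 1  # w1 _ w3, with the middle word as a wildcard
        tfd[(w1, w2, w3)] += 1  # trigram count

    return cls(wfd, bfd, wildfd, tfd)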
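
To see what these four tables end up holding, here is a small standalone sketch that mimics the same counting logic with collections.Counter instead of FreqDist. The helper names padded_ngrams and trigram_tables and the toy token list are my own, not part of NLTK.

```python
from collections import Counter

def padded_ngrams(words, n):
    """Yield n-grams, right-padding with None so trailing words still start a window."""
    padded = list(words) + [None] * (n - 1)
    return zip(*(padded[i:] for i in range(n)))

def trigram_tables(words):
    """Return (unigram, bigram, wildcard-pair, trigram) counts for a word sequence."""
    wfd, bfd, wildfd, tfd = Counter(), Counter(), Counter(), Counter()
    for w1, w2, w3 in padded_ngrams(words, 3):
        wfd[w1] += 1
        if w2 is None:
            continue
        bfd[(w1, w2)] += 1
        if w3 is None:
            continue
        wildfd[(w1, w3)] += 1  # first and third word, middle word ignored
        tfd[(w1, w2, w3)] += 1
    return wfd, bfd, wildfd, tfd

tokens = "a b c a b d".split()
wfd, bfd, wildfd, tfd = trigram_tables(tokens)
print(wfd["a"], bfd[("a", "b")], wildfd[("a", "c")], len(tfd))
```

The right padding is what lets the last two words contribute to the unigram and bigram tables even though they never start a full trigram.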


You could hack it and hardcode a 4-gram association finder along these lines:

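from nltk.probability import FreqDist
from nltk.util import ngrams  # called ingrams in NLTK 2

@classmethod
def from_words(cls, words):
    # Separate tables; (FreqDist(),)*n would bind every name to one shared object.
    wfd, wildfd = FreqDist(), FreqDist()
    bfd, tfd, ffd = FreqDist(), FreqDist(), FreqDist()

    # A 4-gram finder needs a window of 4 words, padded on the right with None.
    for w1, w2, w3, w4 in ngrams(words, 4, pad_right=True):
        wfd[w1] += 1

        if w2 is None:
            continue
        bfd[(w1, w2)] += 1

        if w3 is None:
            continue
        wildfd[(w1, w3)] += 1
        tfd[(w1, w2, w3)] += 1

        if w4 is None:
            continue
        # Rough hack: these pair counts mix gapped and adjacent pairs, and
        # (w1, w3) is counted a second time here. A real 4-gram finder needs
        # a separate table per pattern.
        wildfd[(w1, w4)] += 1
        wildfd[(w2, w4)] += 1
        wildfd[(w3, w4)] += 1
        wildfd[(w1, w3)] += 1
        wildfd[(w2, w3)] += 1
        wildfd[(w1, w2)] += 1
        ffd[(w1, w2, w3, w4)] += 1

    # cls must accept the extra 4-gram table for this to work.
    return cls(wfd, bfd, wildfd, tfd, ffd)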


Then you would also have to change every part of the code that uses the cls instance returned by from_words(), such as the scoring functions that read those tables.
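
To give a feel for why each order needs its own formulas, here is a rough sketch that counts every pattern keeping the first word of the window plus any subset of the following words. The function pattern_tables and the "i"/"x" pattern names are my own shorthand for this illustration, not an NLTK API. The number of tables doubles with each extra word, and every association measure has to be rewritten to consume them.

```python
from collections import Counter
from itertools import product

def pattern_tables(words, n):
    """Count all patterns that keep word 1 and any subset of words 2..n.

    Table names use 'i' for a kept position and 'x' for a wildcard,
    e.g. 'ixi' counts (w1, w3) pairs with the middle word ignored.
    """
    words = list(words)
    padded = words + [None] * (n - 1)
    masks = list(product((False, True), repeat=n - 1))
    names = ["i" + "".join("i" if keep else "x" for keep in m) for m in masks]
    tables = {name: Counter() for name in names}

    for start in range(len(words)):
        window = padded[start:start + n]
        for name, mask in zip(names, masks):
            kept = [window[0]] + [w for w, keep in zip(window[1:], mask) if keep]
            if None in kept:
                continue  # pattern runs past the end of the text
            tables[name][tuple(kept)] += 1
    return tables

tokens = "a b c a b d".split()
for n in (2, 3, 4):
    print(n, sorted(pattern_tables(tokens, n)))
```

For n = 3 this reproduces the word, bigram, wildcard and trigram tables from the trigram case; for n = 4 there are already eight tables to maintain.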

So you have to ask: what is the ultimate purpose of finding the collocations?


If you're looking at retrieving words within collocation windows larger than 2- or 3-grams, then you pretty much end up with a lot of noise in your word retrieval.
If you're going to build a model based on collocations using 2- or 3-gram windows, then you will also face sparsity problems.
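
The sparsity point can be illustrated with a tiny sketch that measures what share of distinct n-grams occur only once as n grows. The function singleton_share and the toy sentence are assumptions made for this example; on a real corpus the numbers will differ, but the trend is usually the same.

```python
from collections import Counter

def singleton_share(words, n):
    """Fraction of distinct n-grams that occur exactly once in the word sequence."""
    words = list(words)
    counts = Counter(zip(*(words[i:] for i in range(n))))
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(counts)

tokens = "the cat sat on the mat and the cat sat on the rug".split()
for n in range(1, 5):
    print(n, round(singleton_share(tokens, n), 3))
```

The more n-grams are seen only once, the less reliable any association score computed from their counts becomes.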

    
             
                                                        
            
            
              
                