Understanding NLTK collocation scoring for bigrams and trigrams

一生所求 2021-01-30 03:31

Background:

I am trying to compare pairs of words to see which pair is "more likely to occur" in US English than another pair. My plan is/was to use

1 answer
  • 2021-01-30 04:12

    The NLTK collocations documentation seems pretty good to me: http://www.nltk.org/howto/collocations.html
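For intuition about what an association measure such as the likelihood ratio is scoring, here is a minimal sketch of the standard log-likelihood ratio (G²) statistic computed from a 2x2 contingency table of bigram counts. It is plain Python, not NLTK's own code, and NLTK's implementation may differ in smoothing details. The function name, the parameter names, and the counts in the usage lines are all made up for illustration.

```python
import math


def log_likelihood_ratio(n_ii, n_ix, n_xi, n_xx):
    """G^2 association score for a bigram (w1, w2).

    n_ii: count of the bigram (w1, w2)
    n_ix: count of bigrams whose first word is w1
    n_xi: count of bigrams whose second word is w2
    n_xx: total number of bigrams
    """
    # Observed cells: (w1, w2), (w1, other), (other, w2), (other, other).
    observed = [
        n_ii,
        n_ix - n_ii,
        n_xi - n_ii,
        n_xx - n_ix - n_xi + n_ii,
    ]
    # Row and column totals that belong to each cell, in the same order.
    row_totals = [n_ix, n_ix, n_xx - n_ix, n_xx - n_ix]
    col_totals = [n_xi, n_xx - n_xi, n_xi, n_xx - n_xi]
    # Expected counts if w1 and w2 occurred independently of each other.
    expected = [r * c / n_xx for r, c in zip(row_totals, col_totals)]
    # Empty cells contribute nothing to the sum, so skip them.
    return 2.0 * sum(
        o * math.log(o / e) for o, e in zip(observed, expected) if o > 0
    )


# Hypothetical counts: the pair co-occurs exactly as often as chance predicts.
print(log_likelihood_ratio(10, 100, 100, 1000))
# Hypothetical counts: the pair co-occurs five times more often than chance.
print(log_likelihood_ratio(50, 100, 100, 1000))
```

The score grows as the observed table moves away from what independence would predict, which is why rare words with only one or two co-occurrences can still score highly on a small corpus.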

    You need to give the scorer some actual sizable corpus to work with. Here is a working example using the Brown corpus built into NLTK. It takes about 30 seconds to run.

    import collections

    import nltk.collocations
    import nltk.corpus

    # Requires the Brown corpus data, e.g. via nltk.download('brown').
    bgm = nltk.collocations.BigramAssocMeasures()
    finder = nltk.collocations.BigramCollocationFinder.from_words(
        nltk.corpus.brown.words())
    # List of ((word1, word2), score) pairs, one per distinct bigram.
    scored = finder.score_ngrams(bgm.likelihood_ratio)

    # Group bigrams by the first word in the bigram.
    prefix_keys = collections.defaultdict(list)
    for key, score in scored:
        prefix_keys[key[0]].append((key[1], score))

    # Sort each word's collocates, strongest association first.
    for key in prefix_keys:
        prefix_keys[key].sort(key=lambda x: -x[1])

    print('doctor', prefix_keys['doctor'][:5])
    print('baseball', prefix_keys['baseball'][:5])
    print('happy', prefix_keys['happy'][:5])
    

    The output seems reasonable: it works well for baseball, less so for doctor and happy.

    doctor [('bills', 35.061321987405748), (',', 22.963930079491501),
      ('annoys', 19.009636692022365),
      ('had', 16.730384189212423), ('retorted', 15.190847940499127)]

    baseball [('game', 32.110754519752291), ('cap', 27.81891372457088),
      ('park', 23.509042621473505), ('games', 23.105033513054011),
      ("player's", 16.227872863424668)]

    happy [("''", 20.296341424483998), ('Spahn', 13.915820697905589),
      ('family', 13.734352182441569),
      (',', 13.55077617193821), ('bodybuilder', 13.513265447290536)]
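Part of the weaker output for doctor and happy comes from punctuation tokens such as `,` and `''` being counted as collocates. A minimal sketch of one way to drop them after scoring, assuming the `(word, score)` list shape used above; the helper name `keep_word_collocates` is hypothetical, and the sample data is copied from the doctor output.

```python
def keep_word_collocates(collocates):
    """Keep only (token, score) pairs whose token contains a letter."""
    return [
        (token, score)
        for token, score in collocates
        if any(ch.isalpha() for ch in token)
    ]


# Top five collocates for 'doctor', copied from the output above.
doctor = [
    ('bills', 35.061321987405748),
    (',', 22.963930079491501),
    ('annoys', 19.009636692022365),
    ('had', 16.730384189212423),
    ('retorted', 15.190847940499127),
]

for token, score in keep_word_collocates(doctor):
    print(token, score)
```

Filtering after scoring leaves the scores themselves unchanged. NLTK's finder also has `apply_word_filter` and `apply_freq_filter`, which remove candidate bigrams before they are scored; whether that gives better results for this comparison would need checking on the actual corpus.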
    