NLTK collocations for specific words

余生分开走 2020-12-08 08:24

I know how to get bigram and trigram collocations using NLTK and I apply them to my own corpora. The code is below.

I'm not sure, however, about (1) how to get the c

3 answers
  • As for question #2, yes! NLTK includes the likelihood ratio among its association measures. The first question remains unanswered!

    http://nltk.org/api/nltk.metrics.html?highlight=likelihood_ratio#nltk.metrics.association.NgramAssocMeasures.likelihood_ratio
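
To make the measure concrete, here is a minimal sketch that scores a single bigram directly from its counts. The counts are hypothetical, invented only for illustration; the argument order (bigram count, the two word counts as a tuple, total bigram count) follows the marginals convention used by NLTK's association measures.

```python
from nltk.metrics import BigramAssocMeasures

# Hypothetical counts, invented purely for illustration:
n_ii = 8        # times the bigram (w1, w2) occurs
n_ix = 20       # times w1 occurs as the first word of a bigram
n_xi = 30       # times w2 occurs as the second word of a bigram
n_xx = 10_000   # total number of bigrams in the corpus

# Higher scores suggest the two words co-occur more than chance would predict.
score = BigramAssocMeasures.likelihood_ratio(n_ii, (n_ix, n_xi), n_xx)
print(f"likelihood ratio: {score:.2f}")
```

In practice you rarely call the measure by hand; a collocation finder gathers these counts for you and passes them in.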

  • 2020-12-08 09:20

    Question 1 - Try:

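    # Assumes `finder` (a TrigramCollocationFinder) and `trigram_measures`
    # are already defined, as in the question's code.
    target_word = "electronic"  # your choice of word
    # Drop every trigram that does not contain target_word
    finder.apply_ngram_filter(lambda w1, w2, w3: target_word not in (w1, w2, w3))
    for i in finder.score_ngrams(trigram_measures.likelihood_ratio):
        print(i)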
    

    The idea is to filter out whatever you don't want: apply_ngram_filter drops every n-gram for which the function returns True. This method is normally used to filter out words in specific positions of the n-gram, and you can tweak the test to your heart's content.
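
As a sketch of the position-specific variant mentioned above, the filter below keeps only trigrams whose first word is the target. The token list is a toy example invented for illustration; substitute the tokens of your own corpus.

```python
from nltk.collocations import TrigramAssocMeasures, TrigramCollocationFinder

# Toy token list, invented for illustration; use your own corpus tokens instead.
tokens = ("electronic mail system and electronic mail client "
          "and old electronic mail system").split()
target_word = "electronic"

finder = TrigramCollocationFinder.from_words(tokens)
# The filter removes every trigram for which the function returns True,
# so this keeps only trigrams whose FIRST word is target_word.
finder.apply_ngram_filter(lambda w1, w2, w3: w1 != target_word)

# score_ngrams returns (trigram, score) pairs, highest score first.
scored = finder.score_ngrams(TrigramAssocMeasures.likelihood_ratio)
for trigram, score in scored:
    print(trigram, round(score, 2))
```

Changing the test to `w3 != target_word` would instead keep trigrams that end with the target.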

  • 2020-12-08 09:28

    Try this code:

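    import nltk
    from nltk.collocations import BigramCollocationFinder, TrigramCollocationFinder

    # Requires the Genesis corpus: run nltk.download('genesis') once beforehand
    bigram_measures = nltk.collocations.BigramAssocMeasures()
    trigram_measures = nltk.collocations.TrigramAssocMeasures()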
    
    # Returns True for n-grams that lack 'creature'; apply_ngram_filter drops those
    creature_filter = lambda *w: 'creature' not in w
    
    
    ## Bigrams
    finder = BigramCollocationFinder.from_words(
       nltk.corpus.genesis.words('english-web.txt'))
    # only bigrams that appear 3+ times
    finder.apply_freq_filter(3)
    # only bigrams that contain 'creature'
    finder.apply_ngram_filter(creature_filter)
    # return the 10 bigrams with the highest likelihood ratio
    print(finder.nbest(bigram_measures.likelihood_ratio, 10))
    
    
    ## Trigrams
    finder = TrigramCollocationFinder.from_words(
       nltk.corpus.genesis.words('english-web.txt'))
    # only trigrams that appear 3+ times
    finder.apply_freq_filter(3)
    # only trigrams that contain 'creature'
    finder.apply_ngram_filter(creature_filter)
    # return the 10 trigrams with the highest likelihood ratio
    print(finder.nbest(trigram_measures.likelihood_ratio, 10))
    

    It scores with the likelihood-ratio measure and filters out n-grams that don't contain the word 'creature'.
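
If you want to apply the same pattern to your own tokens rather than the Genesis corpus, it can be wrapped in a small helper. This is only a sketch: the function name `collocations_for`, its parameters, and the sample sentence are hypothetical, and whitespace splitting stands in for whatever tokenizer you actually use.

```python
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

def collocations_for(tokens, target, min_freq=2):
    """Return (bigram, score) pairs that contain `target`, best first."""
    finder = BigramCollocationFinder.from_words(tokens)
    # Drop bigrams seen fewer than min_freq times
    finder.apply_freq_filter(min_freq)
    # Drop bigrams that do not contain the target word
    finder.apply_ngram_filter(lambda *w: target not in w)
    return finder.score_ngrams(BigramAssocMeasures.likelihood_ratio)

# Sample text invented for illustration
text = ("the living creature moved and the living creature rested "
        "while the living thing slept")
results = collocations_for(text.split(), "creature")
for bigram, score in results:
    print(bigram, round(score, 2))
```

Returning the scores alongside the n-grams (score_ngrams rather than nbest) makes it easier to pick a cutoff afterwards.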
