Counting phrases in Python using NLTK

Posted by 我只是一个虾纸丫 on 2019-12-04 21:40:26

You can get all the two-word phrases using NLTK's collocations module, which identifies words that often appear consecutively within a corpus.

To find two-word phrases, you first need to calculate the frequencies of words and of their appearance in the context of other words. NLTK's BigramCollocationFinder class does this. Here's how to find the bigram collocations:

import re
from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures

# Read the whole file and lowercase it so the [a-z] pattern below matches every word
with open('Words.txt', 'r') as document_text:
    text_string = document_text.read().lower()

# Keep only words of 3 to 15 letters
match_pattern = re.findall(r'\b[a-z]{3,15}\b', text_string)

# Build the finder from the word sequence and print the 2 bigrams with the highest PMI score
finder = BigramCollocationFinder.from_words(match_pattern)
bigram_measures = BigramAssocMeasures()
print(finder.nbest(bigram_measures.pmi, 2))
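
Note that PMI ranks pairs by how strongly the words are associated, not by how often they occur, so very rare pairs can come out on top. If you want actual counts, the finder keeps them in its ngram_fd frequency distribution, and apply_freq_filter can drop rare pairs before ranking. Here is a minimal sketch on an in-memory string; the sample text and the threshold of 2 are assumptions made up for illustration, not part of the original question:

```python
import re
from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures

# Hypothetical sample text standing in for the contents of Words.txt
sample_text = (
    "new york is big. new york is busy. "
    "i love new york and i love ice cream. ice cream is sweet."
)

# Simple word pattern for this sketch (no length limit)
words = re.findall(r"[a-z]+", sample_text.lower())
finder = BigramCollocationFinder.from_words(words)

# Raw counts of adjacent word pairs, most frequent first
top_by_count = finder.ngram_fd.most_common(3)
print(top_by_count)

# Ignore pairs seen fewer than 2 times, then rank the rest by PMI
finder.apply_freq_filter(2)
top_by_pmi = finder.nbest(BigramAssocMeasures().pmi, 2)
print(top_by_pmi)
```

The finder treats the words as one continuous sequence, so pairs that straddle a sentence boundary (such as "big new" here) are counted too.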

NLTK Collocations Docs:

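nltk.bigrams yields each pair of consecutive words in a text; passing those pairs to nltk.FreqDist gives each pair's frequency. Try this:

from nltk import FreqDist, bigrams
from nltk.tokenize import word_tokenize

# word_tokenize needs NLTK's Punkt tokenizer data (installed with nltk.download)
with open('Words.txt', 'r') as document_text:
    text_string = document_text.read().lower()

tokens = word_tokenize(text_string)

# Count every bigram and list the pairs from most to least frequent
result = FreqDist(bigrams(tokens)).most_common()
print(result)

Output:

[(('w1', 'w2'), 6), (('w3', 'w4'), 3), (('w5', 'w6'), 3), (('w7', 'w8'), 3)...]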
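
If you need phrases longer than two words, the same counting idea extends to n-grams. The sketch below is one possible way to do it, using nltk.ngrams together with collections.Counter; the count_phrases helper, its parameters, and the sample sentence are hypothetical names invented here for illustration:

```python
from collections import Counter
from nltk import ngrams


def count_phrases(tokens, n=2, min_count=1):
    """Return (phrase, count) pairs for n-word phrases, most frequent first."""
    counts = Counter(ngrams(tokens, n))
    return [(phrase, c) for phrase, c in counts.most_common() if c >= min_count]


# Hypothetical sample input, split on whitespace to keep the sketch dependency-free
tokens = "the cat sat on the mat and the cat slept on the mat".split()

two_word = count_phrases(tokens, n=2, min_count=2)
three_word = count_phrases(tokens, n=3, min_count=2)
print(two_word)
print(three_word)
```

Splitting on whitespace avoids downloading tokenizer data; swap in word_tokenize if you need punctuation handled properly.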