Counting phrases in Python using NLTK

Posted by 我只是一个虾纸丫 on 2019-12-04 21:40:26

You can extract two-word phrases with NLTK's collocations module, which identifies words that frequently appear next to each other in a corpus.

To find two-word phrases, you first need to count how often each word occurs and how often it occurs next to other words. NLTK's BigramCollocationFinder class does this. Here's how to find the bigram collocations:

import re
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# Read the document and lowercase it so that counts are case-insensitive
with open('Words.txt', 'r') as document_text:
    text_string = document_text.read().lower()

# Keep only alphabetic words of 3 to 15 letters
match_pattern = re.findall(r'\b[a-z]{3,15}\b', text_string)

# Count words and adjacent word pairs
finder = BigramCollocationFinder.from_words(match_pattern)
bigram_measures = BigramAssocMeasures()
# Print the two bigrams with the highest PMI score
print(finder.nbest(bigram_measures.pmi, 2))

NLTK Collocations Docs: http://www.nltk.org/api/nltk.html?highlight=collocation#module-nltk.collocations
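
To make the scoring step less opaque, here is a minimal standard-library sketch of the pointwise mutual information (PMI) score that the snippet above ranks by. It assumes the usual definition, log2(count(w1, w2) * N / (count(w1) * count(w2))) with N the number of tokens. The function name pmi_scores and the sample sentence are made up for illustration, and NLTK's own implementation may differ in details.

```python
import math
from collections import Counter

def pmi_scores(tokens):
    """Return {(w1, w2): PMI} for every adjacent word pair in tokens."""
    word_counts = Counter(tokens)                    # how often each word occurs
    pair_counts = Counter(zip(tokens, tokens[1:]))   # how often each adjacent pair occurs
    total = len(tokens)
    return {
        (w1, w2): math.log2(count * total / (word_counts[w1] * word_counts[w2]))
        for (w1, w2), count in pair_counts.items()
    }

# Hypothetical sample text, already lowercased and split into words
tokens = "new york is big and new york is busy".split()
scores = pmi_scores(tokens)

# Show the three highest-scoring pairs
for pair, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:3]:
    print(pair, round(score, 3))
```

Note that in this sample the pair that occurs only once, ('big', 'and'), outscores the repeated ('new', 'york'), because PMI rewards words that rarely appear apart. If that is not what you want, NLTK's finder has an apply_freq_filter method that drops pairs below a minimum count before scoring.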

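nltk.bigrams returns the pairs of adjacent words in a text. To get the frequency of each pair, count them, for example with nltk.FreqDist. Try this:

import nltk
from nltk import bigrams
from nltk.tokenize import word_tokenize

# word_tokenize requires NLTK's Punkt tokenizer data to be downloaded
with open('Words.txt', 'r') as document_text:
    text_string = document_text.read().lower()

tokens = word_tokenize(text_string)
# bigrams() yields adjacent word pairs; FreqDist counts how often each occurs
result = nltk.FreqDist(bigrams(tokens)).most_common()
print(result)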

Output:

[(('w1', 'w2'), 6), (('w3', 'w4'), 3), (('w5', 'w6'), 3), (('w7', 'w8'), 3)...]
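
If you would rather not depend on NLTK for the counting itself, the same (phrase, count) shape can be produced with collections.Counter. This is only a sketch: count_phrases and its n parameter are hypothetical names, and the plain split() stands in for a real tokenizer.

```python
from collections import Counter

def count_phrases(tokens, n=2):
    """Count every run of n consecutive tokens, most frequent first."""
    # zip over n shifted copies of the list to get a sliding window of width n
    windows = zip(*(tokens[i:] for i in range(n)))
    return Counter(windows).most_common()

# Hypothetical sample text, split on whitespace
tokens = "the cat sat on the mat and the cat slept".split()
result = count_phrases(tokens)
print(result[:2])
```

Passing n=3 counts three-word phrases the same way.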