I am trying to get a phrase count from a text file but so far I am only able to obtain a word count (see below). I need to extend this logic to count the number of times a two-word phrase appears in the text file.
Phrases can be defined/grouped by using logic from NLTK from my understanding. I believe the collections function is what I need to obtain the desired result, but I'm not sure how to go about implementing it from reading the NLTK documentation. Any tips/help would be greatly appreciated.
import re
import string
frequency = {}
document_text = open('Words.txt', 'r')
text_string = document_text.read().lower()
match_pattern = re.findall(r'\b[a-z]{3,15}\b', text_string)
for word in match_pattern:
count = frequency.get(word,0)
frequency[word] = count + 1
frequency_list = frequency.keys()
for words in frequency_list:
print (words, frequency[words])
You can get all the two word phrases using the collocations
module. This tool identifies words that often appear consecutively within corpora.
To find the two word phrases you need to first calculate the frequencies of words and their appearance in the context of other words. NLTK has a BigramCollocationFinder
class that can do this. Here's how we can find the Bigram Collocations:
import re
import string
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures
frequency = {}
document_text = open('Words.txt', 'r')
text_string = document_text.read().lower()
match_pattern = re.findall(r'\b[a-z]{3,15}\b', text_string)
finder = BigramCollocationFinder.from_words(match_pattern)
bigram_measures = nltk.collocations.BigramAssocMeasures()
print(finder.nbest(bigram_measures.pmi, 2))
NLTK Collocations Docs: http://www.nltk.org/api/nltk.html?highlight=collocation#module-nltk.collocations
nltk.brigrams
returns a pair of words and their frequency in an specific text. Try this:
import nltk
from nltk import bigrams
document_text = open('Words.txt', 'r')
text_string = document_text.read().lower()
tokens = word_tokenize(text_string)
result = bigrams(tokens)
Output:
[(('w1', 'w2'), 6), (('w3', 'w4'), 3), (('w5', 'w6'), 3), (('w7', 'w8'), 3)...]
来源:https://stackoverflow.com/questions/39691816/counting-phrases-in-python-using-nltk