I've written a piece of code that essentially counts word frequencies and inserts them into an ARFF file for use with Weka. I'd like to alter it so that it can count bi-gram frequencies, i.e. pairs of words instead of single words.
I've rewritten the first bit for you, because it's icky. Points to note:

`collections.Counter` is great!

OK, code:
import re
import nltk
import collections
# Quran subset
filename = raw_input('Enter name of file to convert to ARFF with extension, eg. name.txt: ')
# punctuation and numbers to be removed
punctuation = re.compile(r'[-.?!,":;()|0-9]')
# create list of lower case words
word_list = re.split(r'\s+', open(filename).read().lower())
print 'Words in text:', len(word_list)
words = (punctuation.sub("", word).strip() for word in word_list)
# build the stopword set once -- calling nltk.corpus.stopwords.words() per word is slow
stopwords = set(nltk.corpus.stopwords.words('english'))
# drop stopwords and the empty strings left over after stripping punctuation
words = (word for word in words if word and word not in stopwords)
# create dictionary of word:frequency pairs
frequencies = collections.Counter(words)
print '-'*30
print "sorted by highest frequency first:"
# the Counter maps each word to its frequency
print frequencies
# display result as top 10 most frequent words
print frequencies.most_common(10)
# keep just the words themselves, without their counts
top_words = [word for word, frequency in frequencies.most_common(10)]
print top_words
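As for the bi-gram part of the question: since `Counter` accepts any iterable of hashable items, you can feed it word *pairs* instead of words. A minimal sketch (the `count_bigrams` helper and the sample token list are mine, not from your code) -- pair each word with its successor via `zip`, then count the tuples. Note that this needs a real list, so if your `words` is a generator as above, materialize it first with `list(words)`; nltk also offers `nltk.bigrams()` for the same pairing.

```python
import collections

def count_bigrams(words):
    # zip(words, words[1:]) yields (w0, w1), (w1, w2), ... -- one tuple per
    # adjacent pair; Counter then tallies how often each pair occurs
    return collections.Counter(zip(words, words[1:]))

tokens = ['the', 'quick', 'fox', 'the', 'quick', 'dog']
bigrams = count_bigrams(tokens)
print(bigrams.most_common(2))
```

`bigrams.most_common(10)` then drops straight into the same ARFF-writing code you already have, with tuples as keys instead of strings.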