I've written a piece of code that essentially counts word frequencies and inserts them into an ARFF file for use with Weka. I'd like to alter it so that it can count bi-gram frequencies, i.e. pairs of words instead of single words.
I've rewritten the first bit for you, because it's icky. Points to note:

`collections.Counter` is great!

OK, code:
import re
import nltk
import collections
# Quran subset
filename = raw_input('Enter name of file to convert to ARFF with extension, eg. name.txt: ')
# punctuation and numbers to be removed
punctuation = re.compile(r'[-.?!,":;()|0-9]')
# create list of lower case words
word_list = re.split(r'\s+', open(filename).read().lower())
print 'Words in text:', len(word_list)
words = (punctuation.sub("", word).strip() for word in word_list)
# build the stopword set once -- calling nltk.corpus.stopwords.words() per word is slow
stopwords = set(nltk.corpus.stopwords.words('english'))
# drop stopwords and the empty strings left over after stripping punctuation
words = (word for word in words if word and word not in stopwords)
# create dictionary of word:frequency pairs
frequencies = collections.Counter(words)
print '-'*30
print "sorted by highest frequency first:"
# the Counter maps each word to its frequency
print frequencies
# display result as top 10 most frequent words
print frequencies.most_common(10)
# keep just the words themselves, without their counts
top_words = [word for word, frequency in frequencies.most_common(10)]
print top_words
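As for the bi-gram part of the question: since `Counter` accepts any iterable of hashable items, you can feed it word *pairs* instead of words. A minimal sketch (the `count_bigrams` helper and the sample token list are mine, not from your code) -- pair each word with its successor via `zip`, then count the tuples. Note that this needs a real list, so if your `words` is a generator as above, materialize it first with `list(words)`; nltk also offers `nltk.bigrams()` for the same pairing.

```python
import collections

def count_bigrams(words):
    # zip(words, words[1:]) yields (w0, w1), (w1, w2), ... -- one tuple per
    # adjacent pair; Counter then tallies how often each pair occurs
    return collections.Counter(zip(words, words[1:]))

tokens = ['the', 'quick', 'fox', 'the', 'quick', 'dog']
bigrams = count_bigrams(tokens)
print(bigrams.most_common(2))
```

`bigrams.most_common(10)` then drops straight into the same ARFF-writing code you already have, with tuples as keys instead of strings.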