Counting bi-gram frequencies

感情败类 2021-02-06 17:10

I've written a piece of code that essentially counts word frequencies and inserts them into an ARFF file for use with Weka. I'd like to alter it so that it can count bi-gram frequencies instead of single-word frequencies.
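
For reference, one possible layout for an ARFF file of word counts could look like the snippet below; the relation and attribute names here are only placeholders, not the asker's actual output.

    @relation word_frequencies
    @attribute word string
    @attribute count numeric
    
    @data
    'the',120
    'and',96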

4 Answers
    小鲜肉 2021-02-06 18:04

    I've rewritten the first bit for you, because it's icky. Points to note:

    1. List comprehensions are your friend, use more of them.
    2. collections.Counter is great!

    OK, code:

    import re
    import nltk
    import collections
    
    # Quran subset
    filename = raw_input('Enter name of file to convert to ARFF with extension, eg. name.txt: ')
    
    # punctuation and numbers to be removed
    punctuation = re.compile(r'[-.?!,":;()|0-9]')
    
    # create list of lower case words
    word_list = re.split(r'\s+', open(filename).read().lower())
    print 'Words in text:', len(word_list)
    
    # strip punctuation, then drop empty strings and English stopwords
    stopwords = set(nltk.corpus.stopwords.words('english'))
    words = (punctuation.sub("", word).strip() for word in word_list)
    words = (word for word in words if word and word not in stopwords)
    
    # create dictionary of word:frequency pairs
    frequencies = collections.Counter(words)
    
    print '-'*30
    
    print "sorted by highest frequency first:"
    # create list of (val, key) tuple pairs
    print frequencies
    
    # display the top 10 most frequent words with their counts
    print frequencies.most_common(10)
    
    # the same top 10 words, without their counts
    top_words = [word for word, frequency in frequencies.most_common(10)]
    print top_words
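
    Since the question asks about bi-grams rather than single words, here is a minimal sketch of how the same Counter approach could be extended to pairs of adjacent words. It assumes a filtered word list like the one built above, materialised as a plain list for the example:

    import collections
    
    words = ['this', 'is', 'a', 'small', 'example', 'this', 'is']
    
    # pair each word with the one that follows it:
    # ('this', 'is'), ('is', 'a'), ('a', 'small'), ...
    # nltk.bigrams(words) would produce the same pairs
    bigrams = zip(words, words[1:])
    
    bigram_frequencies = collections.Counter(bigrams)
    print bigram_frequencies.most_common(10)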
    
