Counting bi-gram frequencies

感情败类 2021-02-06 17:10

I've written a piece of code that essentially counts word frequencies and inserts them into an ARFF file for use with Weka. I'd like to alter it so that it can count bi-gram f

4 answers
  • 2021-02-06 17:52

    Generalized to n-grams with optional padding. It uses defaultdict(int) for the frequencies so that it also works in Python 2.6:

    from collections import defaultdict
    
    def ngrams(words, n=2, padding=False):
        "Compute n-grams with optional padding"
        pad = [] if not padding else [None]*(n-1)
        grams = pad + words + pad
        return (tuple(grams[i:i+n]) for i in range(0, len(grams) - (n - 1)))
    
    # grab n-grams
    words = ['the','cat','sat','on','the','dog','on','the','cat']
    for size, padding in ((3, 0), (4, 0), (2, 1)):
        print '\n%d-grams padding=%d' % (size, padding)
        print list(ngrams(words, size, padding))
    
    # show frequency
    counts = defaultdict(int)
    for ng in ngrams(words, 2, False):
        counts[ng] += 1
    
    print '\nfrequencies of bigrams:'
    for c, ng in sorted(((c, ng) for ng, c in counts.iteritems()), reverse=True):
        print c, ng
    

    Output:

    3-grams padding=0
    [('the', 'cat', 'sat'), ('cat', 'sat', 'on'), ('sat', 'on', 'the'), 
     ('on', 'the', 'dog'), ('the', 'dog', 'on'), ('dog', 'on', 'the'), 
     ('on', 'the', 'cat')]
    
    4-grams padding=0
    [('the', 'cat', 'sat', 'on'), ('cat', 'sat', 'on', 'the'), 
     ('sat', 'on', 'the', 'dog'), ('on', 'the', 'dog', 'on'), 
     ('the', 'dog', 'on', 'the'), ('dog', 'on', 'the', 'cat')]
    
    2-grams padding=1
    [(None, 'the'), ('the', 'cat'), ('cat', 'sat'), ('sat', 'on'), 
     ('on', 'the'), ('the', 'dog'), ('dog', 'on'), ('on', 'the'), 
     ('the', 'cat'), ('cat', None)]
    
    frequencies of bigrams:
    2 ('the', 'cat')
    2 ('on', 'the')
    1 ('the', 'dog')
    1 ('sat', 'on')
    1 ('dog', 'on')
    1 ('cat', 'sat')
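
The listing above is Python 2. A minimal Python 3 sketch of the same idea follows, assuming collections.Counter is acceptable in place of defaultdict(int) (Counter is not available in 2.6, which is why the original avoids it). The function keeps the same name and arguments as above; the sample word list is the one from the answer.

```python
from collections import Counter

def ngrams(words, n=2, padding=False):
    """Return a generator of n-grams (tuples), optionally padded with None."""
    pad = [None] * (n - 1) if padding else []
    grams = pad + list(words) + pad
    # One n-gram starts at every index that still leaves n items to its right.
    return (tuple(grams[i:i + n]) for i in range(len(grams) - n + 1))

words = ['the', 'cat', 'sat', 'on', 'the', 'dog', 'on', 'the', 'cat']

# Counter consumes the generator and tallies each distinct bigram.
counts = Counter(ngrams(words, 2))

for ng, c in counts.most_common():
    print(c, ng)
```

Because the n-grams are tuples they are hashable, so they can be used directly as Counter keys.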
    
  • 2021-02-06 18:04

    I've rewritten the first bit for you, because it's icky. Points to note:

    1. List comprehensions are your friend, use more of them.
    2. collections.Counter is great!

    OK, code:

    import re
    import nltk
    import collections
    
    # Quran subset
    filename = raw_input('Enter name of file to convert to ARFF with extension, eg. name.txt: ')
    
    # punctuation and numbers to be removed
    punctuation = re.compile(r'[-.?!,":;()|0-9]')
    
    # create list of lower case words
    word_list = re.split(r'\s+', open(filename).read().lower())
    print 'Words in text:', len(word_list)
    
    # look up the stopword list once instead of once per word
    stopwords = set(nltk.corpus.stopwords.words('english'))
    words = (punctuation.sub("", word).strip() for word in word_list)
    words = (word for word in words if word and word not in stopwords)
    
    # create dictionary of word:frequency pairs
    frequencies = collections.Counter(words)
    
    print '-'*30
    
    print "sorted by highest frequency first:"
    # print the full word -> count mapping
    print frequencies
    
    # display result as top 10 most frequent words
    print frequencies.most_common(10)
    
    [word for word, frequency in frequencies.most_common(10)]
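
The code above counts single words. To get the bi-gram frequencies the question asks for, one option is to pair each token with its successor and feed the pairs to the same Counter. The following is only a Python 3 sketch under stated assumptions: the sample text is hypothetical and stands in for the file contents, and the names tokens and bigram_frequencies are made up for illustration.

```python
import collections
import re

# Hypothetical sample text standing in for the contents of the input file.
text = "The cat sat on the mat. The cat sat on the dog!"

# Same punctuation pattern as in the answer above.
punctuation = re.compile(r'[-.?!,":;()|0-9]')

tokens = [punctuation.sub("", w) for w in text.lower().split()]
tokens = [w for w in tokens if w]  # drop tokens that were only punctuation

# zip pairs each word with the word that follows it; Counter tallies the pairs.
bigram_frequencies = collections.Counter(zip(tokens, tokens[1:]))

print(bigram_frequencies.most_common(3))
```

Stopword removal is left out here on purpose: filtering words out before pairing would turn words that were not adjacent in the text into bi-grams. If stopwords should be ignored, it is probably safer to drop the pairs that contain one after pairing.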
    
  • 2021-02-06 18:12

    Life is much easier if you use NLTK's FreqDist class to do the counting. NLTK also has a bigrams function. Examples of both are on the following page.

    http://nltk.googlecode.com/svn/trunk/doc/book/ch01.html
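
As a rough illustration of the two NLTK pieces mentioned above, here is a minimal Python 3 sketch. It assumes NLTK 3 is installed; the word list is a made-up sample, and no corpus download is needed for these two calls.

```python
from nltk import FreqDist, bigrams

words = ['the', 'cat', 'sat', 'on', 'the', 'dog', 'on', 'the', 'cat']

# bigrams() yields adjacent word pairs; FreqDist counts them like a Counter.
fdist = FreqDist(bigrams(words))

for pair, count in fdist.most_common(2):
    print(count, pair)
```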

  • 2021-02-06 18:15

    This should get you started:

    def bigrams(words):
        wprev = None
        for w in words:
            yield (wprev, w)
            wprev = w
    

    Note that the first bigram is (None, w1) where w1 is the first word, so you have a special bigram that marks start-of-text. If you also want an end-of-text bigram, add yield (wprev, None) after the loop.
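
Putting the note above into code, a Python 3 sketch with the end-of-text bigram and a frequency count might look like this. The mark_end parameter and the sample word list are hypothetical additions for illustration; they are not part of the original answer.

```python
from collections import Counter

def bigrams(words, mark_end=False):
    """Yield (previous, current) pairs; None marks start-of-text."""
    wprev = None
    for w in words:
        yield (wprev, w)
        wprev = w
    if mark_end:
        # Extra pair that marks end-of-text, as described in the note above.
        yield (wprev, None)

words = ['the', 'cat', 'sat', 'on', 'the', 'cat']
counts = Counter(bigrams(words, mark_end=True))

for pair, count in counts.most_common():
    print(count, pair)
```

Since this is a generator, it works on any iterable of words, including one read lazily from a file. Note that with mark_end=True an empty input yields the single pair (None, None).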
