Is there a more efficient way to find most common n-grams?

悲&欢浪女 2021-02-14 08:06

I'm trying to find the k most common n-grams from a large corpus. I've seen lots of places suggesting the naïve approach - simply scanning through the entire corpus and keeping a

1 answer
  •  梦如初夏
    2021-02-14 08:50

    In Python, using NLTK:

    $ wget http://norvig.com/big.txt
    $ python
    >>> from collections import Counter
    >>> from nltk import ngrams
    >>> bigtxt = open('big.txt').read()
    >>> ngram_counts = Counter(ngrams(bigtxt.split(), 2))
    >>> ngram_counts.most_common(10)
    [(('of', 'the'), 12422), (('in', 'the'), 5741), (('to', 'the'), 4333), (('and', 'the'), 3065), (('on', 'the'), 2214), (('at', 'the'), 1915), (('by', 'the'), 1863), (('from', 'the'), 1754), (('of', 'a'), 1700), (('with', 'the'), 1656)]
    

    In plain Python, without NLTK (see "Fast/Optimize N-gram implementations in python"):

    >>> import collections
    >>> def ngrams(text, n=2):
    ...     # zip n staggered copies of the token list: (t0, t1), (t1, t2), ...
    ...     return zip(*[text[i:] for i in range(n)])
    ...
    >>> ngram_counts = collections.Counter(ngrams(bigtxt.split(), 2))  # bigtxt as loaded above
    >>> ngram_counts.most_common(10)
    [(('of', 'the'), 12422), (('in', 'the'), 5741), (('to', 'the'), 4333), (('and', 'the'), 3065), (('on', 'the'), 2214), (('at', 'the'), 1915), (('by', 'the'), 1863), (('from', 'the'), 1754), (('of', 'a'), 1700), (('with', 'the'), 1656)]
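
    Both snippets above read the whole file and build the full token list before counting. For a corpus that does not fit comfortably in memory, a streaming variant may help. The following is only a sketch: `stream_ngrams` and `top_ngrams` are hypothetical names, not part of NLTK or the standard library. It should produce the same n-grams as the `split()`-based versions, because n-grams are allowed to span line breaks. The `Counter` still holds every distinct n-gram, so this reduces peak memory for the tokens only and is not necessarily faster.

```python
from collections import Counter, deque


def stream_ngrams(lines, n=2):
    """Yield n-grams as tuples from an iterable of text lines.

    Hypothetical helper: keeps only the last n tokens in memory
    instead of materialising the whole token list.
    """
    window = deque(maxlen=n)  # sliding window over the token stream
    for line in lines:
        for token in line.split():
            window.append(token)
            if len(window) == n:
                yield tuple(window)


def top_ngrams(lines, n=2, k=10):
    """Return the k most common n-grams from an iterable of lines."""
    return Counter(stream_ngrams(lines, n)).most_common(k)


if __name__ == "__main__":
    # A file object iterates line by line, so it could be passed directly,
    # e.g. top_ngrams(open("big.txt"), 2, 10). A tiny in-memory sample here:
    sample = ["a b c", "d a b"]
    print(top_ngrams(sample, n=2, k=1))
```

    `Counter.most_common(k)` selects the top k without sorting every entry, so the final selection step is unlikely to be the bottleneck; the single pass over the corpus and the size of the counter usually are.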
    

    In Julia (see "Generate ngrams with Julia"):

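    # Written for Julia 0.5/0.6; readstring was removed in Julia 1.0,
    # where read("big.txt", String) reads a file into a string.
    import StatsBase: countmap
    import Iterators: partition
    bigtxt = readstring(open("big.txt"))
    # partition(tokens, 2, 1): windows of 2 tokens, advancing 1 token at a time
    ngram_counts = countmap(collect(partition(split(bigtxt), 2, 1)))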
    

    Rough timing:

    $ time python ngram-test.py # With NLTK.
    
    real    0m3.166s
    user    0m2.274s
    sys     0m0.528s
    
    $ time python ngram-native-test.py 
    
    real    0m1.521s
    user    0m1.317s
    sys     0m0.145s
    
    $ time julia ngram-test.jl 
    
    real    0m3.573s
    user    0m3.188s
    sys     0m0.306s
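
    The `time` figures above cover the whole process, including interpreter startup, imports and reading the file. To time only the counting step, something like the following sketch could be used. `zip_ngrams` and `time_count` are hypothetical names, and the sample sentence stands in for the real corpus.

```python
import time
from collections import Counter


def zip_ngrams(tokens, n=2):
    """Same zip-based n-gram generator as the plain Python snippet."""
    return zip(*[tokens[i:] for i in range(n)])


def time_count(tokens, n=2, k=10):
    """Return (top k n-grams, seconds spent counting) for a token list."""
    start = time.perf_counter()
    top = Counter(zip_ngrams(tokens, n)).most_common(k)
    elapsed = time.perf_counter() - start
    return top, elapsed


if __name__ == "__main__":
    # Stand-in data; for the real measurement, use open("big.txt").read().split()
    tokens = "the cat sat on the mat and the cat ran".split()
    top, elapsed = time_count(tokens, n=2, k=1)
    print(top, f"{elapsed:.6f}s")
```

    A single run is noisy, so repeating the measurement a few times and taking the minimum gives a steadier number.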
    
