Computing N-Grams Using Python

Backend · Open · 8 answers · 1716 views
情歌与酒 2020-11-28 06:02

I need to compute the unigrams, bigrams and trigrams for a text file containing text like:

"Cystic fibrosis affects 30,000 children and young adults in the US a
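
All three reduce to sliding a fixed-size window over the token list and tallying the windows; a minimal standard-library sketch (the `ngram_counts` helper name is mine, purely illustrative, and the text is tokenized naively on whitespace):

```python
from collections import Counter

def ngram_counts(tokens, n):
    # Slide a window of length n over the tokens and tally each tuple.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

text = "Cystic fibrosis affects 30,000 children and young adults in the US"
tokens = text.split()
unigrams = ngram_counts(tokens, 1)   # 11 tokens -> 11 unigrams
bigrams = ngram_counts(tokens, 2)    # 10 bigrams
trigrams = ngram_counts(tokens, 3)   # 9 trigrams
```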

8 Answers
  • 2020-11-28 06:16

    A short Pythonesque solution from this blog:

    def find_ngrams(input_list, n):
        # list() is needed on Python 3, where zip returns a lazy iterator
        return list(zip(*[input_list[i:] for i in range(n)]))
    

    Usage:

    >>> input_list = ['all', 'this', 'happened', 'more', 'or', 'less']
    >>> find_ngrams(input_list, 1)
    [('all',), ('this',), ('happened',), ('more',), ('or',), ('less',)]
    >>> find_ngrams(input_list, 2)
    [('all', 'this'), ('this', 'happened'), ('happened', 'more'), ('more', 'or'), ('or', 'less')]
    >>> find_ngrams(input_list, 3)
    [('all', 'this', 'happened'), ('this', 'happened', 'more'), ('happened', 'more', 'or'), ('more', 'or', 'less')]
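
If frequencies are needed as well, the tuples feed straight into collections.Counter (the function is repeated here, with list() added for Python 3, so the snippet runs on its own):

```python
from collections import Counter

def find_ngrams(input_list, n):
    # Same zip-based trick as above; list() materializes the Python 3 iterator.
    return list(zip(*[input_list[i:] for i in range(n)]))

input_list = ['all', 'this', 'happened', 'more', 'or', 'less']
counts = Counter(find_ngrams(input_list, 2))
# counts[('all', 'this')] == 1
```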
    
  • 2020-11-28 06:19

    Though the post is old, I thought I'd add my answer here so that most of the n-gram creation logic is collected in one place.

    There is a Python library called TextBlob. It creates n-grams very easily, similar to NLTK.

    Below is the code snippet with its output for easy understanding.

    from textblob import TextBlob

    sent = """This is to show the usage of Text Blob in Python"""
    blob = TextBlob(sent)
    unigrams = blob.ngrams(n=1)
    bigrams = blob.ngrams(n=2)
    trigrams = blob.ngrams(n=3)
    

    And the output is:

    unigrams
    [WordList(['This']),
     WordList(['is']),
     WordList(['to']),
     WordList(['show']),
     WordList(['the']),
     WordList(['usage']),
     WordList(['of']),
     WordList(['Text']),
     WordList(['Blob']),
     WordList(['in']),
     WordList(['Python'])]
    
    bigrams
    [WordList(['This', 'is']),
     WordList(['is', 'to']),
     WordList(['to', 'show']),
     WordList(['show', 'the']),
     WordList(['the', 'usage']),
     WordList(['usage', 'of']),
     WordList(['of', 'Text']),
     WordList(['Text', 'Blob']),
     WordList(['Blob', 'in']),
     WordList(['in', 'Python'])]
    
    trigrams
    [WordList(['This', 'is', 'to']),
     WordList(['is', 'to', 'show']),
     WordList(['to', 'show', 'the']),
     WordList(['show', 'the', 'usage']),
     WordList(['the', 'usage', 'of']),
     WordList(['usage', 'of', 'Text']),
     WordList(['of', 'Text', 'Blob']),
     WordList(['Text', 'Blob', 'in']),
     WordList(['Blob', 'in', 'Python'])]
    

    As simple as that.

    TextBlob can do a lot more than this. Please have a look at the docs for more details - https://textblob.readthedocs.io/en/dev/

  • 2020-11-28 06:23

    There is another interesting Python library called scikit-learn. The code below will get you all the n-grams in a given range.

    from sklearn.feature_extraction.text import CountVectorizer

    text = "this is a foo bar sentences and i want to ngramize it"
    vectorizer = CountVectorizer(ngram_range=(1, 6))
    analyzer = vectorizer.build_analyzer()
    print(analyzer(text))
    

    Output is

    ['this', 'is', 'foo', 'bar', 'sentences', 'and', 'want', 'to', 'ngramize', 'it', 'this is', 'is foo', 'foo bar', 'bar sentences', 'sentences and', 'and want', 'want to', 'to ngramize', 'ngramize it', 'this is foo', 'is foo bar', 'foo bar sentences', 'bar sentences and', 'sentences and want', 'and want to', 'want to ngramize', 'to ngramize it', 'this is foo bar', 'is foo bar sentences', 'foo bar sentences and', 'bar sentences and want', 'sentences and want to', 'and want to ngramize', 'want to ngramize it', 'this is foo bar sentences', 'is foo bar sentences and', 'foo bar sentences and want', 'bar sentences and want to', 'sentences and want to ngramize', 'and want to ngramize it', 'this is foo bar sentences and', 'is foo bar sentences and want', 'foo bar sentences and want to', 'bar sentences and want to ngramize', 'sentences and want to ngramize it']
    

    This gives all the n-grams in the range 1 to 6, using scikit-learn's CountVectorizer class.

  • 2020-11-28 06:26

    Use NLTK (the Natural Language Toolkit): tokenize (split) your text into a list of words, then use its functions to find the bigrams and trigrams.

    import nltk

    words = nltk.word_tokenize(my_text)
    my_bigrams = list(nltk.bigrams(words))    # list(), since these return lazy generators in NLTK 3
    my_trigrams = list(nltk.trigrams(words))
    
  • 2020-11-28 06:29

    Assuming the input is a string containing space-separated words, like x = "a b c d", you can use the following function (edit: see the last function in this answer for a possibly more complete solution):

    def ngrams(input, n):
        input = input.split(' ')
        output = []
        for i in range(len(input)-n+1):
            output.append(input[i:i+n])
        return output
    
    ngrams('a b c d', 2) # [['a', 'b'], ['b', 'c'], ['c', 'd']]
    

    If you want those joined back into strings, you might call something like:

    [' '.join(x) for x in ngrams('a b c d', 2)] # ['a b', 'b c', 'c d']
    

    Lastly, that doesn't summarize things into totals, so if your input was 'a a a a', you need to count them up into a dict:

    grams = {}
    for g in (' '.join(x) for x in ngrams(input, 2)):
        grams.setdefault(g, 0)
        grams[g] += 1
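
Equivalently, collections.Counter does the setdefault bookkeeping in one call (a self-contained sketch, with the string-joining variant of ngrams inlined):

```python
from collections import Counter

def ngrams(input, n):
    # Same splitter as above, joining each window back into a string.
    words = input.split(' ')
    return [' '.join(words[i:i + n]) for i in range(len(words) - n + 1)]

grams = Counter(ngrams('a a a a', 2))
# grams == Counter({'a a': 3})
```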
    

    Putting that all together into one final function gives:

    def ngrams(input, n):
        input = input.split(' ')
        output = {}
        for i in range(len(input)-n+1):
            g = ' '.join(input[i:i+n])
            output.setdefault(g, 0)
            output[g] += 1
        return output
    
    ngrams('a a a a', 2) # {'a a': 3}
    
  • 2020-11-28 06:33

    Using collections.deque:

    from collections import deque
    from itertools import islice
    
    def ngrams(message, n=1):
        it = iter(message.split())
        window = deque(islice(it, n), maxlen=n)
        yield tuple(window)
        for item in it:
            window.append(item)
            yield tuple(window)
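
A quick usage sketch of the generator above (redefined so the snippet is self-contained):

```python
from collections import deque
from itertools import islice

def ngrams(message, n=1):
    # Keep a sliding window in a deque; maxlen evicts the oldest token.
    it = iter(message.split())
    window = deque(islice(it, n), maxlen=n)
    yield tuple(window)
    for item in it:
        window.append(item)
        yield tuple(window)

print(list(ngrams("Hello, how are you?", 2)))
# [('Hello,', 'how'), ('how', 'are'), ('are', 'you?')]
```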
    

    ...or maybe you could do it in one line as a list comprehension:

    n = 2
    message = "Hello, how are you?".split()
    myNgrams = [message[i:i+n] for i in range(len(message) - n + 1)]
    