I needed to compute unigrams, bigrams and trigrams for a text file containing text like:
"Cystic fibrosis affects 30,000 children and young adults in the US a
A short Pythonesque solution from this blog:
def find_ngrams(input_list, n):
    return zip(*[input_list[i:] for i in range(n)])
Usage:
>>> input_list = ['all', 'this', 'happened', 'more', 'or', 'less']
>>> find_ngrams(input_list, 1)
[('all',), ('this',), ('happened',), ('more',), ('or',), ('less',)]
>>> find_ngrams(input_list, 2)
[('all', 'this'), ('this', 'happened'), ('happened', 'more'), ('more', 'or'), ('or', 'less')]
>>> find_ngrams(input_list, 3)
[('all', 'this', 'happened'), ('this', 'happened', 'more'), ('happened', 'more', 'or'), ('more', 'or', 'less')]
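One caveat: on Python 3, zip() returns a lazy iterator rather than a list, so to get the lists shown above you need to materialize the result:

```python
def find_ngrams(input_list, n):
    # list(...) forces the lazy zip iterator so the result prints as above
    return list(zip(*[input_list[i:] for i in range(n)]))

input_list = ['all', 'this', 'happened', 'more', 'or', 'less']
print(find_ngrams(input_list, 2))
# [('all', 'this'), ('this', 'happened'), ('happened', 'more'), ('more', 'or'), ('or', 'less')]
```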
Though the post is old, I thought I'd mention my answer here so that most of the n-gram creation logic can be found in one post.
There is a Python library called TextBlob. It creates n-grams very easily, similar to NLTK.
Below is the code snippet with its output for easy understanding.
sent = """This is to show the usage of Text Blob in Python"""
blob = TextBlob(sent)
unigrams = blob.ngrams(n=1)
bigrams = blob.ngrams(n=2)
trigrams = blob.ngrams(n=3)
And the output is :
unigrams
[WordList(['This']),
WordList(['is']),
WordList(['to']),
WordList(['show']),
WordList(['the']),
WordList(['usage']),
WordList(['of']),
WordList(['Text']),
WordList(['Blob']),
WordList(['in']),
WordList(['Python'])]
bigrams
[WordList(['This', 'is']),
WordList(['is', 'to']),
WordList(['to', 'show']),
WordList(['show', 'the']),
WordList(['the', 'usage']),
WordList(['usage', 'of']),
WordList(['of', 'Text']),
WordList(['Text', 'Blob']),
WordList(['Blob', 'in']),
WordList(['in', 'Python'])]
trigrams
[WordList(['This', 'is', 'to']),
WordList(['is', 'to', 'show']),
WordList(['to', 'show', 'the']),
WordList(['show', 'the', 'usage']),
WordList(['the', 'usage', 'of']),
WordList(['usage', 'of', 'Text']),
WordList(['of', 'Text', 'Blob']),
WordList(['Text', 'Blob', 'in']),
WordList(['Blob', 'in', 'Python'])]
As simple as that.
There is a lot more that TextBlob can do. Please have a look at the docs for more details - https://textblob.readthedocs.io/en/dev/
There is one more interesting module in Python called scikit-learn. The following code will help you get all the n-grams in a given range.
from sklearn.feature_extraction.text import CountVectorizer
text = "this is a foo bar sentences and i want to ngramize it"
vectorizer = CountVectorizer(ngram_range=(1,6))
analyzer = vectorizer.build_analyzer()
print(analyzer(text))
Output is
['this', 'is', 'foo', 'bar', 'sentences', 'and', 'want', 'to', 'ngramize', 'it', 'this is', 'is foo', 'foo bar', 'bar sentences', 'sentences and', 'and want', 'want to', 'to ngramize', 'ngramize it', 'this is foo', 'is foo bar', 'foo bar sentences', 'bar sentences and', 'sentences and want', 'and want to', 'want to ngramize', 'to ngramize it', 'this is foo bar', 'is foo bar sentences', 'foo bar sentences and', 'bar sentences and want', 'sentences and want to', 'and want to ngramize', 'want to ngramize it', 'this is foo bar sentences', 'is foo bar sentences and', 'foo bar sentences and want', 'bar sentences and want to', 'sentences and want to ngramize', 'and want to ngramize it', 'this is foo bar sentences and', 'is foo bar sentences and want', 'foo bar sentences and want to', 'bar sentences and want to ngramize', 'sentences and want to ngramize it']
This gives all the n-grams in the range 1 to 6, using scikit-learn's CountVectorizer class.
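If you only need the token combinations and not the rest of the vectorizer machinery, the same range-of-n idea can be sketched in plain Python (ngrams_in_range is my own helper name, not part of scikit-learn):

```python
def ngrams_in_range(tokens, lo, hi):
    # Emulates the spirit of CountVectorizer's ngram_range=(lo, hi):
    # emit every contiguous run of lo..hi tokens as a space-joined string.
    out = []
    for n in range(lo, hi + 1):
        for i in range(len(tokens) - n + 1):
            out.append(' '.join(tokens[i:i+n]))
    return out

tokens = "this is a foo".split()
print(ngrams_in_range(tokens, 1, 2))
# ['this', 'is', 'a', 'foo', 'this is', 'is a', 'a foo']
```

Note that this sketch keeps every token, while CountVectorizer's default tokenizer lowercases and drops single-character tokens (which is why 'a' and 'i' are missing from the output above).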
Use NLTK (the Natural Language Toolkit): tokenize (split) your text into a list of words, then use the built-in functions to find bigrams and trigrams.
import nltk

words = nltk.word_tokenize(my_text)
# nltk.bigrams and nltk.trigrams return generators, so wrap them in list() to inspect them
my_bigrams = list(nltk.bigrams(words))
my_trigrams = list(nltk.trigrams(words))
Assuming the input is a string containing space-separated words, like x = "a b c d", you can use the following function (edit: see the last function for a possibly more complete solution):
def ngrams(input, n):
    input = input.split(' ')
    output = []
    for i in range(len(input)-n+1):
        output.append(input[i:i+n])
    return output
ngrams('a b c d', 2) # [['a', 'b'], ['b', 'c'], ['c', 'd']]
If you want those joined back into strings, you might call something like:
[' '.join(x) for x in ngrams('a b c d', 2)] # ['a b', 'b c', 'c d']
Lastly, that doesn't summarize things into totals, so if your input was 'a a a a', you need to count them up into a dict:
grams = {}
for g in (' '.join(x) for x in ngrams(input, 2)):
    grams.setdefault(g, 0)
    grams[g] += 1
Putting that all together into one final function gives:
def ngrams(input, n):
    input = input.split(' ')
    output = {}
    for i in range(len(input)-n+1):
        g = ' '.join(input[i:i+n])
        output.setdefault(g, 0)
        output[g] += 1
    return output
ngrams('a a a a', 2) # {'a a': 3}
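The counting step can also lean on the standard library: collections.Counter does the setdefault dance for you. This is an equivalent rewrite of the function above, not new behaviour (ngram_counts is my own name for it):

```python
from collections import Counter

def ngram_counts(text, n):
    # Counter consumes the generator of joined n-grams and tallies them
    words = text.split(' ')
    return Counter(' '.join(words[i:i+n]) for i in range(len(words) - n + 1))

print(ngram_counts('a a a a', 2))  # Counter({'a a': 3})
```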
Using collections.deque:
from collections import deque
from itertools import islice

def ngrams(message, n=1):
    it = iter(message.split())
    window = deque(islice(it, n), maxlen=n)
    yield tuple(window)
    for item in it:
        window.append(item)
        yield tuple(window)
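Since this version is a generator, you materialize it with list(). For example (the sliding-window function is repeated here so the snippet runs standalone):

```python
from collections import deque
from itertools import islice

def ngrams(message, n=1):
    it = iter(message.split())
    # seed the window with the first n tokens; maxlen makes appends slide it
    window = deque(islice(it, n), maxlen=n)
    yield tuple(window)
    for item in it:
        window.append(item)
        yield tuple(window)

print(list(ngrams("all this happened more or less", 3)))
# [('all', 'this', 'happened'), ('this', 'happened', 'more'),
#  ('happened', 'more', 'or'), ('more', 'or', 'less')]
```

One quirk worth knowing: for a message shorter than n (including the empty string) it still yields one short tuple, since the first yield happens unconditionally.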
...or maybe you could do it in one line as a list comprehension (note the + 1 in the range, without which the final n-gram is dropped):
n = 2
message = "Hello, how are you?".split()
myNgrams = [message[i:i+n] for i in range(len(message) - n + 1)]