I am working with a CountVectorizer from scikit-learn, and I'm possibly attempting to do some things that the object was not made for... but I'm not sure.
In terms of g
The parameter you want is called ngram_range. You pass in a tuple (1,2) to the constructor to get unigrams and bigrams. However, the vocabulary you pass in needs to be a dict with ngrams as keys and integers as values.
In [20]: print(CountVectorizer(vocabulary={'hi': 0, 'bye': 1, 'run away': 2}, ngram_range=(1, 2)).fit_transform(['I want to run away!']).toarray())
[[0 0 1]]
Note the default tokeniser removes the exclamation mark at the end, so the last token is away. If you want more control over how the string is broken up into tokens, follow @BrenBarn's comment.
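To see why the bigram column gets a count of 1, here is a rough stdlib-only sketch of what happens to the string before counting. It is not scikit-learn's actual implementation: the helper names ngrams and count_with_vocabulary are made up for illustration, and it assumes the default settings (lowercasing on, default token pattern of two or more word characters, no stop words).

```python
import re
from collections import Counter

# Same regex as scikit-learn's default token_pattern: tokens of 2+ word characters.
# Punctuation and single-character tokens such as "I" are dropped.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def ngrams(text, ngram_range=(1, 2)):
    """Lowercase, tokenize, and return every n-gram joined by single spaces."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    lo, hi = ngram_range
    out = []
    for n in range(lo, hi + 1):
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out


def count_with_vocabulary(text, vocabulary, ngram_range=(1, 2)):
    """Build one count row; each column index comes from the vocabulary's values."""
    counts = Counter(ngrams(text, ngram_range))
    row = [0] * len(vocabulary)
    for term, index in vocabulary.items():
        row[index] = counts[term]
    return row


vocabulary = {"hi": 0, "bye": 1, "run away": 2}
print(ngrams("I want to run away!"))
row = count_with_vocabulary("I want to run away!", vocabulary)
print(row)  # [0, 0, 1]
```

The bigram key has to be written exactly as the tokens joined by a single space ('run away'), since that is the string the counts are looked up by. Any n-gram that is not a key in the dict is simply ignored.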