I am working with a CountVectorizer from scikit-learn, and I may be trying to do something the object was not made for, but I'm not sure.
In terms of getting counts for occurrence:
from sklearn.feature_extraction.text import CountVectorizer
vocabulary = ['hi', 'bye', 'run away!']
corpus = ['run away!']
cv = CountVectorizer(vocabulary=vocabulary)
X = cv.fit_transform(corpus)
print(X.toarray())
gives:
[[0 0 0]]
What I'm realizing is that the CountVectorizer breaks the corpus into what I believe are unigrams:
vocabulary = ['hi', 'bye', 'run']
corpus = ['run away!']
cv = CountVectorizer(vocabulary=vocabulary)
X = cv.fit_transform(corpus)
print(X.toarray())
which gives:
[[0 0 1]]
Is there any way to tell the CountVectorizer exactly how you'd like to vectorize the corpus? Ideally I would like an outcome along the lines of the first example.
In all honesty, however, I'm wondering if it is at all possible to get an outcome along these lines:
vocabulary = ['hi', 'bye', 'run away!']
corpus = ['I want to run away!']
cv = CountVectorizer(vocabulary=vocabulary)
X = cv.fit_transform(corpus)
print(X.toarray())
[[0 0 1]]
I don't see much information in the documentation for the fit_transform method, which only takes one argument as it is. If anyone has any ideas I would be grateful. Thanks!
The parameter you want is called ngram_range. Pass the tuple (1, 2) to the constructor to get unigrams and bigrams. The vocabulary entries must also match the n-grams the analyzer actually produces; it can be given as a dict with n-grams as keys and integer column indices as values.
In [20]: print(CountVectorizer(vocabulary={'hi': 0, 'bye': 1, 'run away': 2}, ngram_range=(1, 2)).fit_transform(['I want to run away!']).toarray())
[[0 0 1]]
Note that the default tokeniser removes the exclamation mark at the end, so the last token is away. If you want more control over how the string is broken up into tokens, follow @BrenBarn's comment.
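As a rough illustration of controlling tokenisation, the sketch below first inspects what the default analyzer produces, then swaps in a custom token_pattern so that punctuation stays attached to the word. The pattern r"(?u)\S+" (split on whitespace only) is an assumption chosen for this example, not something taken from the comment mentioned above; other patterns, or a custom tokenizer callable, may suit your data better.

```python
from sklearn.feature_extraction.text import CountVectorizer

vocabulary = ['hi', 'bye', 'run away!']
corpus = ['I want to run away!']

# Inspect the n-grams produced by the default analyzer. The default
# token_pattern keeps only runs of two or more word characters, so
# "I" and the trailing "!" are dropped.
default_cv = CountVectorizer(ngram_range=(1, 2))
default_tokens = default_cv.build_analyzer()(corpus[0])
print(default_tokens)

# Assumed alternative: treat every whitespace-delimited chunk as a token,
# so "away!" keeps its exclamation mark and "run away!" becomes a bigram.
custom_cv = CountVectorizer(
    vocabulary=vocabulary,
    ngram_range=(1, 2),
    token_pattern=r"(?u)\S+",
)
custom_tokens = custom_cv.build_analyzer()(corpus[0])
print(custom_tokens)

X = custom_cv.fit_transform(corpus)
print(X.toarray())  # [[0 0 1]]
```

Calling build_analyzer() like this is a convenient way to check whether your vocabulary entries can ever be matched before you vectorise a whole corpus.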
Source: https://stackoverflow.com/questions/24007812/can-i-control-the-way-the-countvectorizer-vectorizes-the-corpus-in-scikit-learn