I have a text classification problem where I have two types of features:
In general, you can pass a custom tokenizer parameter to CountVectorizer. The tokenizer should be a function that takes a string and returns a list of its tokens. However, if you already have your tokens in lists, you can build a dictionary that maps an arbitrary key to each token list and have your tokenizer look the key up in that dictionary. Then, when you run CountVectorizer, pass the dictionary keys as the documents. For example:
from sklearn.feature_extraction.text import CountVectorizer

# arbitrary token lists and their keys
custom_tokens = {"hello world": ["here", "is", "world"],
                 "it is possible": ["yes it", "is"]}

CV = CountVectorizer(
    # so we can pass it strings
    input='content',
    # turn off preprocessing of strings to avoid corrupting our keys
    lowercase=False,
    preprocessor=lambda x: x,
    # use our token dictionary: the "document" is a key, the tokens are looked up
    tokenizer=lambda key: custom_tokens[key])

# fit on the keys; each key stands in for its pre-tokenized document
CV.fit(custom_tokens.keys())
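
To make it clearer what the key-lookup trick produces, here is a rough standard-library sketch of the same idea. It is not scikit-learn's implementation: count_matrix is a hypothetical helper written only for illustration, and it assumes the vocabulary is simply the sorted set of all tokens, with each row holding the token counts for one key.

```python
from collections import Counter

# Same pre-tokenized data as in the snippet above: key -> token list.
custom_tokens = {"hello world": ["here", "is", "world"],
                 "it is possible": ["yes it", "is"]}


def tokenizer(key):
    """Look up the precomputed tokens for a key instead of splitting it."""
    return custom_tokens[key]


def count_matrix(keys):
    """Hypothetical helper: sorted vocabulary plus one count row per key."""
    keys = list(keys)
    # Collect every distinct token across all documents, in sorted order.
    vocabulary = sorted({tok for key in keys for tok in tokenizer(key)})
    rows = []
    for key in keys:
        counts = Counter(tokenizer(key))
        # Counter returns 0 for tokens that do not occur in this document.
        rows.append([counts[tok] for tok in vocabulary])
    return vocabulary, rows


vocabulary, rows = count_matrix(custom_tokens.keys())
print(vocabulary)  # ['here', 'is', 'world', 'yes it']
print(rows)        # [[1, 1, 1, 0], [0, 1, 0, 1]]
```

Note that "yes it" stays a single token even though it contains a space, because the key itself is never split. Only the looked-up list determines the tokens.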