I have a text classification problem where I have two types of features:
Similar to user126350's answer, but even simpler, here's what I did.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

def do_nothing(tokens):
    return tokens

pipe = Pipeline([
    ('tokenizer', MyCustomTokenizer()),
    ('vect', CountVectorizer(tokenizer=do_nothing,
                             preprocessor=None,
                             lowercase=False))
])

doc_vects = pipe.fit_transform(my_docs)  # pass a list of documents as strings
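Here MyCustomTokenizer stands in for whatever tokenizing step you already have; it just needs to be a transformer that turns raw strings into token lists. A minimal sketch (assuming a plain whitespace split) could look like:

from sklearn.base import BaseEstimator, TransformerMixin

class MyCustomTokenizer(BaseEstimator, TransformerMixin):
    # stateless step: nothing to learn during fit
    def fit(self, X, y=None):
        return self

    # turn each raw string into a list of tokens
    def transform(self, X):
        return [doc.split() for doc in X]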
Summarizing the answers from @user126350 and @miroli, plus this link:
from sklearn.feature_extraction.text import CountVectorizer

def dummy(doc):
    return doc

cv = CountVectorizer(
    tokenizer=dummy,
    preprocessor=dummy,
)

docs = [
    ['hello', 'world', '.'],
    ['hello', 'world'],
    ['again', 'hello', 'world']
]

cv.fit(docs)
cv.get_feature_names()  # use get_feature_names_out() on scikit-learn >= 1.0
# ['.', 'again', 'hello', 'world']
The one thing to keep in mind is to wrap a new tokenized document in a list before calling transform(), so that it is handled as a single document rather than each of its tokens being treated as a separate document:
new_doc = ['again', 'hello', 'world', '.']
v_1 = cv.transform(new_doc)
v_2 = cv.transform([new_doc])
v_1.shape
# (4, 4)
v_2.shape
# (1, 4)
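To double-check that the wrapped document is counted as intended, inspect the dense vector (columns follow the alphabetical vocabulary order shown above):

v_2.toarray()
# [[1 1 1 1]]  -> one count each for '.', 'again', 'hello', 'world'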
In general, you can pass a custom tokenizer parameter to CountVectorizer. The tokenizer should be a function that takes a string and returns an array of its tokens. However, if you already have your tokens in arrays, you can simply make a dictionary of the token arrays with some arbitrary key and have your tokenizer return from that dictionary. Then, when you run CountVectorizer, just pass your dictionary keys. For example,
# arbitrary token arrays and their keys
custom_tokens = {"hello world": ["here", "is", "world"],
                 "it is possible": ["yes it", "is"]}

CV = CountVectorizer(
    # so we can pass it strings
    input='content',
    # turn off preprocessing of strings to avoid corrupting our keys
    lowercase=False,
    preprocessor=lambda x: x,
    # use our token dictionary
    tokenizer=lambda key: custom_tokens[key])

CV.fit(custom_tokens.keys())
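The fit call only learns the vocabulary; to actually vectorize, pass the same dictionary keys to transform(). A short usage sketch, with the output I would expect (columns in alphabetical token order):

# each key is looked up in custom_tokens and its token array is counted
X = CV.transform(["hello world", "it is possible"])

sorted(CV.vocabulary_)   # ['here', 'is', 'world', 'yes it']
X.toarray()
# [[1 1 1 0]
#  [0 1 0 1]]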