I have a text classification problem where I have two types of features:
Similar to user126350's answer, but even simpler, here's what I did.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer

def do_nothing(tokens):
    # Identity "tokenizer": the documents are already tokenized upstream.
    return tokens

pipe = Pipeline([
    ('tokenizer', MyCustomTokenizer()),          # your own tokenizer transformer
    ('vect', CountVectorizer(tokenizer=do_nothing,
                             preprocessor=None,
                             lowercase=False)),
])

doc_vects = pipe.fit_transform(my_docs)  # pass a list of documents as strings; fit learns the vocabulary
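In case it helps, here is a minimal sketch of what the custom tokenizer step and the downstream usage could look like. The internals of MyCustomTokenizer aren't shown above, so the whitespace-splitting version below is purely hypothetical, as is the LogisticRegression classifier and the labels variable; any transformer whose transform() turns a list of raw strings into a list of token lists will work.

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression

class MyCustomTokenizer(BaseEstimator, TransformerMixin):
    # Hypothetical stand-in: splits each raw document into tokens on whitespace.
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [doc.split() for doc in X]

# The count matrix from the pipeline can then feed any scikit-learn estimator,
# e.g. a logistic regression ('labels' would be your class labels for my_docs).
clf = LogisticRegression()
clf.fit(doc_vects, labels)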