Pass tokens to CountVectorizer

天涯浪人 2021-02-13 22:27

I have a text classification problem with two types of features:

features which are n-grams (extracted by CountVectorizer)
other textual features

3 Answers

甜味超标 (original poster)

2021-02-13 22:39

Similar to user126350's answer, but even simpler, here's what I did.

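from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

def do_nothing(tokens):
    # Identity function: the documents are already tokenized upstream.
    return tokens

pipe = Pipeline([
    # MyCustomTokenizer is your own transformer: strings in, lists of tokens out.
    ('tokenizer', MyCustomTokenizer()),
    ('vect', CountVectorizer(tokenizer=do_nothing,
                             preprocessor=None,
                             lowercase=False))
])

# Use fit_transform rather than transform: the vocabulary must be learned first.
doc_vects = pipe.fit_transform(my_docs)  # pass list of documents as strings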
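
If you want something you can run end to end, here is a minimal sketch. WhitespaceTokenizer is a hypothetical stand-in for MyCustomTokenizer (which isn't shown above), and the two sample documents are made up for illustration. I also pass token_pattern=None, on the assumption that you are using a scikit-learn version that warns about an unused token_pattern when a custom tokenizer is supplied.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline


class WhitespaceTokenizer(TransformerMixin, BaseEstimator):
    """Hypothetical stand-in for MyCustomTokenizer: splits on whitespace."""

    def fit(self, X, y=None):
        return self  # stateless, nothing to learn

    def transform(self, X):
        # One list of tokens per input document.
        return [doc.split() for doc in X]


def do_nothing(tokens):
    # Identity function: the tokens are passed through unchanged.
    return tokens


pipe = Pipeline([
    ("tokenizer", WhitespaceTokenizer()),
    ("vect", CountVectorizer(tokenizer=do_nothing,
                             preprocessor=None,
                             lowercase=False,
                             token_pattern=None)),  # unused with a custom tokenizer
])

my_docs = ["the cat sat", "the dog sat down"]  # illustrative data
doc_vects = pipe.fit_transform(my_docs)        # sparse document-term matrix
vocab = pipe.named_steps["vect"].vocabulary_   # token -> column index
```

Because lowercase=False and no preprocessor is set, CountVectorizer never calls a string method on the token lists, so each list reaches do_nothing untouched.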
