Pass tokens to CountVectorizer

天涯浪人 2021-02-13 22:27

I have a text classification problem where I have two types of features:

  • features which are n-grams (extracted by CountVectorizer)
  • other textual features
3 Answers
  •  甜味超标
    2021-02-13 22:39

    Similar to user126350's answer, but even simpler, here's what I did.

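    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.pipeline import Pipeline

    def do_nothing(tokens):
        # Identity function: the documents reaching the vectorizer are already token lists.
        return tokens

    pipe = Pipeline([
        # MyCustomTokenizer is your own transformer that turns strings into lists of tokens.
        ('tokenizer', MyCustomTokenizer()),
        ('vect', CountVectorizer(tokenizer=do_nothing,
                                 preprocessor=None,
                                 lowercase=False))
    ])

    # Use fit_transform so the vocabulary is learned; transform alone fails on an unfitted pipeline.
    doc_vects = pipe.fit_transform(my_docs)  # pass list of documents as strings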
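
    For anyone who wants to run this end to end, here is a minimal self-contained sketch. The `WhitespaceTokenizer` class and the two sample documents are hypothetical stand-ins for `MyCustomTokenizer` and `my_docs`, which are not shown above; swap in your own. I also pass `token_pattern=None`, which as far as I know only silences the warning about the unused default pattern when a custom `tokenizer` is given.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline


class WhitespaceTokenizer(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for MyCustomTokenizer: splits each string on whitespace."""

    def fit(self, X, y=None):
        # Nothing to learn; the tokenizer is stateless.
        return self

    def transform(self, X):
        # Turn each document string into a list of tokens.
        return [doc.split() for doc in X]


def do_nothing(tokens):
    # Identity function: the input is already a list of tokens.
    return tokens


pipe = Pipeline([
    ("tokenizer", WhitespaceTokenizer()),
    ("vect", CountVectorizer(tokenizer=do_nothing,
                             preprocessor=None,
                             lowercase=False,
                             token_pattern=None)),
])

# Sample documents (assumed for illustration only).
my_docs = ["the cat sat", "the dog sat down"]

# Fit the vocabulary and build the document-term matrix in one step.
doc_vects = pipe.fit_transform(my_docs)
vocab = pipe.named_steps["vect"].vocabulary_
```

    `doc_vects` is a sparse document-term matrix with one row per document, and `vocab` maps each token to its column index. Because `lowercase=False` and the tokenizer is the identity, the tokens are counted exactly as the first pipeline step produced them.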
    
