Pass tokens to CountVectorizer

天涯浪人 2021-02-13 22:27

I have a text classification problem where I have two types of features:

  • features which are n-grams (extracted by CountVectorizer)
  • other textual features
3 answers
  •  清酒与你
    2021-02-13 22:50

    In general, you can pass a custom tokenizer parameter to CountVectorizer. The tokenizer should be a function that takes a string and returns a list of its tokens. However, if you already have your tokens in lists, you can build a dictionary that maps an arbitrary string key to each token list and have your tokenizer look the key up in that dictionary. Then, when you run CountVectorizer, pass the dictionary keys as the documents. For example,

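     from sklearn.feature_extraction.text import CountVectorizer

     # arbitrary token lists and their keys (the keys only serve as lookup handles)
     custom_tokens = {"hello world": ["here", "is", "world"],
                      "it is possible": ["yes it", "is"]}

     CV = CountVectorizer(
          # so we can pass it strings
          input='content',
          # turn off preprocessing of strings to avoid corrupting our keys
          lowercase=False,
          preprocessor=lambda x: x,
          # use our token dictionary
          tokenizer=lambda key: custom_tokens[key],
          # token_pattern is unused when a tokenizer is given; None avoids a warning
          token_pattern=None)

     # each key is tokenized by looking up its precomputed token list
     CV.fit(custom_tokens.keys())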
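
To check what the fitted vectorizer actually produces, the dictionary approach above can be run end to end as in the sketch below. The keys `doc_a` and `doc_b` are hypothetical placeholders, not part of the original answer. The last few lines show a different technique, named plainly: passing a callable `analyzer` that returns each document unchanged, which appears to let you hand over the token lists directly. Treat it as an assumption to verify against your scikit-learn version rather than as the answer's method.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical pre-tokenized documents, keyed by arbitrary strings.
custom_tokens = {
    "doc_a": ["here", "is", "world"],
    "doc_b": ["yes it", "is"],
}

# Dictionary approach: the "documents" are keys, the tokenizer is a lookup.
cv = CountVectorizer(
    lowercase=False,                          # do not alter the keys
    preprocessor=lambda key: key,             # identity preprocessing
    tokenizer=lambda key: custom_tokens[key],
    token_pattern=None,                       # unused when tokenizer is set
)
X = cv.fit_transform(list(custom_tokens))

# Vocabulary terms ordered by column index, and the dense count matrix.
vocab = sorted(cv.vocabulary_, key=cv.vocabulary_.get)
counts = X.toarray().tolist()

# Alternative sketch (a swapped-in technique): a callable analyzer receives
# each document as-is, so the token lists themselves are the documents.
cv_direct = CountVectorizer(analyzer=lambda tokens: tokens)
X_direct = cv_direct.fit_transform(list(custom_tokens.values()))
counts_direct = X_direct.toarray().tolist()
```

Note that a multi-word token such as "yes it" is kept as a single vocabulary entry in both variants, because no regular-expression tokenization is applied to it.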
    
