Pass tokens to CountVectorizer

天涯浪人 2021-02-13 22:27

I have a text classification problem with two types of features:

- features which are n-grams (extracted by CountVectorizer)
- other textual features

3 answers

清酒与你 (original poster)

2021-02-13 22:50
In general, you can pass a custom tokenizer parameter to CountVectorizer. The tokenizer should be a function that takes a string and returns a list of its tokens. However, if your documents are already tokenized, you can store the token lists in a dictionary under arbitrary string keys and have the tokenizer look each key up in that dictionary. Then, when you fit CountVectorizer, pass the dictionary keys in place of the raw documents. For example:
```
from sklearn.feature_extraction.text import CountVectorizer

# arbitrary token arrays and their keys
custom_tokens = {"hello world": ["here", "is", "world"],
                 "it is possible": ["yes it", "is"]}

CV = CountVectorizer(
    # so we can pass it strings
    input='content',
    # turn off preprocessing of strings to avoid corrupting our keys
    lowercase=False,
    preprocessor=lambda x: x,
    # use our token dictionary: each "document" is a key, looked up here
    tokenizer=lambda key: custom_tokens[key])

# the keys stand in for the documents
CV.fit(custom_tokens.keys())
```
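To check that the dictionary lookup behaves as described, here is a minimal sketch that fits the same kind of vectorizer and inspects the result. The names `vectorizer`, `keys`, `vocabulary`, and `dense` are illustrative only, and `token_pattern=None` is an addition not in the answer above; it is there on the assumption that your scikit-learn version warns about an unused `token_pattern` when a tokenizer is supplied.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Pre-tokenized documents stored under arbitrary string keys
# (same toy data as in the answer above).
custom_tokens = {
    "hello world": ["here", "is", "world"],
    "it is possible": ["yes it", "is"],
}

vectorizer = CountVectorizer(
    lowercase=False,                           # leave the keys untouched
    preprocessor=lambda key: key,              # identity: no string cleanup
    tokenizer=lambda key: custom_tokens[key],  # look the tokens up by key
    token_pattern=None,                        # assumption: unused when a tokenizer is given
)

# The keys stand in for the raw documents.
keys = list(custom_tokens)
counts = vectorizer.fit_transform(keys)

# Vocabulary terms in column order, and the dense document-term matrix.
vocabulary = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
dense = counts.toarray()

print(vocabulary)
print(dense)
```

The multi-word token "yes it" should come through as a single vocabulary entry, since the vectorizer only counts whatever the tokenizer returns and never re-splits it.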