Pass tokens to CountVectorizer

Question


I have a text classification problem where I have two types of features:

  • features which are n-grams (extracted by CountVectorizer)
  • other textual features (e.g., the presence of a word from a given lexicon). These features are different from the n-grams, since they would be part of any n-gram extracted from the text.

Both types of features are extracted from the text's tokens. I want to run tokenization only once, and then pass these tokens to CountVectorizer and to the other presence-feature extractor. So I want to pass a list of tokens to CountVectorizer, but it only accepts a string as the representation of a sample. Is there a way to pass an array of tokens?
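For reference, here is a minimal sketch of the intended setup, combining the dummy-tokenizer trick from the answers below with a simple presence extractor; lexicon, tokenize, and the sample documents are hypothetical:

from sklearn.feature_extraction.text import CountVectorizer

# hypothetical lexicon for the presence features
lexicon = {"hello", "world"}

def tokenize(text):
    # stand-in for a real tokenizer
    return text.split()

docs = ["hello world again", "again and again"]
tokens = [tokenize(d) for d in docs]  # tokenize only once

# n-gram counts from the pre-tokenized documents
cv = CountVectorizer(tokenizer=lambda x: x, preprocessor=lambda x: x)
ngram_features = cv.fit_transform(tokens)

# lexicon-presence features computed from the same tokens
presence_features = [[int(w in toks) for w in sorted(lexicon)] for toks in tokens]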


Answer 1:


Summarizing the answers of @user126350 and @miroli and this link:

from sklearn.feature_extraction.text import CountVectorizer

def dummy(doc):
    # identity function: the documents are already tokenized
    return doc

cv = CountVectorizer(
    tokenizer=dummy,
    preprocessor=dummy,
)  

docs = [
    ['hello', 'world', '.'],
    ['hello', 'world'],
    ['again', 'hello', 'world']
]

cv.fit(docs)
cv.get_feature_names()  # on scikit-learn >= 1.0, use get_feature_names_out()
# ['.', 'again', 'hello', 'world']

The one thing to keep in mind is to wrap the new tokenized document in a list before calling the transform() function, so that it is handled as a single document rather than each token being interpreted as a separate document:

new_doc = ['again', 'hello', 'world', '.']
v_1 = cv.transform(new_doc)    # treated as four one-token documents
v_2 = cv.transform([new_doc])  # treated as a single document

v_1.shape
# (4, 4)

v_2.shape
# (1, 4)



Answer 2:


In general, you can pass a custom tokenizer parameter to CountVectorizer. The tokenizer should be a function that takes a string and returns an array of its tokens. However, if you already have your tokens in arrays, you can simply make a dictionary of the token arrays keyed by arbitrary strings, and have your tokenizer look them up in that dictionary. Then, when you run CountVectorizer, just pass the dictionary keys. For example:

from sklearn.feature_extraction.text import CountVectorizer

# arbitrary token arrays and their keys
custom_tokens = {"hello world": ["here", "is", "world"],
                 "it is possible": ["yes it", "is"]}

CV = CountVectorizer(
    # so we can pass it strings
    input='content',
    # turn off preprocessing of strings to avoid corrupting our keys
    lowercase=False,
    preprocessor=lambda x: x,
    # use our token dictionary
    tokenizer=lambda key: custom_tokens[key])

CV.fit(custom_tokens.keys())
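To vectorize afterwards, pass the same dictionary keys to transform(); a short usage sketch, assuming the keys defined above:

X = CV.transform(["hello world", "it is possible"])  # keys, not raw text
print(X.toarray())  # each row counts the tokens stored under that key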



Answer 3:


Similar to user126350's answer, but even simpler, here's what I did.

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer

def do_nothing(tokens):
    return tokens

pipe = Pipeline([
    ('tokenizer', MyCustomTokenizer()),  # your own string-to-tokens transformer
    ('vect', CountVectorizer(tokenizer=do_nothing,
                             preprocessor=None,
                             lowercase=False))
])

doc_vects = pipe.fit_transform(my_docs)  # pass a list of documents as strings
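MyCustomTokenizer is not defined in the answer; a minimal sketch of such a transformer, assuming a plain whitespace split, might look like this:

from sklearn.base import BaseEstimator, TransformerMixin

class MyCustomTokenizer(BaseEstimator, TransformerMixin):
    # turn raw strings into token lists; whitespace split is a stand-in
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [doc.split() for doc in X]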


Source: https://stackoverflow.com/questions/35867484/pass-tokens-to-countvectorizer
