Pass tokens to CountVectorizer

天涯浪人 2021-02-13 22:27

I have a text classification problem where I have two types of features:

  • features which are n-grams (extracted by CountVectorizer)
  • other textual features

How can I pass already-tokenized documents (lists of tokens) to CountVectorizer instead of raw strings?
3 Answers
  • 2021-02-13 22:39

    Similar to user126350's answer, but even simpler, here's what I did.

    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import CountVectorizer

    def do_nothing(tokens):
        return tokens

    pipe = Pipeline([
        ('tokenizer', MyCustomTokenizer()),  # your own tokenizing step (see sketch below)
        ('vect', CountVectorizer(tokenizer=do_nothing,
                                 preprocessor=None,
                                 lowercase=False))
    ])

    doc_vects = pipe.fit_transform(my_docs)  # pass a list of documents as strings
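
    MyCustomTokenizer is not defined in this answer; it stands for whatever tokenizing step you already use. As a minimal sketch (assuming plain whitespace splitting is enough for your data), it could be a small transformer like this:

    from sklearn.base import BaseEstimator, TransformerMixin

    class MyCustomTokenizer(BaseEstimator, TransformerMixin):
        """Hypothetical tokenizer step: splits each raw string on whitespace."""
        def fit(self, X, y=None):
            # nothing to learn from the data
            return self

        def transform(self, X):
            # turn each raw document string into a list of tokens
            return [doc.split() for doc in X]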
    
  • 2021-02-13 22:45

    Summarizing the answers of @user126350 and @miroli and this link:

    from sklearn.feature_extraction.text import CountVectorizer
    
    def dummy(doc):
        return doc
    
    cv = CountVectorizer(
        tokenizer=dummy,
        preprocessor=dummy,
    )
    
    docs = [
        ['hello', 'world', '.'],
        ['hello', 'world'],
        ['again', 'hello', 'world']
    ]
    
    cv.fit(docs)
    cv.get_feature_names()
    # ['.', 'again', 'hello', 'world']
    # (in scikit-learn >= 1.0, use cv.get_feature_names_out() instead)
    

    The one thing to keep in mind is to wrap a new tokenized document in a list before calling transform(), so that it is handled as a single document rather than each token being interpreted as a separate document:

    new_doc = ['again', 'hello', 'world', '.']
    v_1 = cv.transform(new_doc)    # wrong: each token treated as a document
    v_2 = cv.transform([new_doc])  # right: one document with four tokens
    
    v_1.shape
    # (4, 4)
    
    v_2.shape
    # (1, 4)
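
    As a quick sanity check (a sketch; the expected output assumes the vocabulary ['.', 'again', 'hello', 'world'] learned above), the wrapped document counts one occurrence of each feature:

    v_2.toarray()
    # array([[1, 1, 1, 1]])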
    
  • 2021-02-13 22:50

    In general, you can pass a custom tokenizer to CountVectorizer via its tokenizer parameter. The tokenizer should be a function that takes a string and returns a list of its tokens. However, if you already have your tokens in lists, you can simply build a dictionary mapping arbitrary keys to the token lists and have your tokenizer look each key up in that dictionary. Then, when you run CountVectorizer, just pass it the dictionary keys. For example,

    from sklearn.feature_extraction.text import CountVectorizer

    # arbitrary token arrays and their keys
    custom_tokens = {"hello world": ["here", "is", "world"],
                     "it is possible": ["yes it", "is"]}

    CV = CountVectorizer(
        # so we can pass it strings
        input='content',
        # turn off preprocessing of strings to avoid corrupting our keys
        lowercase=False,
        preprocessor=lambda x: x,
        # use our token dictionary
        tokenizer=lambda key: custom_tokens[key])

    CV.fit(custom_tokens.keys())
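
    To vectorize afterwards, you again pass dictionary keys rather than raw text. A short usage sketch with the custom_tokens dictionary defined above:

    # each key is looked up in custom_tokens by the tokenizer
    vects = CV.transform(["hello world", "it is possible"])
    vects.shape
    # (2, 4)  -- vocabulary: ['here', 'is', 'world', 'yes it']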
    