Incremental Learning in Scikit with PassiveAggressiveClassifier's partial_fit

长发绾君心 2021-01-23 09:07

I'm trying to train a PassiveAggressiveClassifier on TfidfVectorizer features using partial_fit, in the script below:

Co

4 answers
  •  [愿得一人]
    2021-01-23 09:28

    For those whose needs HashingVectorizer doesn't meet, see a possible alternative in my answer to this related question here. It's basically a custom implementation of partial_fit for TfidfVectorizer and CountVectorizer.
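For reference, the HashingVectorizer route mentioned above can be sketched roughly as follows. This is only a minimal illustration: the mini-batches, the labels and the `n_features` value are made up, and newer scikit-learn releases may emit a deprecation warning for PassiveAggressiveClassifier.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier

# Stateless vectorizer: the output width is set by n_features, not by a
# learned vocabulary, so every batch has the same number of columns.
vectorizer = HashingVectorizer(n_features=2**12, alternate_sign=False)
clf = PassiveAggressiveClassifier(random_state=0)

# Hypothetical mini-batches of (texts, labels); real data would be streamed.
batches = [
    (["cheap pills buy now", "meeting at noon tomorrow"], [1, 0]),
    (["win money now", "lunch with the team"], [1, 0]),
]
all_classes = np.array([0, 1])  # every class must be declared on the first call

for texts, labels in batches:
    X = vectorizer.transform(texts)  # no fit step is needed
    clf.partial_fit(X, labels, classes=all_classes)

X_new = vectorizer.transform(["buy cheap pills"])
pred = clf.predict(X_new)
```

The trade-off is that hashing gives up the term-to-column mapping and the IDF weighting that TfidfVectorizer provides, which is why it doesn't fit every use case.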

    Two comments relating to the specific discussion here:

    • OP's issue requires that the dimension of the output vector be identical after every call of partial_fit. In general, every Scikit-Learn estimator that implements partial_fit is expected to keep working within a pipeline after partial_fit is called. For vectorizers this means not changing the output dimension, since other estimators in the pipeline may not be able to handle the change. I think this is why partial_fit has not yet been implemented in Scikit-Learn for these vectorizers (see discussion on an active PR): partial_fit would presumably update the vocabulary, which changes the output dimension whenever new terms are added.

    • So the solution proposed in my answer (a partial_fit method for TfidfVectorizer) would only solve the first part of OP's needs, which is incremental learning. To solve the second part, it may be possible to pad the output vector with zeros up to a predetermined length. It's not ideal, since it would fail once the vocabulary exceeds that limit.
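The zero-padding idea from the second comment could look roughly like the sketch below. `pad_to_width` and `MAX_FEATURES` are hypothetical names, not part of Scikit-Learn, and the sketch assumes that the custom partial_fit only ever appends new terms at the end of the vocabulary, so existing column indices never move. A plain refit would not guarantee that, because the built-in vectorizers sort the vocabulary alphabetically.

```python
import scipy.sparse as sp
from sklearn.feature_extraction.text import TfidfVectorizer

MAX_FEATURES = 16  # hypothetical fixed width expected by the downstream classifier


def pad_to_width(X, width):
    """Right-pad a sparse matrix with zero columns up to a fixed width."""
    X = sp.csr_matrix(X)
    n_rows, n_cols = X.shape
    if n_cols > width:
        # The failure mode mentioned above: the vocabulary outgrew the limit.
        raise ValueError(f"{n_cols} features exceed the fixed width {width}")
    # Reuse the CSR buffers and only declare a wider shape; the new columns are empty.
    return sp.csr_matrix((X.data, X.indices, X.indptr), shape=(n_rows, width))


vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(["red apple", "green apple"])  # 3 terms in the vocabulary
X_padded = pad_to_width(X, MAX_FEATURES)
```

With this, the classifier always sees `MAX_FEATURES` columns, and a vocabulary that grows past the limit raises an error instead of silently changing the input shape.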
