I'm trying to train a PassiveAggressiveClassifier using TfidfVectorizer with the partial_fit technique in the script below:
Co
This is what I understand from your problem:
1) You need partial_fit so that you can train the model online.
2) Your feature space is very large.
If I have understood correctly, I faced the same problem. If you use HashingVectorizer, there is a high chance of key collisions.
From the HashingVectorizer documentation:
There are also a couple of cons (vs using a CountVectorizer with an in-memory vocabulary):
- there is no way to compute the inverse transform (from feature indices to string feature names) which can be a problem when trying to introspect which features are most important to a model.
- there can be collisions: distinct tokens can be mapped to the same feature index. However in practice this is rarely an issue if n_features is large enough (e.g. 2 ** 18 for text classification problems).
- no IDF weighting as this would render the transformer stateful.
If keys collide, accuracy may drop, because distinct tokens end up sharing one feature.
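To make the collision point concrete, here is a minimal sketch of the hashing trick using only the standard library. It is not scikit-learn's implementation (HashingVectorizer uses a different hash function internally); md5, the helper names and the toy vocabulary are all assumptions chosen for illustration.

```python
import hashlib


def feature_index(token, n_features):
    """Map a token to a column index with a stable hash (md5, for illustration only)."""
    digest = hashlib.md5(token.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_features


def count_collisions(tokens, n_features):
    """Number of tokens that land in a column already taken by another token."""
    used = {feature_index(t, n_features) for t in tokens}
    return len(tokens) - len(used)


# Hypothetical vocabulary of 1000 distinct tokens.
tokens = [f"token{i}" for i in range(1000)]

small = count_collisions(tokens, 16)       # far fewer columns than tokens
large = count_collisions(tokens, 2 ** 18)  # the size mentioned in the docs quote
print(small, large)
```

With only 16 columns almost every token must share a column, while a table of 2 ** 18 columns leaves very few shared columns for a vocabulary of this size. How much that matters for accuracy depends on your data.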
In my online training, I first trained the classifier with partial_fit like this:
classifier = MultinomialNB(alpha=alpha_optimized).partial_fit(X_train_tfidf, y_train, classes=np.array([0, 1]))
On the second day, I loaded the pickled classifier, count_vect and tfidf from the first day's training. Then I only called transform (not fit) on count_vect and tfidf, and it worked:
X_train_counts = count_vect.transform(x_train)
X_train_tfidf = tfidf.transform(X_train_counts)
pf_classifier.partial_fit(X_train_tfidf,y_train)
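The two days can be put together as one rough end-to-end sketch. The toy texts, labels and alpha=1.0 are placeholders (the original used a tuned alpha_optimized), and the objects are pickled in memory here rather than to files, so treat this as an outline under those assumptions rather than the exact script.

```python
import pickle

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data; replace with your own documents and labels.
day1_texts = ["cheap pills buy now", "meeting at noon tomorrow",
              "buy cheap watches now", "lunch meeting moved to noon"]
day1_labels = np.array([1, 0, 1, 0])
day2_texts = ["cheap watches for sale", "project meeting tomorrow"]
day2_labels = np.array([1, 0])

# Day 1: fit the vocabulary and IDF weights, then start the classifier.
count_vect = CountVectorizer()
tfidf = TfidfTransformer()
X1 = tfidf.fit_transform(count_vect.fit_transform(day1_texts))
classifier = MultinomialNB(alpha=1.0).partial_fit(
    X1, day1_labels, classes=np.array([0, 1]))

# Persist all three objects (in memory here; writing to a file works the same way).
blob = pickle.dumps((count_vect, tfidf, classifier))

# Day 2: load, call transform only (never fit again), and keep training.
count_vect2, tfidf2, pf_classifier = pickle.loads(blob)
X2 = tfidf2.transform(count_vect2.transform(day2_texts))
pf_classifier.partial_fit(X2, day2_labels)

n_features_day1 = X1.shape[1]
n_features_day2 = X2.shape[1]
print(n_features_day1, n_features_day2)
```

Because the vectorizer is only transformed on day 2, the feature matrix keeps the same number of columns the classifier was first trained on. The trade-off is that words first seen on day 2 are not in the stored vocabulary and are ignored, and the IDF weights stay as they were on day 1.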
If anything is unclear, please reply.