Question
I am using Pipeline from sklearn to classify text. In this example Pipeline, the steps are a TfidfVectorizer and some custom features wrapped in a FeatureUnion, followed by a classifier. I then fit the training data and do the prediction:
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
X = ['I am a sentence', 'an example']
Y = [1, 2]
X_dev = ['another sentence']
# classifier
LinearSVC1 = LinearSVC(tol=1e-4, C=0.1)
pipeline = Pipeline([
    ('features', FeatureUnion([
        ('tfidf', TfidfVectorizer(ngram_range=(1, 3), max_features=4000)),
        ('custom_features', CustomFeatures())])),
    ('clf', LinearSVC1),
])
pipeline.fit(X, Y)
y_pred = pipeline.predict(X_dev)
# etc.
Here I need to pickle the TfidfVectorizer step and leave the custom_features unpickled, since I still do experiments with them. The idea is to make the pipeline faster by pickling the tfidf step.
I know I can pickle the whole Pipeline with joblib.dump, but how do I pickle individual steps?
Answer 1:
To pickle the TfidfVectorizer, you could use:
joblib.dump(pipeline.steps[0][1].transformer_list[0][1], dump_path)
or:
joblib.dump(pipeline.get_params()['features__tfidf'], dump_path)
To load the dumped object, you can use:
pipeline.steps[0][1].transformer_list[0][1] = joblib.load(dump_path)
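For reference, here is a minimal end-to-end sketch of that round trip. It makes a few assumptions that are not in the question: CountVectorizer is only a stand-in for the asker's CustomFeatures class (which is not shown), and dump_path points at a temporary directory chosen purely for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import LinearSVC

X = ['I am a sentence', 'an example']
Y = [1, 2]
X_dev = ['another sentence']

# CountVectorizer is only a stand-in for the asker's CustomFeatures class.
pipeline = Pipeline([
    ('features', FeatureUnion([
        ('tfidf', TfidfVectorizer(ngram_range=(1, 3), max_features=4000)),
        ('custom_features', CountVectorizer(analyzer='char')),
    ])),
    ('clf', LinearSVC(tol=1e-4, C=0.1)),
])
pipeline.fit(X, Y)

# Hypothetical dump location; any writable path works.
dump_path = os.path.join(tempfile.mkdtemp(), 'tfidf.joblib')

# Dump only the fitted TfidfVectorizer (first transformer in the union).
fitted_tfidf = pipeline.named_steps['features'].transformer_list[0][1]
joblib.dump(fitted_tfidf, dump_path)

# Load it back and check that it transforms new text like the original.
loaded_tfidf = joblib.load(dump_path)
same_output = np.allclose(
    loaded_tfidf.transform(X_dev).toarray(),
    fitted_tfidf.transform(X_dev).toarray(),
)
print(same_output)
```

One caveat: calling pipeline.fit again refits every step, including a vectorizer that was loaded from disk, so the loaded object only saves time where you call transform or predict without refitting.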
Unfortunately, at the time of writing you can't use set_params, the inverse of get_params, to insert the estimator by name. You will be able to if the changes in PR#1769: enable setting pipeline components as parameters are ever merged!
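In more recent scikit-learn releases, Pipeline and FeatureUnion do accept a replacement component by name through set_params, so the index-based assignment may no longer be necessary. The sketch below assumes such a release; the pipeline here omits the custom features for brevity, and dump_path is again a hypothetical temporary location.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import LinearSVC

X = ['I am a sentence', 'an example']

# Fit a vectorizer once and store it (hypothetical temporary path).
dump_path = os.path.join(tempfile.mkdtemp(), 'tfidf.joblib')
joblib.dump(
    TfidfVectorizer(ngram_range=(1, 3), max_features=4000).fit(X),
    dump_path,
)

# A fresh pipeline whose 'tfidf' component has not been fitted yet.
pipeline = Pipeline([
    ('features', FeatureUnion([
        ('tfidf', TfidfVectorizer()),
    ])),
    ('clf', LinearSVC(tol=1e-4, C=0.1)),
])

# Replace the nested component by name: <step name>__<transformer name>.
pipeline.set_params(features__tfidf=joblib.load(dump_path))

# The component inside the union is now the loaded, already fitted object.
replaced = pipeline.get_params()['features__tfidf']
print(hasattr(replaced, 'vocabulary_'))
```

If set_params raises an error about an invalid parameter, the installed version probably predates this feature, and the index-based assignment shown earlier is the fallback.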
Source: https://stackoverflow.com/questions/36259967/how-to-pickle-individual-steps-in-sklearns-pipeline