Question
I have a TfidfVectorizer that vectorizes a collection of articles, followed by feature selection:
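from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(corpus)
selector = SelectKBest(chi2, k=5000)
X_train_sel = selector.fit_transform(X_train, y_train)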
Now I want to store this and use it in other programs. I don't want to re-run the TfidfVectorizer() and the feature selector on the training dataset. How do I do that? I know how to make a model persistent using joblib, but I wonder whether persisting these objects works the same way.
Answer 1:
You can simply use the built-in pickle library:
import pickle
pickle.dump(vectorizer, open("vectorizer.pickle", "wb"))
pickle.dump(selector, open("selector.pickle", "wb"))
and load them with:
vectorizer = pickle.load(open("vectorizer.pickle", "rb"))
selector = pickle.load(open("selector.pickle", "rb"))
Pickle serializes the objects to disk and loads them back into memory when you need them. See the pickle library docs.
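As a fuller illustration of the round trip, here is a minimal Python 3 sketch. The toy corpus, the labels, the small k, and the temporary directory are all assumptions made up for the example; they are not from the question. It uses with blocks so the files are closed promptly, then checks that the reloaded objects transform a new document exactly like the originals.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# Toy corpus and labels, invented for illustration only.
corpus = [
    "the cat sat on the mat",
    "dogs chase cats in the yard",
    "stocks fell sharply on monday",
    "markets rallied after the report",
]
y_train = [0, 0, 1, 1]

# Fit the vectorizer and the selector once, on the training data.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(corpus)
selector = SelectKBest(chi2, k=5)  # small k because the toy vocabulary is tiny
selector.fit(X_train, y_train)

# Hypothetical storage location: a fresh temporary directory.
tmp_dir = tempfile.mkdtemp()
vec_path = os.path.join(tmp_dir, "vectorizer.pickle")
sel_path = os.path.join(tmp_dir, "selector.pickle")

# Dump both fitted objects; "wb" because pickle writes bytes.
with open(vec_path, "wb") as f:
    pickle.dump(vectorizer, f)
with open(sel_path, "wb") as f:
    pickle.dump(selector, f)

# Reload them, as another program would.
with open(vec_path, "rb") as f:
    loaded_vectorizer = pickle.load(f)
with open(sel_path, "rb") as f:
    loaded_selector = pickle.load(f)

# The reloaded objects should produce the same features as the originals.
new_docs = ["the cat chased the dog"]
before = selector.transform(vectorizer.transform(new_docs))
after = loaded_selector.transform(loaded_vectorizer.transform(new_docs))
same = np.array_equal(before.toarray(), after.toarray())
print("identical output after reload:", same)
```

Only load pickle files you trust, and reload them with the same scikit-learn version that wrote them where possible, since pickles are not guaranteed to be compatible across versions.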
Answer 2:
"Making an object persistent" basically means dumping the binary representation of the object held in memory to a file on disk, so that later, in the same program or in any other, the object can be reloaded from that file into memory.
Either joblib (shipped with scikit-learn at the time), or the stdlib pickle and cPickle (the latter is Python 2 only), would do the job. I tend to prefer cPickle because it is significantly faster. Using IPython's %timeit command:
>>> import pickle, cPickle  # cPickle is Python 2 only
>>> import joblib  # or: from sklearn.externals import joblib (old scikit-learn releases)
>>> from sklearn.feature_extraction.text import TfidfVectorizer as TFIDF
>>> t = TFIDF()
>>> t.fit_transform(['hello world'], ['this is a test'])  # second argument is y, which TfidfVectorizer ignores
# generic serializer - deserializer test
>>> def dump_load_test(tfidf, serializer):
...:     with open('vectorizer.bin', 'wb') as f:  # binary mode for pickled data
...:         serializer.dump(tfidf, f)
...:     with open('vectorizer.bin', 'rb') as f:
...:         return serializer.load(f)
# joblib has a slightly different interface
>>> def joblib_test(tfidf):
...:     joblib.dump(tfidf, 'tfidf.bin')
...:     return joblib.load('tfidf.bin')
# Now, time it!
>>> %timeit joblib_test(t)
100 loops, best of 3: 3.09 ms per loop
>>> %timeit dump_load_test(t, pickle)
100 loops, best of 3: 2.16 ms per loop
>>> %timeit dump_load_test(t, cPickle)
1000 loops, best of 3: 879 µs per loop
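On Python 3 there is no separate cPickle module; the standard pickle module uses its C implementation automatically when it is available. A rough way to repeat this kind of measurement outside IPython is the stdlib timeit module. The sketch below is not the benchmark above: payload is a made-up dictionary standing in for a fitted vectorizer, and round_trip is a hypothetical helper that serializes to bytes in memory rather than to a file, so the numbers it prints are not comparable with the timings quoted here.

```python
import pickle
import timeit

# Stand-in payload (hypothetical); substitute your fitted vectorizer here.
payload = {"vocabulary": {"word%d" % i: i for i in range(1000)}}


def round_trip(obj):
    """Serialize to bytes and load back in memory, avoiding disk I/O noise."""
    return pickle.loads(pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL))


# Run the round trip 100 times and report the mean time per call.
loops = 100
elapsed = timeit.timeit(lambda: round_trip(payload), number=loops)
per_loop_us = elapsed / loops * 1e6
print("%.1f us per loop" % per_loop_us)
```

To compare serializers on your own objects, swap payload for the fitted vectorizer and time each candidate the same way.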
Now, if you want to store multiple objects in a single file, you can create a data structure to hold them and then dump the data structure itself. This works with a tuple, list, or dict.
Using the example from your question:
import cPickle  # Python 2; on Python 3: import pickle as cPickle
# train
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(corpus)
selector = SelectKBest(chi2, k=5000)
X_train_sel = selector.fit_transform(X_train, y_train)
# dump as a dict
data_struct = {'vectorizer': vectorizer, 'selector': selector}
# use the 'with' keyword to automatically close the file after the dump
with open('storage.bin', 'wb') as f:
    cPickle.dump(data_struct, f)
Later, or in another program, the following statements bring the data structure back into your program's memory:
# reload
with open('storage.bin', 'rb') as f:
    data_struct = cPickle.load(f)
vectorizer, selector = data_struct['vectorizer'], data_struct['selector']
# do stuff...
vectors = vectorizer.transform(...)
vec_sel = selector.transform(vectors)
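A different way to keep the two steps together, swapped in here as an alternative to the dict, is scikit-learn's Pipeline: the vectorizer and the selector become one object that can be dumped and reloaded in a single call. This is a minimal sketch under assumptions; the toy corpus, the labels, the step names "tfidf" and "select", the small k, and the file name are invented for the example.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import Pipeline

# Toy corpus and labels, invented for illustration only.
corpus = [
    "the cat sat on the mat",
    "dogs chase cats in the yard",
    "stocks fell sharply on monday",
    "markets rallied after the report",
]
y_train = [0, 0, 1, 1]

# Chain vectorization and feature selection into a single estimator.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("select", SelectKBest(chi2, k=5)),  # small k for the tiny toy vocabulary
])
pipeline.fit(corpus, y_train)

# Persist the whole fitted pipeline as one file (hypothetical path).
path = os.path.join(tempfile.mkdtemp(), "pipeline.joblib")
joblib.dump(pipeline, path)

# Later, or in another program: reload and transform raw text directly.
restored = joblib.load(path)
features = restored.transform(["the cat chased the dog"])
print(features.shape)
```

The trade-off is that the individual steps are reached through restored.named_steps instead of separate variables, which may or may not suit the calling code.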
Answer 3:
Here is my answer using joblib:
import joblib
joblib.dump(vectorizer, 'vectorizer.pkl')
joblib.dump(selector, 'selector.pkl')
Later, I can load them and be ready to go:
vectorizer = joblib.load('vectorizer.pkl')
selector = joblib.load('selector.pkl')
test = selector.transform(vectorizer.transform(['this is test']))
Source: https://stackoverflow.com/questions/32764991/how-do-i-store-a-tfidfvectorizer-for-future-use-in-scikit-learn