How do I store a TfidfVectorizer for future use in scikit-learn?

。_饼干妹妹 提交于 2019-11-28 06:57:36

You can simply use the built in pickle lib:

pickle.dump(vectorizer, open("vectorizer.pickle", "wb"))
pickle.dump(selector, open("selector.pickle", "wb"))

and load it with:

vectorizer = pickle.load(open("vectorizer.pickle"), "rb"))
selector = pickle.load(open("selector.pickle"), "rb"))

Pickle will serialize the objects to disk and load them in memory again when you need it

pickle lib docs

"Making an object persistent" basically means that you're going to dump the binary code stored in memory that represents the object in a file on the hard-drive, so that later on in your program or in any other program the object can be reloaded from the file in the hard drive into memory.

Either scikit-learn included joblib or the stdlib pickle and cPickle would do the job. I tend to prefer cPickle because it is significantly faster. Using ipython's %timeit command:

>>> from sklearn.feature_extraction.text import TfidfVectorizer as TFIDF
>>> t = TFIDF()
>>> t.fit_transform(['hello world'], ['this is a test'])

# generic serializer - deserializer test
>>> def dump_load_test(tfidf, serializer):
...:    with open('vectorizer.bin', 'w') as f:
...:        serializer.dump(tfidf, f)
...:    with open('vectorizer.bin', 'r') as f:
...:        return serializer.load(f)

# joblib has a slightly different interface
>>> def joblib_test(tfidf):
...:    joblib.dump(tfidf, 'tfidf.bin')
...:    return joblib.load('tfidf.bin')

# Now, time it!
>>> %timeit joblib_test(t)
100 loops, best of 3: 3.09 ms per loop

>>> %timeit dump_load_test(t, pickle)
100 loops, best of 3: 2.16 ms per loop

>>> %timeit dump_load_test(t, cPickle)
1000 loops, best of 3: 879 µs per loop

Now if you want to store multiple objects in a single file, you can easily create a data structure to store them, then dump the data structure itself. This will work with tuple, list or dict. From the example of your question:

# train
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(corpus)
selector = SelectKBest(chi2, k = 5000 )
X_train_sel = selector.fit_transform(X_train, y_train)

# dump as a dict
data_struct = {'vectorizer': vectorizer, 'selector': selector}
# use the 'with' keyword to automatically close the file after the dump
with open('storage.bin', 'wb') as f: 
    cPickle.dump(data_struct, f)

Later or in another program, the following statements will bring back the data structure in your program's memory:

# reload
with open('storage.bin', 'rb') as f:
    data_struct = cPickle.load(f)
    vectorizer, selector = data_struct['vectorizer'], data_struct['selector']

# do stuff...
vectors = vectorizer.transform(...)
vec_sel = selector.transform(vectors)

Here is my answer using joblib:

joblib.dump(vectorizer, 'vectroizer.pkl')
joblib.dump(selector, 'selector.pkl')

Later, I can load it and ready to go:

vectorizer = joblib.load('vectorizer.pkl')
selector = joblib.load('selector.pkl')

test = selector.trasnform(vectorizer.transform(['this is test']))
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!