I need to serialise scikit-learn/statsmodels models such that all the dependencies (code + data) are packaged in an artefact, and this artefact can be used to initialise the model.
I package a Gaussian process (GP) from scikit-learn using `pickle`. The primary reason is that the GP takes a long time to build and loads much faster from a pickle. So in my code's initialisation I check whether the data files for the model got updated and re-generate the model if necessary; otherwise I just de-serialise it from the pickle.
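A minimal sketch of that cache-or-rebuild pattern, using file modification times (the file names and the `build_model` function here are placeholders, not from the question):

```python
import os
import pickle

DATA_FILE = "train_data.csv"   # hypothetical training-data file
CACHE_FILE = "gp_model.pkl"    # hypothetical pickle cache

def build_model(path):
    """Placeholder for the expensive GP fit."""
    with open(path) as f:
        return {"n_rows": sum(1 for _ in f)}  # stand-in for a fitted model

def load_model():
    # Rebuild only if there is no cache yet, or the data file
    # is newer than the cached pickle.
    stale = (not os.path.exists(CACHE_FILE)
             or os.path.getmtime(DATA_FILE) > os.path.getmtime(CACHE_FILE))
    if stale:
        model = build_model(DATA_FILE)
        with open(CACHE_FILE, "wb") as f:
            pickle.dump(model, f)
        return model
    with open(CACHE_FILE, "rb") as f:
        return pickle.load(f)
```

A content hash of the data files would be a more robust staleness check than modification times, at the cost of reading the files each time.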
I would use `pickle`, `dill`, and `cloudpickle`, in that order.
Note that `pickle` takes a `protocol` keyword argument, and some values can speed up pickling and reduce memory usage significantly!
Finally, I wrap the pickle code with compression from the Python standard library if necessary.
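For example, combining the highest pickle protocol with `gzip` from the standard library (the dictionary here is just an illustrative stand-in for a fitted model):

```python
import gzip
import pickle

obj = {"weights": list(range(10_000))}  # stand-in for a fitted model

# protocol=pickle.HIGHEST_PROTOCOL is typically faster and more compact
# than the default protocol, especially for large payloads.
with gzip.open("model.pkl.gz", "wb") as f:
    pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)

with gzip.open("model.pkl.gz", "rb") as f:
    restored = pickle.load(f)

assert restored == obj
```

`bz2` and `lzma` offer the same file-like interface if you want stronger compression at the cost of speed.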
I'm the `dill` author. `dill` was built to do exactly what you are doing… (to persist numerical fits within class instances for statistics) where these objects can then be distributed to different resources and run in an embarrassingly parallel fashion. So the answer is yes -- I have run code like yours, using mystic and/or sklearn.
Note that many of the authors of `sklearn` use `cloudpickle` for enabling parallel computing on `sklearn` objects, and not `dill`. `dill` can pickle more types of objects than `cloudpickle`; however, `cloudpickle` is slightly better (at the time of writing) at pickling objects that make references to the global dictionary as part of a closure -- by default, `dill` handles these by reference, while `cloudpickle` physically stores the dependencies. However, `dill` has a `"recurse"` mode that acts like `cloudpickle`, so the difference when using this mode is minor. (To enable `"recurse"` mode, do `dill.settings['recurse'] = True`, or pass `recurse=True` as a flag to `dill.dump`.) Another minor difference is that `cloudpickle` contains special support for things like `scikits.timeseries` and `PIL.Image`, while `dill` does not.
On the plus side, `dill` does not pickle classes by reference, so by pickling a class instance it serialises the class object itself -- which is a big advantage, as it serialises instances of derived classes of classifiers, models, etc. from `sklearn` in their exact state at the time of pickling… so if you make modifications to the class object, the instance still unpickles correctly. There are other advantages of `dill` over `cloudpickle`, aside from the broader range of objects (and typically a smaller pickle) -- however, I won't list them here. You asked for pitfalls, so differences are not pitfalls.
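To see the contrast, here is a sketch of what plain `pickle` does with a class instance: the byte stream records only a *reference* (module and class name), not the method code, so unpickling elsewhere depends on that exact class being importable (the `Classifier` class below is made up for illustration):

```python
import pickle

class Classifier:
    def predict(self, x):
        return x * 2

blob = pickle.dumps(Classifier())

# The stream contains the class *name*, used to look the class up
# again at load time...
assert b"Classifier" in blob
# ...but not the method body -- dill, by contrast, can serialise
# the class object itself, method code included.
assert b"predict" not in blob
```

This is why, with plain `pickle`, editing or deleting the class definition between dump and load changes (or breaks) what the unpickled instance does.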
Major pitfalls:
You should have anything your classes refer to installed on the remote machine, just in case `dill` (or `cloudpickle`) pickles it by reference.
You should try to make your classes and class methods as self-contained as possible (e.g. don't refer to objects defined in the global scope from your classes).
`sklearn` objects can be big, so saving many of them to a single pickle is not always a good idea… you might want to use `klepto`, which has a `dict` interface to caching and archiving, and enables you to configure the archive interface to store each key-value pair individually (e.g. one entry per file).
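The "self-contained" advice above is because references to module-level globals are invisible to plain `pickle`: the global's value travels neither with the instance nor with the class reference. A sketch (the `SCALE` global and `BadModel` class are made up for illustration):

```python
import pickle

SCALE = 10  # module-level global the method depends on

class BadModel:
    def predict(self, x):
        # Hidden dependency: the pickle stores neither SCALE's value
        # nor any guarantee that SCALE exists on the remote machine.
        return x * SCALE

blob = pickle.dumps(BadModel())

# The dependency does not appear anywhere in the serialised bytes:
assert b"SCALE" not in blob
```

Moving such values into instance attributes (e.g. `self.scale = 10`) makes them part of the pickled state, so the object round-trips without relying on the remote module's globals.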
OK, to begin with: in your sample code, `pickle` could work fine. I use `pickle` all the time to package a model and use it later, unless you want to send the model directly to another server or save the interpreter state, because that is what `dill` is good at and `pickle` cannot do. It also depends on your code -- what types you use, etc.; `pickle` might fail, while `dill` is more stable.
`dill` is primarily based on `pickle`, so they are very similar. Some things you should take into account / look into:
Limitations of `dill`: the `frame`, `generator`, and `traceback` standard types cannot be packaged.
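This limitation is inherited from `pickle` itself; a live generator, for instance, carries frame state that cannot be serialised (a minimal demo):

```python
import pickle

def counter():
    n = 0
    while True:
        yield n
        n += 1

gen = counter()
next(gen)  # advance the generator; its frame now holds live state

try:
    pickle.dumps(gen)
    pickled_ok = True
except TypeError:
    # Generators (and the frames behind them) cannot be pickled.
    pickled_ok = False

assert not pickled_ok
```

If you need to persist iteration state, store the plain data needed to reconstruct it (here, the current value of `n`) rather than the generator object.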
`cloudpickle` might be a good idea for your problem as well; it has better support for pickling objects than `pickle` (not per se better than `dill`), and you can pickle code easily as well.
Once the target machine has the correct libraries loaded (be careful about different Python versions as well, because they may break your code), everything should work fine with both `dill` and `cloudpickle`, as long as you do not use the unsupported standard types.
Hope this helps.