`shelve` is probably not a good choice, however... You might try using `klepto` or `joblib`. Both are good at caching results, and can use efficient storage formats.
Both `joblib` and `klepto` can save your results to a file on disk, or to a directory. Both can also leverage the `numpy` internal storage format and/or compression on save… and also save to memory-mapped files, if you like.

If you use `klepto`, it takes the dictionary key as the filename, and saves the value as the contents. With `klepto`, you can also pick whether you want to use `pickle` or `json` or some other storage format.
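For illustration, here's a minimal sketch of that key-as-filename idea using only the standard library (this is not `klepto` itself; the `save_entry`/`load_entry` helper names are made up):

```python
import os
import pickle
import tempfile

# illustrative stand-in for an archive directory
storage = tempfile.mkdtemp(prefix='storage_')

def save_entry(key, value):
    # the dictionary key becomes the filename,
    # the pickled value becomes the file's contents
    with open(os.path.join(storage, key), 'wb') as f:
        pickle.dump(value, f)

def load_entry(key):
    with open(os.path.join(storage, key), 'rb') as f:
        return pickle.load(f)

save_entry('a', [[1, 2, 3], [4, 5, 6]])
print(load_entry('a'))  # → [[1, 2, 3], [4, 5, 6]]
```

Swapping `pickle` for `json` here is what choosing a different serialization backend amounts to.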
```python
Python 2.7.7 (default, Jun  2 2014, 01:33:50)
[GCC 4.2.1 Compatible Apple Clang 4.1 ((tags/Apple/clang-421.11.66))] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import klepto
>>> data_dict = klepto.archives.dir_archive('storage', cached=False, serialized=True)
>>> import string
>>> import random
>>> for j in string.ascii_letters:
...   for k in range(1000):
...     data_dict.setdefault(j, []).append([int(10*random.random()) for i in range(3)])
...
>>>
```
This will create a directory called `storage` that contains pickled files, one for each key of your `data_dict`. There are keywords for using `memmap` files, and also for compression level. If you instead choose `cached=True`, then rather than dumping to file each time you wrote to `data_dict`, you'd write to memory each time… and you could then use `data_dict.dump()` to dump to disk whenever you choose… or you could pick a memory limit such that when you hit it, you dump to disk. Additionally, you can also pick a caching strategy (like `lru` or `lfu`) for deciding which keys to purge from memory and dump to disk.
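A rough sketch of that cached workflow in plain Python (this is not `klepto`'s implementation; the `CachedArchive` class and its layout are invented for illustration): writes land in an in-memory dict, and a single `dump()` call flushes everything to disk at a moment you choose.

```python
import os
import pickle
import tempfile

class CachedArchive:
    """Toy model of a cached archive: stage writes in memory, flush on demand."""

    def __init__(self, dirname):
        self.dirname = dirname
        os.makedirs(dirname, exist_ok=True)
        self.cache = {}          # in-memory staging area

    def __setitem__(self, key, value):
        self.cache[key] = value  # no disk I/O here

    def dump(self):
        # flush the cache to one pickled file per key, then clear it
        for key, value in self.cache.items():
            with open(os.path.join(self.dirname, key), 'wb') as f:
                pickle.dump(value, f)
        self.cache.clear()

arc = CachedArchive(tempfile.mkdtemp())
arc['x'] = list(range(5))        # stays in memory
arc.dump()                       # one disk write, at a time you choose
```

A memory limit or an `lru`/`lfu` policy would just change *which* keys get flushed and when, not the basic stage-then-dump shape shown here.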
Get `klepto` here: https://github.com/uqfoundation, or get `joblib` here: https://github.com/joblib/joblib
If you refactor, you could probably come up with a way to do this so it could take advantage of a pre-allocated array. However, it might depend on the profile of how your code runs.
Does opening and closing files affect run time? Yes. If you use `klepto`, you can set the granularity of when you want to dump to disk. Then you can pick a trade-off of speed versus intermediate storage of results.
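As a hedged illustration of that granularity knob (again not `klepto`'s API; `BatchedStore` and `flush_every` are invented names), you could buffer updates in memory and only touch the filesystem once every N writes:

```python
import os
import pickle
import tempfile

class BatchedStore:
    """Toy store that amortizes file open/close cost over flush_every writes."""

    def __init__(self, dirname, flush_every=100):
        self.dirname = dirname
        os.makedirs(dirname, exist_ok=True)
        self.flush_every = flush_every
        self.pending = {}
        self.writes = 0

    def put(self, key, value):
        self.pending[key] = value
        self.writes += 1
        if self.writes % self.flush_every == 0:
            self.flush()         # coarse granularity: hit the disk rarely

    def flush(self):
        # one file per key, written in a single batch
        for key, value in self.pending.items():
            with open(os.path.join(self.dirname, str(key)), 'wb') as f:
                pickle.dump(value, f)
        self.pending.clear()

store = BatchedStore(tempfile.mkdtemp(), flush_every=10)
for i in range(25):
    store.put(i, i * i)
store.flush()                    # flush the tail explicitly
```

A smaller `flush_every` gives you safer intermediate results; a larger one gives you fewer open/close cycles and more speed.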