I am processing some data and have stored the results in three dictionaries, which I have saved to disk with pickle. Each dictionary is 500-1000 MB.
Now I am loading them back into memory, but this consumes far more RAM than the files take on disk. Is there a more efficient way to store and load these dictionaries?
This is an inherent problem with pickle, which is intended for rather small amounts of data. The dictionaries, when loaded into memory, are many times larger than on disk.
After loading a pickle file of 100 MB, you may well end up with a dictionary of almost 1 GB. There are formulas on the web for estimating the overhead, but for this amount of data I can only recommend using a proper database such as MySQL or PostgreSQL.
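The main win of the database route is that you can read back one entry at a time instead of unpickling the whole dictionary into RAM. A minimal sketch of that idea, using the standard-library sqlite3 in place of MySQL/PostgreSQL purely to keep the example self-contained (the file, table, and key names are made up):

```python
# Sketch only: the answer suggests MySQL/PostgreSQL; sqlite3 (stdlib) is used
# here so the example runs without a server. Names are illustrative.
import pickle
import sqlite3

def save_dict(db_path, table, d):
    """Store each dictionary entry as its own pickled row."""
    con = sqlite3.connect(db_path)
    con.execute(f"CREATE TABLE IF NOT EXISTS {table} (key TEXT PRIMARY KEY, value BLOB)")
    with con:  # commit on success
        con.executemany(
            f"INSERT OR REPLACE INTO {table} (key, value) VALUES (?, ?)",
            ((k, pickle.dumps(v)) for k, v in d.items()),
        )
    con.close()

def load_entry(db_path, table, key):
    """Load a single entry without unpickling the whole dictionary."""
    con = sqlite3.connect(db_path)
    row = con.execute(f"SELECT value FROM {table} WHERE key = ?", (key,)).fetchone()
    con.close()
    return pickle.loads(row[0]) if row else None

results = {"run1": [1, 2, 3], "run2": [4, 5, 6]}
save_dict("results.db", "results", results)
print(load_entry("results.db", "results", "run1"))
```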
I suspect you are using 32-bit Python, which is limited to roughly 4 GB of addressable memory. You should use a 64-bit build instead. I have tried this with pickled dicts beyond 1.7 GB and had no problem other than longer load times.
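A quick way to confirm which build you are running, using only the standard library:

```python
# Check whether the interpreter is a 32-bit or 64-bit build.
import struct
import sys

print(struct.calcsize("P") * 8, "bit")   # pointer size in bits: 32 or 64
print(sys.maxsize > 2**32)               # True on a 64-bit build
```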
If the data in your dictionaries are numpy arrays, there are packages (such as joblib and klepto) that make pickling large arrays efficient, as both klepto and joblib understand how to use a minimal state representation for a numpy.array. If you don't have array data, my suggestion would be to use klepto to store the dictionary entries in several files (instead of a single file) or in a database.
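For the numpy case, a minimal sketch with joblib (assuming joblib and numpy are installed; the file name is arbitrary):

```python
# Minimal sketch: dump a dict of numpy arrays with joblib instead of plain pickle.
import numpy as np
import joblib

data = {"a": np.random.rand(1000, 1000), "b": np.arange(10**6)}

# joblib stores the arrays efficiently and can compress them on the fly.
joblib.dump(data, "data.joblib", compress=3)

restored = joblib.load("data.joblib")
print(restored["a"].shape, restored["b"].dtype)
```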
See my answer to a very closely related question, https://stackoverflow.com/a/25244747/2379433, if you are OK with pickling to several files instead of a single file, would like to save/load your data in parallel, or would like to easily experiment with storage formats and backends to see which works best for your case. Also see https://stackoverflow.com/a/21948720/2379433 for other potential improvements, as well as https://stackoverflow.com/a/24471659/2379433.
As the links above discuss, you could use klepto, which gives you the ability to easily store dictionaries to disk or to a database, using a common API. klepto also lets you pick a storage format (pickle, json, etc.); HDF5 (or a SQL database) is another good option, as it allows parallel access. klepto can use both specialized pickle formats (like numpy's) and compression (if you care about size rather than speed of access).
klepto gives you the option to store the dictionary as an "all-in-one" file or as one file per entry, and it can also leverage multiprocessing or multithreading, meaning that you can save and load dictionary items to/from the backend in parallel. For examples, see the above links.
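A minimal sketch of the one-file-per-entry usage, assuming klepto and numpy are installed (the archive name 'stuff' is arbitrary and mirrors the examples in the linked answers):

```python
# Sketch based on the linked answers: a klepto directory archive with
# one pickled file per dictionary entry and an in-memory cache.
import numpy as np
from klepto.archives import dir_archive

demo = dir_archive('stuff', {}, serialized=True, cached=True)
demo['big1'] = np.arange(10**6)
demo['big2'] = np.arange(10**6)
demo.dump()          # write the cached entries to disk

# Later (or in another process): load only the entry you need.
fresh = dir_archive('stuff', {}, serialized=True, cached=True)
fresh.load('big1')   # pulls just 'big1' from disk into the cache
print(fresh['big1'].shape)
```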