I need to optimize the RAM usage of my application.
PLEASE spare me the lectures telling me I shouldn\'t care about memory when coding Python. I have a memory problem be
For a web application you should use a database, the way you're doing it you are creating one copy of your dict for each apache process, which is extremely wasteful. If you have enough memory on the server the database table will be cached in memory (if you don't have enough for one copy of your table, put more RAM into the server). Just remember to put correct indices on your database table or you will get bad performance.
If you want to stay with the in-memory data store, you could try something like memcached.
That way, you can use a single in-memory key/value-store from all the Python processes.
There are several python memcached client libraries.
Redis would be a great option here if you have the option to use it on a shared host - similar to memcached, but optimised for data structures. Redis also supports python bindings.
I use it on a day to day basis for number crunching but also in production systems as a datastore and cannot recommend it highly enough.
Also, do you have an option to proxy your app behind nginx instead of using Apache? You might find (if allowed) this proxy/webapp arrangement less hungry on resources.
Good luck.
I've had situations where I've had a collection of large objects that I've needed to sort and filter by different methods based on several metadata properties. I didn't need the larger parts of them so I dumped them to disk.
As you data is so simple in type, a quick SQLite database might solve all your problems, even speed things up a little.
Use shelve or a database to store the data instead of an in-memory dict.
I suggest the following: store all the values in a DB, and keep an in-memory dictionary with string hashes as keys. If a collision occurs, fetch values from the DB, otherwise (vast majority of the cases) use the dictionary. Effectively, it will be a giant cache.
A problem with dictionaries in Python is that they use a lot of space: even an int-int dictionary uses 45-80 bytes per key-value pair on a 32-bit system. At the same time, an array.array('i')
uses only 8 bytes per a pair of ints, and with a little bit of bookkeeping one can implement a reasonably fast array-based int → int dictionary.
Once you have a memory-efficient implementation of an int-int dictionary, split your string → (object, int, int) dictionary into three dictionaries and use hashes instead of full strings. You'll get one int → object and two int → int dictionaries. Emulate the int → object dictionary as follows: keep a list of objects and store indexes of the objects as values of an int → int dictionary.
I do realize there's a considerable amount of coding involved to get an array-based dictionary. I had had problem similar to yours and I have implemented a reasonably fast, very memory-efficient, generic hash-int dictionary. Here's my code (BSD license). It is array-based (8 bytes per pair), it takes care of key hashing and collision checking, it keeps the array (several smaller arrays, actually) ordered during writes and does binary search on reads. Your code is reduced to something like:
dictionary = HashIntDict(checking = HashIntDict.CHK_SHOUTING)
# ...
database.store(k, v)
try:
dictionary[k] = v
except CollisionError:
pass
# ...
try:
v = dictionary[k]
except CollisionError:
v = database.fetch(k)
The checking
parameter specifies what happens when a collision occurs: CHK_SHOUTING
raises CollisionError
on reads and writes, CHK_DELETING
returns None
on reads and remains silent on writes, CHK_IGNORING
doesn't do collision checking.
What follows is a brief description of my implementation, optimization hints are welcome! The top-level data structure is a regular dictionary of arrays. Each array contains up to 2^16 = 65536
integer pairs (square root of 2^32
). A key k
and a corresponding value v
are both stored in k/65536
-th array. The arrays are initialized on-demand and kept ordered by keys. Binary search is executed on each read and write. Collision checking is an option. If enabled, an attempt to overwrite an already existing key will remove the key and associated value from the dictionary, add the key to a set of colliding keys, and (again, optionally) raise an exception.