I\'m seeing very poor performance when fetching multiple keys from Memcache using ndb.get_multi()
in App Engine (Python).
I am fetching ~500 small objects,
I investigated this in a bit of detail, and the problem is ndb and Python, not memcache. The reason things are so incredibly slow is partly deserialization (explains about 30% of the time), and the rest seems to be overhead in ndb's task queue implementation.
This means that, if you really want to, you can avoid ndb and instead fetch and deserialize from memcache directly. In my test case with 500 small entities, this gives a massive 2.5x speedup (650ms vs 1600ms on an F1 instance in production, or 200ms vs 500ms on an F4 instance). This gist shows how to do it: https://gist.github.com/mcummins/600fa8852b4741fb2bb1
Here is the appstats output for the manual memcache fetch and deserialization:
Now compare this to fetching exactly the same entities using ndb.get_multi(keys)
:
Almost 3x difference!!
Profiling each step is shown below. Note the timings don't match appstats because they're running on an F1 instance, so real time is 3x clock time.
Manual version:
# memcache.get_multi: 50.0 ms
# Deserialization: 140.0 ms
# Number of keys is 521, fetch time per key is 0.364683301344 ms
vs ndb version:
# ndb.get_multi: 500 ms
# Number of keys is 521, fetch time per key is 0.96 ms
So ndb takes 1ms per entity fetched, even if the entity has one single property and is in memcache. That's on an F4 instance. On an F1 instance it takes 3ms. This is a serious practical limitation: if you want to maintain reasonable latency, you can't fetch more than ~100 entities of any kind when handling a user request on an F1 instance.
Clearly ndb is doing something really expensive and (at least in this case) unnecessary. I think it has something to do with its task queue and all the futures it sets up. Whether it is worth going around ndb and doing things manually depends on your app. If you have some memcache misses then you will have to go do the datastore fetches. So you essentially end up partly reimplementing ndb. However, since ndb seems to have such massive overhead, this may be worth doing. At least it seems so based on my use case of a lot of get_multi calls for small objects, with a high expected memcache hit rate.
It also seems to suggest that if Google were to implement some key bits of ndb and/or deserialization as C modules, Python App Engine could be massively faster.