Best practice to query large number of ndb entities from datastore

有刺的猬 2020-12-07 07:11

I have run into an interesting limit with the App Engine datastore. I am creating a handler to help us analyze some usage data on one of our production servers. To perform […]

4 Answers
  • 2020-12-07 07:57

    I have a similar problem and, after working with Google support for a few weeks, I can confirm there is no magic solution, at least as of December 2017.

    tl;dr: You can expect throughput from about 220 entities/second with the standard SDK on a B1 instance up to about 900 entities/second with a patched SDK on a B8 instance.

    The limitation is CPU-related, and changing the instance type directly impacts performance. This is confirmed by the similar results obtained on B4 and B4_1G instances.

    The best throughput I got for an Expando entity with about 30 fields is:

    Standard GAE SDK

    • B1 instance: ~220 entities/second
    • B2 instance: ~250 entities/second
    • B4 instance: ~560 entities/second
    • B4_1G instance: ~560 entities/second
    • B8 instance: ~650 entities/second

    Patched GAE SDK

    • B1 instance: ~420 entities/second
    • B8 instance: ~900 entities/second

    For the standard GAE SDK I tried various approaches, including multi-threading, but the best proved to be fetch_async combined with wait_any. The current NDB library already does a great job of using async and futures under the hood, so any attempt to push it further with threads only makes things worse.
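
    A minimal sketch of that pattern (not the answerer's actual code): several independent queries are started with fetch_async so their RPCs overlap, and ndb.Future.wait_any hands back whichever finishes first. The UsageEvent kind, its day property, and the process() helper are hypothetical names used only for illustration.

        from google.appengine.ext import ndb

        class UsageEvent(ndb.Expando):  # hypothetical ~30-field usage entity
            day = ndb.DateProperty()

        def process(entity, day):
            pass  # hypothetical per-entity work

        def scan_days(days, page_size=500):
            # Start one asynchronous fetch per day so the datastore RPCs overlap.
            futures = {UsageEvent.query(UsageEvent.day == d).fetch_async(page_size): d
                       for d in days}
            while futures:
                # Block until *any* outstanding fetch is done, then process it
                # while the remaining RPCs keep running in the background.
                done = ndb.Future.wait_any(list(futures))
                day = futures.pop(done)
                for entity in done.get_result():
                    process(entity, day)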

    I found two interesting approaches to optimize this:

    • Matt Faus - Speeding up GAE Datastore Reads with Protobuf Projection
    • Evan Jones - Tracing a Python performance bug on App Engine

    Matt Faus explains the problem very well:

    GAE SDK provides an API for reading and writing objects derived from your classes to the datastore. This saves you the boring work of validating raw data returned from the datastore and repackaging it into an easy-to-use object. In particular, GAE uses protocol buffers to transmit raw data from the store to the frontend machine that needs it. The SDK is then responsible for decoding this format and returning a clean object to your code. This utility is great, but sometimes it does a bit more work than you would like. [...] Using our profiling tool, I discovered that fully 50% of the time spent fetching these entities was during the protobuf-to-python-object decoding phase. This means that the CPU on the frontend server was a bottleneck in these datastore reads!

    [Diagram from Matt's post: the GAE data-access web request]

    Both approaches try to reduce the time spent doing protobuf to Python decoding by reducing the number of fields decoded.
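
    Matt's patched SDK is not reproduced here, but the closest built-in analogue is an NDB projection query: the datastore returns only the named (indexed) properties, so far fewer fields have to be decoded into Python objects. It is not a drop-in replacement for Matt's patch (projections read from indexes and cannot return unindexed properties), but it illustrates the same "decode fewer fields" idea. UsageEvent, user_id, and count below are hypothetical names.

        from google.appengine.ext import ndb

        class UsageEvent(ndb.Expando):  # hypothetical entity with ~30 properties
            user_id = ndb.StringProperty()
            count = ndb.IntegerProperty()

        # Only user_id and count are returned and decoded into Python objects;
        # the remaining properties never leave the datastore, so far less CPU
        # is spent in the protobuf-to-Python phase.
        events = UsageEvent.query().fetch(
            1000, projection=[UsageEvent.user_id, UsageEvent.count])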

    I tried both approaches but only succeeded with Matt's; the SDK internals have changed since Evan published his solution. I had to tweak the code Matt published a bit, but it was pretty easy. If there is interest I can publish the final code.

    For a regular Expando entity with about 30 fields, I used Matt's solution to decode only a couple of fields and obtained a significant improvement.

    In conclusion, plan accordingly and don't expect to be able to process much more than a few hundred entities in a "real-time" GAE request.

  • 2020-12-07 08:04

    The new experimental Data Processing feature (an App Engine API for MapReduce) looks very well suited to this problem. It does automatic sharding to execute multiple parallel worker processes.
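
    For reference, here is a rough sketch of what a job looks like with the open-source appengine-mapreduce library; exact module paths and parameters vary between library versions, and main.UsageEvent, usage_map, and usage_reduce are hypothetical names.

        from mapreduce import base_handler, mapreduce_pipeline

        def usage_map(entity):
            # Runs once per entity, on many shards in parallel.
            yield (entity.user_id, 1)

        def usage_reduce(key, values):
            # Runs once per key with all the mapped values for that key.
            yield '%s: %d\n' % (key, len(values))

        class UsageReportPipeline(base_handler.PipelineBase):
            def run(self):
                yield mapreduce_pipeline.MapreducePipeline(
                    'usage_report',
                    mapper_spec='main.usage_map',
                    reducer_spec='main.usage_reduce',
                    input_reader_spec='mapreduce.input_readers.DatastoreInputReader',
                    output_writer_spec='mapreduce.output_writers.BlobstoreOutputWriter',
                    mapper_params={'entity_kind': 'main.UsageEvent'},
                    reducer_params={'mime_type': 'text/plain'},
                    shards=16)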

  • 2020-12-07 08:07

    Large data operations on App Engine are best implemented using some sort of MapReduce operation.

    Here's a video describing the process, though it also covers BigQuery: https://developers.google.com/events/io/sessions/gooio2012/307/

    It doesn't sound like you need BigQuery, but you probably want to use both the Map and Reduce portions of the pipeline.

    The main difference between what you're doing and the MapReduce setup is that you're launching one instance and iterating through the queries, whereas with MapReduce you would have a separate instance running in parallel for each query. You will, however, need a reduce operation to "sum up" all the data and write the result somewhere.

    The other problem you have is that you should use cursors to iterate. https://developers.google.com/appengine/docs/java/datastore/queries#Query_Cursors

    If the iterator is using a query offset, it'll be inefficient, since an offset issues the same query, skips past a number of results, and gives you the next set, while the cursor jumps straight to the next set.
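
    A minimal NDB cursor loop, in case it helps (UsageEvent and process() are placeholder names). fetch_page returns a cursor that resumes exactly where the previous batch ended, and cursor.urlsafe() can be handed to a follow-up task to continue later.

        from google.appengine.ext import ndb

        class UsageEvent(ndb.Expando):  # placeholder kind
            pass

        def process(entity):
            pass  # placeholder per-entity work

        def iterate_all(batch_size=500):
            cursor, more = None, True
            while more:
                # Resumes where the previous batch ended instead of re-scanning
                # skipped results the way an offset does.
                entities, cursor, more = UsageEvent.query().fetch_page(
                    batch_size, start_cursor=cursor)
                for entity in entities:
                    process(entity)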

  • 2020-12-07 08:10

    Large processing like this should not be done in a user request, which has a 60-second time limit. Instead, it should be done in a context that supports long-running requests. The task queue supports requests of up to 10 minutes, with (I believe) the normal memory constraints (F1 instances, the default, have 128MB of memory). For even higher limits (no request timeout, 1GB+ of memory), use backends.

    Here's something to try: set up a URL that, when accessed, fires off a task queue task. It returns a web page that polls another URL every ~5 seconds, which reports whether the task has completed yet. The task processes the data, which can take tens of seconds, and saves the result to the datastore, either as the computed data or as a rendered web page. Once the initial page detects that the task has completed, the user is redirected to a page that fetches the now-computed results from the datastore.
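
    A bare-bones sketch of that flow (the handler names, the Report model, and run_analysis() are all hypothetical, and the polling JavaScript is omitted):

        import json
        import uuid
        import webapp2
        from google.appengine.api import taskqueue
        from google.appengine.ext import ndb

        class Report(ndb.Model):  # hypothetical model holding job status and result
            done = ndb.BooleanProperty(default=False)
            result = ndb.TextProperty()

        def run_analysis():
            return 'computed summary'  # hypothetical long-running datastore work

        class StartHandler(webapp2.RequestHandler):
            def get(self):
                job_id = uuid.uuid4().hex
                Report(id=job_id).put()
                # Hand the heavy work to the task queue (up to 10 minutes per request).
                taskqueue.add(url='/tasks/analyze', params={'job_id': job_id})
                # Return a page whose JavaScript polls /status?job_id=... every ~5s
                # and redirects to the results page once {"done": true} comes back.
                self.response.write('job started: %s' % job_id)

        class AnalyzeTask(webapp2.RequestHandler):
            def post(self):
                report = Report.get_by_id(self.request.get('job_id'))
                report.result = run_analysis()
                report.done = True
                report.put()

        class StatusHandler(webapp2.RequestHandler):
            def get(self):
                report = Report.get_by_id(self.request.get('job_id'))
                self.response.headers['Content-Type'] = 'application/json'
                self.response.write(json.dumps({'done': bool(report and report.done)}))

        app = webapp2.WSGIApplication([
            ('/start', StartHandler),
            ('/tasks/analyze', AnalyzeTask),
            ('/status', StatusHandler),
        ])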
