Question
I'm trying to solve the following problem:
- I have a series of "tasks" which I would like to execute
- I have a fixed number of workers to execute these tasks (since they call an external API using urlfetch, and the number of parallel calls to this API is limited)
- I would like these "tasks" to be executed "as soon as possible" (i.e. with minimum latency)
- These tasks are parts of larger tasks and can be categorized based on the size of the original task (i.e. a small original task might generate 1 to 100 tasks, a medium one 100 to 1000 and a large one over 1000).

The tricky part: I would like to do all this efficiently (i.e. minimum latency, using as many parallel API calls as possible without going over the limit), but at the same time prevent a large number of tasks generated from "large" original tasks from delaying the tasks generated from "small" original tasks.

To put it another way: I would like each task to have a "priority", with "small" tasks having a higher priority, and thus prevent them from being starved by "large" tasks.
Some searching around doesn't seem to indicate that anything pre-made is available, so I came up with the following:
- create three push queues: `tasks-small`, `tasks-medium`, `tasks-large`
- set a maximum number of concurrent requests for each, such that the total is the maximum number of concurrent API calls (for example, if the max. no. of concurrent API calls is 200, I could set up `tasks-small` to have a `max_concurrent_requests` of 30, `tasks-medium` 60 and `tasks-large` 100)
- when enqueueing a task, check the number of pending tasks in each queue (using something like the QueueStatistics class) and, if another queue is not 100% utilized, enqueue the task there; otherwise just enqueue the task on the queue corresponding to its size.
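The three-queue setup could look like this in `queue.yaml` (queue names and concurrency numbers are taken from the example above; a sketch, not a complete configuration):

```yaml
queue:
- name: tasks-small
  max_concurrent_requests: 30
- name: tasks-medium
  max_concurrent_requests: 60
- name: tasks-large
  max_concurrent_requests: 100
```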
For example, if we have task `T1`, which is part of a small task, first check if `tasks-small` has free "slots" and enqueue it there. Otherwise check `tasks-medium` and `tasks-large`. If none of them have free slots, enqueue it on `tasks-small` anyway, and it will be processed after the tasks added before it (note: this is not optimal, because if "slots" free up on the other queues, they still won't process pending tasks from the `tasks-small` queue).
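The fallback selection described above can be sketched as plain selection logic. This is a simulation, not App Engine API calls: the pending counts would really come from `QueueStatistics.fetch()`, and the fallback orders for medium/large tasks are my assumption (the question only spells out the order for small tasks).

```python
# Capacity per queue, mirroring the max_concurrent_requests example.
CAPACITY = {"tasks-small": 30, "tasks-medium": 60, "tasks-large": 100}

# Preferred fallback order per original-task size. Only the "small"
# order is given in the text; the other two are assumptions.
FALLBACK = {
    "small": ["tasks-small", "tasks-medium", "tasks-large"],
    "medium": ["tasks-medium", "tasks-large", "tasks-small"],
    "large": ["tasks-large", "tasks-small", "tasks-medium"],
}

def pick_queue(size, pending):
    """Return the first queue in the fallback order with a free slot.

    `pending` maps queue name -> number of currently pending tasks
    (in a real app this would come from QueueStatistics.fetch()).
    If every queue is full, fall back to the size's own queue anyway.
    """
    for name in FALLBACK[size]:
        if pending.get(name, 0) < CAPACITY[name]:
            return name
    return FALLBACK[size][0]
```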
Another option would be to use a PULL queue and have a central "coordinator" pull from that queue based on priorities and dispatch the tasks; however, that seems to add a little more latency.
However this seems a little bit hackish and I'm wondering if there are better alternatives out there.
EDIT: after some thought and feedback I'm thinking of using a PULL queue after all, in the following way:
- have two PULL queues (`medium-tasks` and `large-tasks`)
- have a dispatcher (PUSH) queue with a concurrency of 1 (so that only one dispatch task runs at any time). Dispatch tasks are created in multiple ways:
  - by a once-a-minute cron job
  - after adding a medium/large task to the pull queues
  - after a worker task finishes
- have a worker (PUSH) queue with a concurrency equal to the number of workers
And the workflow:
- small tasks are added directly to the worker queue
- the dispatcher task, whenever it is triggered, does the following:
- estimates the number of free workers (by looking at the number of running tasks in the worker queue)
- for each "free" slot it takes a task from the medium/large-tasks PULL queues and enqueues it on a worker (or more precisely: adds it to the worker PUSH queue, which will result in it being executed, eventually, on a worker).
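One dispatcher run from the workflow above can be sketched as a pure-Python simulation (the lists stand in for the PULL queues; whether medium tasks are leased before large ones is my assumption, as the text doesn't specify an order):

```python
def dispatch(running_workers, max_workers, medium_queue, large_queue):
    """One dispatcher run: fill free worker slots from the pull queues.

    `medium_queue` / `large_queue` are lists standing in for the PULL
    queues. Returns the tasks to add to the worker PUSH queue.
    """
    free = max_workers - running_workers
    batch = []
    # Assumption: lease medium tasks first, then large ones,
    # up to the number of free worker slots.
    for queue in (medium_queue, large_queue):
        while free > 0 and queue:
            batch.append(queue.pop(0))
            free -= 1
    return batch
```

Because the dispatcher queue has a concurrency of 1, only one such run mutates the pull queues at a time, which keeps the free-slot estimate from being invalidated by a concurrent dispatcher.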
I'll report back once this is implemented and at least moderately tested.
Answer 1:
The small/medium/large original-task queues won't help much by themselves: once the original tasks are enqueued, they'll keep spawning worker tasks, potentially even exceeding the worker queue's size limit. So you need to pace/control the enqueueing of the original tasks.
I'd keep track of the "todo" original tasks in the datastore/GCS and enqueue them only when the respective queue size is sufficiently low (1 or maybe 2 pending jobs). This could be done from a recurring task, a cron job or a deferred task (depending on the rate at which you need to enqueue the original tasks), which would implement the desired pacing and priority logic just like a push-queue dispatcher, but without the extra latency you mentioned.
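The pacing idea can be sketched as follows (a simulation: the backlog list stands in for the datastore/GCS records, already sorted by the desired priority, and the low-water mark of 2 comes from the "1 or maybe 2 pending jobs" suggestion):

```python
def originals_to_enqueue(todo, pending_in_queue, low_water=2):
    """Release more original tasks only when the queue has drained.

    `todo` is the backlog of original tasks tracked in the
    datastore/GCS, ordered by priority (small first). When the number
    of pending jobs drops below `low_water`, top the queue back up.
    """
    if pending_in_queue >= low_water:
        return []
    return todo[: low_water - pending_in_queue]
```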
Answer 2:
I have not used pull queues, but from my understanding they could suit your use case very well. You could define 3 pull queues and have `X` workers all pulling tasks from them, first trying the "small" queue and then moving on to "medium" if it is empty (where `X` is your maximum concurrency). You should not need a central dispatcher.

However, you would then be left paying for `X` workers even when there are no tasks (or `X / threadsPerMachine`?), or have to scale them down and up yourself.
So, here is another thought: make a single push queue with the correct maximum concurrency. When you receive a new task, push its info to the datastore and queue up a generic job. That generic job then consults the datastore, looking for tasks in priority order, and executes the first one it finds. This way a short task will still be executed by the next job, even if that job was originally enqueued for a large task.
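The generic job's selection step can be sketched like this (a simulation: the tuple list stands in for the datastore records, and the tie-break of "oldest first within a size" is my assumption):

```python
# Lower number = higher priority, per the small-before-large requirement.
PRIORITY = {"small": 0, "medium": 1, "large": 2}

def next_task(datastore_tasks):
    """Pick the highest-priority pending task, oldest first within a size.

    `datastore_tasks` stands in for the datastore records: a list of
    (size, created_at, payload) tuples. Returns the payload to execute,
    or None if nothing is pending.
    """
    if not datastore_tasks:
        return None
    _, _, payload = min(
        datastore_tasks, key=lambda t: (PRIORITY[t[0]], t[1])
    )
    return payload
```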
Answer 3:
EDIT: I now migrated to a simpler solution, similar to what @eric-simonton described:
- I have multiple PULL queues, one for each priority
- Many workers pull on an endpoint (handler)
- The handler generates a random number and does a simple "if less than 0.6, try the small queue first and then the large queue, else vice versa (large then small)"
- If the workers get no tasks or an error, they do semi-random exponential backoff up to a maximum timeout (i.e. they start pulling every 1 second and approximately double the interval after each empty pull, up to 30 seconds)
This final point is needed - amongst other reasons - because the number of pulls / second from a PULL queue is limited to 10k/s: https://cloud.google.com/appengine/docs/python/taskqueue/overview-pull#Python_Leasing_tasks
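The randomized queue order and the backoff can be sketched as follows (the 0.6 split and the 1 s → 30 s bounds come from the description above; the 25% jitter on the doubling is my assumption for "approximately double"):

```python
import random

def queue_order(rng=random):
    """Randomized pull order: ~60% small-first, ~40% large-first."""
    if rng.random() < 0.6:
        return ["small-tasks", "large-tasks"]
    return ["large-tasks", "small-tasks"]

def next_delay(current, maximum=30.0, rng=random):
    """Roughly double the polling interval after an empty pull, capped.

    The doubling factor is jittered by up to +/-12.5% (an assumption)
    so that idle workers don't all wake up in lockstep.
    """
    doubled = current * (2 + 0.5 * (rng.random() - 0.5))
    return min(doubled, maximum)
```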
I implemented the solution described in the EDIT to the question:
- two PULL queues (medium-tasks and large-tasks)
- a dispatcher (PUSH) queue with a concurrency of 1
- a worker (PUSH) queue with a concurrency equal to the number of workers
See the question for more details. Some notes:
- there is some delay in task visibility due to eventual consistency (i.e. the dispatcher tasks sometimes don't see the tasks from the pull queue, even if they are inserted together); I worked around this by adding a countdown of 5 seconds to the dispatcher tasks and also a cron job that adds a dispatcher task every minute (so if the original dispatcher task doesn't "see" the task from the pull queue, another one will come along later)
- made sure to name every task to eliminate the possibility of double-dispatching them
- you can't lease 0 items from the PULL queues :-)
- batch operations have an upper limit, so you have to do your own batching over the batch taskqueue calls
- there doesn't seem to be a way to programmatically get the "maximum parallelism" value for a queue, so I had to hard-code that in the dispatcher (to calculate how many more tasks it can schedule)
- don't add dispatcher tasks if there are already some (at least 10) in the queue
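The batching note above (batch operations have an upper limit) amounts to splitting your own item list into bounded chunks before making the batch taskqueue calls. A minimal sketch; the default limit of 100 here is an assumption, check the current quota docs for the actual per-call limits:

```python
def chunked(items, limit=100):
    """Split `items` into batches no larger than `limit`.

    Batch taskqueue calls (e.g. adding or leasing tasks) accept only a
    bounded number of items per call, so larger sets must be split and
    the underlying call issued once per chunk.
    """
    return [items[i : i + limit] for i in range(0, len(items), limit)]
```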
Source: https://stackoverflow.com/questions/38567153/how-can-tasks-be-prioritized-when-using-the-task-queue-on-google-app-engine