Question
I'm trying to solve the following problem:
- I have a series of "tasks" which I would like to execute
- I have a fixed number of workers to execute these tasks (since they call an external API using urlfetch, and the number of parallel calls to this API is limited)
- I would like these "tasks" to be executed "as soon as possible" (i.e. with minimum latency)
- These tasks are parts of larger tasks and can be categorized based on the size of the original task (i.e. a small original task might generate 1 to 100 tasks, a medium one 100 to 1000 and a large one over 1000).

The tricky part: I would like to do all this efficiently (i.e. minimum latency, using as many parallel API calls as possible without going over the limit), but at the same time prevent a large number of tasks generated from "large" original tasks from delaying the tasks generated from "small" original tasks.

To put it another way: I would like each task to have a "priority", with "small" tasks having a higher priority, and thus prevent them from being starved by "large" tasks.
Some searching around doesn't seem to indicate that anything pre-made is available, so I came up with the following:
- create three push queues: `tasks-small`, `tasks-medium`, `tasks-large`
- set a maximum number of concurrent requests for each, such that the total is the maximum number of concurrent API calls (for example, if the max. no. of concurrent API calls is 200, I could set up `tasks-small` to have a `max_concurrent_requests` of 30, `tasks-medium` 60 and `tasks-large` 100)
- when enqueueing a task, check the number of pending tasks in each queue (using something like the QueueStatistics class) and, if another queue is not 100% utilized, enqueue the task there; otherwise just enqueue the task on the queue corresponding to its size.
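The three-queue setup could look like this in `queue.yaml` (queue names and concurrency numbers are taken from the example above; a sketch, not a complete configuration):

```yaml
queue:
- name: tasks-small
  max_concurrent_requests: 30
- name: tasks-medium
  max_concurrent_requests: 60
- name: tasks-large
  max_concurrent_requests: 100
```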
For example, if we have task `T1`, which is part of a small task, first check if `tasks-small` has free "slots" and enqueue it there. Otherwise check `tasks-medium` and `tasks-large`. If none of them have free slots, enqueue it on `tasks-small` anyway, and it will be processed after the tasks added before it (note: this is not optimal, because if "slots" free up on the other queues, they still won't process pending tasks from the `tasks-small` queue).
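The fallback selection described above can be sketched as plain selection logic. This is a simulation, not App Engine API calls: the pending counts would really come from `QueueStatistics.fetch()`, and the fallback orders for medium/large tasks are my assumption (the question only spells out the order for small tasks).

```python
# Capacity per queue, mirroring the max_concurrent_requests example.
CAPACITY = {"tasks-small": 30, "tasks-medium": 60, "tasks-large": 100}

# Preferred fallback order per original-task size. Only the "small"
# order is given in the text; the other two are assumptions.
FALLBACK = {
    "small": ["tasks-small", "tasks-medium", "tasks-large"],
    "medium": ["tasks-medium", "tasks-large", "tasks-small"],
    "large": ["tasks-large", "tasks-small", "tasks-medium"],
}

def pick_queue(size, pending):
    """Return the first queue in the fallback order with a free slot.

    `pending` maps queue name -> number of currently pending tasks
    (in a real app this would come from QueueStatistics.fetch()).
    If every queue is full, fall back to the size's own queue anyway.
    """
    for name in FALLBACK[size]:
        if pending.get(name, 0) < CAPACITY[name]:
            return name
    return FALLBACK[size][0]
```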
Another option would be to use a PULL queue and have a central "coordinator" pull from that queue based on priorities and dispatch the tasks; however, that seems to add a little more latency.
However this seems a little bit hackish and I'm wondering if there are better alternatives out there.
EDIT: after some thought and feedback I'm thinking of using a PULL queue after all, in the following way:
- have two PULL queues (`medium-tasks` and `large-tasks`)
- have a dispatcher (PUSH) queue with a concurrency of 1 (so that only one dispatch task runs at any time). Dispatch tasks are created in multiple ways:
  - by a once-a-minute cron job
  - after adding a medium/large task to the pull queues
  - after a worker task finishes
- have a worker (PUSH) queue with a concurrency equal to the number of workers
And the workflow:
- small tasks are added directly to the worker queue
- the dispatcher task, whenever it is triggered, does the following:
- estimates the number of free workers (by looking at the number of running tasks in the worker queue)
- for each "free" slot it takes a task from the medium/large-tasks PULL queues and enqueues it on a worker (or more precisely: adds it to the worker PUSH queue, which will result in it being executed, eventually, on a worker).
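One dispatcher run from the workflow above can be sketched as a pure-Python simulation (the lists stand in for the PULL queues; whether medium tasks are leased before large ones is my assumption, as the text doesn't specify an order):

```python
def dispatch(running_workers, max_workers, medium_queue, large_queue):
    """One dispatcher run: fill free worker slots from the pull queues.

    `medium_queue` / `large_queue` are lists standing in for the PULL
    queues. Returns the tasks to add to the worker PUSH queue.
    """
    free = max_workers - running_workers
    batch = []
    # Assumption: lease medium tasks first, then large ones,
    # up to the number of free worker slots.
    for queue in (medium_queue, large_queue):
        while free > 0 and queue:
            batch.append(queue.pop(0))
            free -= 1
    return batch
```

Because the dispatcher queue has a concurrency of 1, only one such run mutates the pull queues at a time, which keeps the free-slot estimate from being invalidated by a concurrent dispatcher.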
I'll report back once this is implemented and at least moderately tested.
Answer 1:
The small/medium/large original-task queues won't help much by themselves: once the original tasks are enqueued, they'll keep spawning worker tasks, potentially even exceeding the worker queue's size limit. So you need to pace/control the enqueueing of the original tasks.
I'd keep track of the "todo" original tasks in the datastore/GCS and enqueue them only when the respective queue size is sufficiently low (1 or maybe 2 pending jobs). This could be done from a recurring task, a cron job or a deferred task (depending on the rate at which you need to enqueue the original tasks), which would implement the desired pacing and priority logic just like a push-queue dispatcher, but without the extra latency you mentioned.
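The pacing idea can be sketched as follows (a simulation: the backlog list stands in for the datastore/GCS records, already sorted by the desired priority, and the low-water mark of 2 comes from the "1 or maybe 2 pending jobs" suggestion):

```python
def originals_to_enqueue(todo, pending_in_queue, low_water=2):
    """Release more original tasks only when the queue has drained.

    `todo` is the backlog of original tasks tracked in the
    datastore/GCS, ordered by priority (small first). When the number
    of pending jobs drops below `low_water`, top the queue back up.
    """
    if pending_in_queue >= low_water:
        return []
    return todo[: low_water - pending_in_queue]
```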
Answer 2:
I have not used pull queues, but from my understanding they could suit your use case very well. You could define 3 pull queues and have `X` workers all pulling tasks from them, first trying the "small" queue and then moving on to "medium" if it is empty (where `X` is your maximum concurrency). You should not need a central dispatcher.

However, you would then be left paying for `X` workers even when there are no tasks (or `X / threadsPerMachine`?), or have to scale them down and up yourself.
So, here is another thought: make a single push queue with the correct maximum concurrency. When you receive a new task, push its info to the datastore and queue up a generic job. That generic job then consults the datastore, looking for tasks in priority order, and executes the first one it finds. This way a short task will still be executed by the next job, even if that job was originally enqueued for a large task.
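The generic job's selection step can be sketched like this (a simulation: the tuple list stands in for the datastore records, and the tie-break of "oldest first within a size" is my assumption):

```python
# Lower number = higher priority, per the small-before-large requirement.
PRIORITY = {"small": 0, "medium": 1, "large": 2}

def next_task(datastore_tasks):
    """Pick the highest-priority pending task, oldest first within a size.

    `datastore_tasks` stands in for the datastore records: a list of
    (size, created_at, payload) tuples. Returns the payload to execute,
    or None if nothing is pending.
    """
    if not datastore_tasks:
        return None
    _, _, payload = min(
        datastore_tasks, key=lambda t: (PRIORITY[t[0]], t[1])
    )
    return payload
```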
Answer 3:
EDIT: I now migrated to a simpler solution, similar to what @eric-simonton described:
- I have multiple PULL queues, one for each priority
- Many workers pull on an endpoint (handler)
- The handler generates a random number and does a simple "if less than 0.6, try the small queue first and then the large queue, else vice versa (large then small)"
- If the workers get no tasks or an error, they do semi-random exponential backoff up to a maximum timeout (i.e. they start pulling every 1 second and approximately double the interval after each empty pull, up to 30 seconds)
This final point is needed - amongst other reasons - because the number of pulls / second from a PULL queue is limited to 10k/s: https://cloud.google.com/appengine/docs/python/taskqueue/overview-pull#Python_Leasing_tasks
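The randomized queue order and the backoff can be sketched as follows (the 0.6 split and the 1 s → 30 s bounds come from the description above; the 25% jitter on the doubling is my assumption for "approximately double"):

```python
import random

def queue_order(rng=random):
    """Randomized pull order: ~60% small-first, ~40% large-first."""
    if rng.random() < 0.6:
        return ["small-tasks", "large-tasks"]
    return ["large-tasks", "small-tasks"]

def next_delay(current, maximum=30.0, rng=random):
    """Roughly double the polling interval after an empty pull, capped.

    The doubling factor is jittered by up to +/-12.5% (an assumption)
    so that idle workers don't all wake up in lockstep.
    """
    doubled = current * (2 + 0.5 * (rng.random() - 0.5))
    return min(doubled, maximum)
```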
I implemented the solution described in the EDIT to the question:
- two PULL queues (medium-tasks and large-tasks)
- a dispatcher (PUSH) queue with a concurrency of 1
- a worker (PUSH) queue with a concurrency equal to the number of workers
See the question for more details. Some notes:
- there is some delay in task visibility due to eventual consistency (i.e. the dispatcher tasks sometimes don't see the tasks from the pull queue, even if they are inserted together); I worked around this by adding a countdown of 5 seconds to the dispatcher tasks and also a cron job that adds a dispatcher task every minute (so if the original dispatcher task doesn't "see" the task from the pull queue, another one will come along later)
- made sure to name every task to eliminate the possibility of double-dispatching them
- you can't lease 0 items from the PULL queues :-)
- batch operations have an upper limit, so you have to do your own batching over the batch taskqueue calls
- there doesn't seem to be a way to programmatically get the "maximum parallelism" value for a queue, so I had to hard-code that in the dispatcher (to calculate how many more tasks it can schedule)
- don't add dispatcher tasks if there are already some (at least 10) in the queue
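The batching note above (batch operations have an upper limit) amounts to splitting your own item list into bounded chunks before making the batch taskqueue calls. A minimal sketch; the default limit of 100 here is an assumption, check the current quota docs for the actual per-call limits:

```python
def chunked(items, limit=100):
    """Split `items` into batches no larger than `limit`.

    Batch taskqueue calls (e.g. adding or leasing tasks) accept only a
    bounded number of items per call, so larger sets must be split and
    the underlying call issued once per chunk.
    """
    return [items[i : i + limit] for i in range(0, len(items), limit)]
```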
Source: https://stackoverflow.com/questions/38567153/how-can-tasks-be-prioritized-when-using-the-task-queue-on-google-app-engine