Question
I want to make HTTP requests from inside a Spark job to a rate-limited API. To keep track of the number of concurrent requests, the following approaches work in a non-distributed system (in Scala):
- a throttling actor that maintains a semaphore (counter), incremented when a request starts and decremented when it completes. Although Akka is distributed, there are issues (de)serializing the `actorSystem` in a distributed Spark context.
- parallel streams with fs2: https://fs2.io/concurrency-primitives.html => cannot be distributed.
- I suppose I could also just `collect` the dataframes to the Spark driver and handle throttling there with one of the above options, but I would like to keep this distributed.
How are such things typically handled?
Answer 1:
You shouldn't try to synchronise requests across Spark executors/partitions; that goes against Spark's concurrency model. Instead, for example, divide the global rate limit R by executors * cores and use `mapPartitions` to send requests from each partition within its own R/(e*c) rate limit.
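A minimal sketch of this idea, assuming a known global limit R (requests/sec) and a known executors * cores product; the `Limiter` class and the `minIntervalMs` helper are illustrative names, not a Spark or library API:

```scala
// Per-partition rate limiting: the global limit R is split evenly across
// executors * cores partitions, so each partition spaces its requests by
// (executors * cores / R) seconds. No cross-executor coordination needed.
object PartitionRateLimiter {
  // Minimum interval (ms) between requests for one partition, given a
  // global rate R (req/sec) spread over executors * coresPerExecutor slots.
  def minIntervalMs(globalRatePerSec: Double, executors: Int, coresPerExecutor: Int): Long =
    (1000.0 * executors * coresPerExecutor / globalRatePerSec).toLong

  // A simple blocking limiter, instantiated once per partition
  // (e.g. at the top of a mapPartitions closure).
  final class Limiter(intervalMs: Long) {
    private var last = 0L
    def acquire(): Unit = synchronized {
      val wait = last + intervalMs - System.currentTimeMillis()
      if (wait > 0) Thread.sleep(wait)
      last = System.currentTimeMillis()
    }
  }
}
```

Inside the Spark job it would be used roughly like this (where `callApi` stands in for the real HTTP call): `df.mapPartitions { iter => val lim = new PartitionRateLimiter.Limiter(intervalMs); iter.map { row => lim.acquire(); callApi(row) } }`. The key point is that the limiter is created inside the `mapPartitions` closure, so it lives on the executor and nothing needs to be serialized from the driver.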
Source: https://stackoverflow.com/questions/58880255/throttle-concurrent-http-requests-from-spark-executors