Throttle concurrent HTTP requests from Spark executors

Submitted by 早过忘川 on 2020-05-27 06:42:07

Question


I want to make HTTP requests from inside a Spark job to a rate-limited API. To keep track of the number of concurrent requests in a non-distributed system (in Scala), the following approaches work:

  • a throttling actor that maintains a semaphore (a counter) that is incremented when a request starts and decremented when it completes. Although Akka can be distributed, (de)serializing the actorSystem in a distributed Spark context causes problems.
  • using parallel streams with fs2 (https://fs2.io/concurrency-primitives.html), which cannot be distributed.
  • I suppose I could also just collect the dataframes to the Spark driver and handle throttling there with one of the above options, but I would like to keep this distributed.

How are such things typically handled?


Answer 1:


You shouldn't try to synchronise requests across Spark executors/partitions. That goes against Spark's concurrency model, in which tasks run independently and do not coordinate with one another.

Instead, for example, divide the global rate limit R by the number of tasks that can run at once (executors e × cores per executor c) and use mapPartitions to send requests from each partition within its own R/(e*c) rate limit.
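
As a rough illustration of that idea, here is a minimal sketch in plain Scala. The names `PartitionThrottle`, `perTaskRate` and `throttled` are made up for this example and are not part of Spark. It assumes a fixed number of executors (no dynamic allocation), one core per task, and no speculative execution, so that at most e*c tasks send requests at the same time.

```scala
object PartitionThrottle {

  /** Per-task share of the global limit: R / (executors * coresPerExecutor). */
  def perTaskRate(globalRatePerSec: Double, executors: Int, coresPerExecutor: Int): Double = {
    require(globalRatePerSec > 0 && executors > 0 && coresPerExecutor > 0)
    globalRatePerSec / (executors.toLong * coresPerExecutor)
  }

  /**
   * Lazily applies `f` to each record of one partition, spacing the start of
   * consecutive calls at least 1 / ratePerSec seconds apart.
   * Hypothetical helper for illustration; blocking sleep is used for simplicity.
   */
  def throttled[A, B](records: Iterator[A], ratePerSec: Double)(f: A => B): Iterator[B] = {
    require(ratePerSec > 0, "ratePerSec must be positive")
    val minIntervalNanos = (1e9 / ratePerSec).toLong
    // Earliest time at which the next call may start; local to this partition.
    var nextAllowed = System.nanoTime()
    records.map { record =>
      val waitNanos = nextAllowed - System.nanoTime()
      if (waitNanos > 0) {
        Thread.sleep(waitNanos / 1000000L, (waitNanos % 1000000L).toInt)
      }
      nextAllowed = math.max(nextAllowed, System.nanoTime()) + minIntervalNanos
      f(record)
    }
  }
}
```

In a Spark job this could be wired in roughly as `rdd.mapPartitions(records => PartitionThrottle.throttled(records, rate)(callApi))`, where `rate` comes from `perTaskRate` and `callApi` is a placeholder for whatever function performs the HTTP request. Because each task throttles itself independently, the combined rate stays at or below R only while the assumptions above hold; it will usually be somewhat lower than R when not all task slots are busy.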



Source: https://stackoverflow.com/questions/58880255/throttle-concurrent-http-requests-from-spark-executors
