What is the proper way of using connection pools in a streaming PySpark application?
I read through https://forums.databricks.com/questions/3057/how-to-reuse-database-
Long story short, connection pools will be less useful in Python than on the JVM because of the PySpark architecture. Unlike its Scala counterpart, PySpark runs each executor's work in separate Python worker processes. That means there is no shared state between workers, and since by default each partition is processed sequentially, you can have only one active connection per interpreter.
Of course it can still be useful to maintain connections between batches. To achieve that you'll need two things:

- spark.python.worker.reuse has to be set to true.
- Some way of referencing an object between individual tasks, so the connection outlives a single task.

The first one is pretty obvious and the second one is not really Spark specific. You can for example use a module singleton (you'll find a Spark example in my answer to How to run a function on all Spark workers before processing data in PySpark?) or a Borg pattern; a sketch of the module-singleton variant follows below.
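Here is a minimal sketch of how the two pieces could fit together with the DStream API. The module name connection_singleton, the dbmodule driver, the events table, and the host/port values are placeholders I introduced for illustration, not anything from the linked answers:

```python
# --- connection_singleton.py (shipped to the workers, e.g. via addPyFile) ---
# Module-level state lives as long as the Python worker process does, so with
# spark.python.worker.reuse=true it also survives between batches.
# dbmodule and its connect() call stand in for your actual DB driver.

_connection = None

def get_connection():
    global _connection
    if _connection is None:
        import dbmodule                                   # hypothetical driver
        _connection = dbmodule.connect(host="db-host", database="events")
    return _connection


# --- driver.py ---
from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext

def save_partition(records):
    # The import runs on the worker; the module (and its connection) is
    # created once per interpreter and reused by later tasks and batches.
    from connection_singleton import get_connection
    conn = get_connection()
    cursor = conn.cursor()
    for record in records:
        cursor.execute("INSERT INTO events VALUES (%s, %s)", record)
    conn.commit()

if __name__ == "__main__":
    conf = (SparkConf()
            .setAppName("streaming-connection-reuse")
            .set("spark.python.worker.reuse", "true"))    # keep worker processes alive
    sc = SparkContext(conf=conf)
    sc.addPyFile("connection_singleton.py")               # make the module importable on workers
    ssc = StreamingContext(sc, 5)                         # 5 second batches

    lines = ssc.socketTextStream("localhost", 9999)
    records = lines.map(lambda line: tuple(line.split(",", 1)))

    records.foreachRDD(
        lambda rdd: rdd.foreachPartition(save_partition))

    ssc.start()
    ssc.awaitTermination()
```

Keeping the singleton in its own module (rather than in the driver script) matters: a module distributed to the workers is imported once per interpreter, so its state is naturally shared by all tasks that interpreter runs, which is exactly where the reused connection should live.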