Connection pooling in a streaming PySpark application

Posted by 烂漫一生 on 2019-12-20 04:15:48

Question


What is the proper way of using connection pools in a streaming PySpark application?

I read through https://forums.databricks.com/questions/3057/how-to-reuse-database-session-object-created-in-fo.html and understand that the proper way is to use a singleton in Scala/Java. Is this possible in Python? A small code example would be greatly appreciated. I believe creating a connection per partition would be very inefficient for a streaming application.


Answer 1:


Long story short, connection pools are less useful in Python than on the JVM because of PySpark's architecture. Unlike its Scala counterpart, PySpark runs executor code in separate Python worker processes. This means there is no shared state between executors, and since each partition is by default processed sequentially, you can have only one active connection per interpreter.

Of course, it can still be useful to maintain connections between batches. To achieve that you'll need two things:

  • spark.python.worker.reuse has to be set to true.
  • A way to reference an object between different calls.

The first one is pretty obvious, and the second one is not really Spark specific. You can, for example, use a module-level singleton (you'll find a Spark example in my answer to How to run a function on all Spark workers before processing data in PySpark?) or the Borg pattern.
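As a minimal sketch of the module-singleton idea, the snippet below keeps one connection per Python worker process and reuses it across batches, provided spark.python.worker.reuse is enabled. The `create_connection()` factory and `conn.send()` call are hypothetical placeholders for whatever database client you actually use; in practice it is also safer to put the singleton in a separate module shipped to the workers (e.g. via --py-files), as the linked answer does, rather than in the main script.

```python
from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext

# Module-level singleton: the connection lives in the worker process and is
# reused across tasks and batches as long as the worker process is reused.
connection = None

def get_connection():
    global connection
    if connection is None:
        # Hypothetical factory, e.g. a database client constructor.
        connection = create_connection()
    return connection

def save_partition(records):
    conn = get_connection()
    for record in records:
        conn.send(record)  # hypothetical write call

# Keep Python worker processes alive between tasks so the singleton survives.
conf = SparkConf().set("spark.python.worker.reuse", "true")
sc = SparkContext(conf=conf)
ssc = StreamingContext(sc, batchDuration=10)

lines = ssc.socketTextStream("localhost", 9999)
lines.foreachRDD(lambda rdd: rdd.foreachPartition(save_partition))

ssc.start()
ssc.awaitTermination()
```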



Source: https://stackoverflow.com/questions/38255924/connection-pooling-in-a-streaming-pyspark-application
