What is the proper way of using connection pools in a streaming PySpark application?
I read through https://forums.databricks.com/questions/3057/how-to-reuse-database-
Long story short, connection pools will be less useful in Python than on the JVM because of the PySpark architecture. Unlike its Scala counterpart, PySpark runs each executor's work in separate Python worker processes. That means there is no shared state between workers, and since by default each partition is processed sequentially, you can have only one active connection per interpreter.
Of course it can still be useful to maintain connections between batches. To achieve that you'll need two things:

- spark.python.worker.reuse has to be set to true.
- Some way of referencing an object between individual tasks, so the connection outlives a single task.

The first one is pretty obvious and the second one is not really Spark specific. You can for example use a module singleton (you'll find a Spark example in my answer to How to run a function on all Spark workers before processing data in PySpark?) or a Borg pattern; a sketch of the module-singleton variant follows below.
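Here is a minimal sketch of how the two pieces could fit together with the DStream API. The module name connection_singleton, the dbmodule driver, the events table, and the host/port values are placeholders I introduced for illustration, not anything from the linked answers:

```python
# --- connection_singleton.py (shipped to the workers, e.g. via addPyFile) ---
# Module-level state lives as long as the Python worker process does, so with
# spark.python.worker.reuse=true it also survives between batches.
# dbmodule and its connect() call stand in for your actual DB driver.

_connection = None

def get_connection():
    global _connection
    if _connection is None:
        import dbmodule                                   # hypothetical driver
        _connection = dbmodule.connect(host="db-host", database="events")
    return _connection


# --- driver.py ---
from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext

def save_partition(records):
    # The import runs on the worker; the module (and its connection) is
    # created once per interpreter and reused by later tasks and batches.
    from connection_singleton import get_connection
    conn = get_connection()
    cursor = conn.cursor()
    for record in records:
        cursor.execute("INSERT INTO events VALUES (%s, %s)", record)
    conn.commit()

if __name__ == "__main__":
    conf = (SparkConf()
            .setAppName("streaming-connection-reuse")
            .set("spark.python.worker.reuse", "true"))    # keep worker processes alive
    sc = SparkContext(conf=conf)
    sc.addPyFile("connection_singleton.py")               # make the module importable on workers
    ssc = StreamingContext(sc, 5)                         # 5 second batches

    lines = ssc.socketTextStream("localhost", 9999)
    records = lines.map(lambda line: tuple(line.split(",", 1)))

    records.foreachRDD(
        lambda rdd: rdd.foreachPartition(save_partition))

    ssc.start()
    ssc.awaitTermination()
```

Keeping the singleton in its own module (rather than in the driver script) matters: a module distributed to the workers is imported once per interpreter, so its state is naturally shared by all tasks that interpreter runs, which is exactly where the reused connection should live.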