Question
What is the proper way of using connection pools in a streaming PySpark application?
I read through https://forums.databricks.com/questions/3057/how-to-reuse-database-session-object-created-in-fo.html and understand that the proper way is to use a singleton for Scala/Java. Is this possible in Python? A small code example would be greatly appreciated. I believe creating a connection per partition will be very inefficient for a streaming application.
Answer 1:
Long story short, connection pools will be less useful in Python than on the JVM because of the PySpark architecture. Unlike their Scala counterparts, Python executors use separate processes. This means there is no shared state between executors, and since by default each partition is processed sequentially, you can have only one active connection per interpreter.
Of course, it can still be useful to maintain connections between batches. To achieve that you'll need two things (see the sketch after this list):
- spark.python.worker.reuse has to be set to true.
- A way to reference an object between different calls.
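A minimal configuration sketch showing where spark.python.worker.reuse would be set when building a DStream-based streaming job; the app name and batch interval below are placeholders, not from the original answer:

```python
from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext

# Keep Python worker processes alive between tasks so module-level state
# (such as a connection pool) can survive across batches.
conf = (SparkConf()
        .setAppName("streaming-connection-reuse")   # placeholder app name
        .set("spark.python.worker.reuse", "true"))

sc = SparkContext(conf=conf)
ssc = StreamingContext(sc, 10)  # 10-second batch interval, just an example
```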
The first one is pretty obvious, and the second one is not really Spark specific. You can, for example, use a module singleton (you'll find a Spark example in my answer to How to run a function on all Spark workers before processing data in PySpark?) or a Borg pattern.
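A minimal sketch of the module-singleton approach, assuming a hypothetical helper module (here called pool_singleton.py, shipped to executors, e.g. via --py-files) and a psycopg2 connection pool; the module name, DSN, table, and function names are illustrative assumptions, not part of the original answer:

```python
# pool_singleton.py -- hypothetical helper module available on the executors.
# The module-level _pool acts as a singleton: it is created once per Python
# worker process and reused across tasks as long as spark.python.worker.reuse
# keeps that process alive.
_pool = None

def get_connection():
    global _pool
    if _pool is None:
        import psycopg2.pool  # assumed database driver; substitute your own
        _pool = psycopg2.pool.SimpleConnectionPool(1, 1, dsn="dbname=events")  # placeholder DSN
    return _pool.getconn()

def return_connection(conn):
    _pool.putconn(conn)
```

On the streaming side, each partition then borrows a connection from the process-local pool instead of opening a new one per partition:

```python
def send_partition(records):
    from pool_singleton import get_connection, return_connection
    conn = get_connection()
    try:
        cur = conn.cursor()
        for record in records:
            cur.execute("INSERT INTO events VALUES (%s)", (record,))  # placeholder query
        conn.commit()
    finally:
        return_connection(conn)

stream.foreachRDD(lambda rdd: rdd.foreachPartition(send_partition))
```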
Source: https://stackoverflow.com/questions/38255924/connection-pooling-in-a-streaming-pyspark-application