Is there a way to stream results to the driver without waiting for all partitions to complete execution?
I am new to Spark, so please point me in the right direction.
Generally speaking this is not something you would normally do in Spark. Typically we try to limit the amount of data that is passed through the driver to a minimum; there are two main reasons for that.
In a normal case you'd just let the job run, write to persistent storage, and eventually apply further processing steps to the results.
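For reference, a minimal sketch of that usual pattern could look like the following; the output path and the transformations are just placeholders.

from pyspark import SparkContext

sc = SparkContext(master="local[4]")

# Compute the results and persist them instead of pulling them to the driver
results = sc.parallelize(range(32), 8).map(lambda x: x * 2)
results.saveAsTextFile("/tmp/results")  # placeholder path, must not already exist

# A later step reads the persisted data and continues processing
processed = sc.textFile("/tmp/results").map(int).filter(lambda x: x > 10)
print(processed.count())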
If you want to be able to access the results iteratively you have a few options:
Use foreach / foreachPartition to push data to an external messaging system as it is produced, and use another process to consume and write it (a sketch follows below). This requires an additional component but can be conceptually easier: you can apply back pressure, buffer the results, and keep the merging logic away from the driver to minimize the risk of application failure.
Hack Spark accumulators. Spark accumulators are updated when a task has finished, so you can process the accumulated data in discrete batches as tasks complete.
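For the first option, a minimal sketch could look like this. It assumes a Kafka broker on localhost:9092 and the kafka-python package, but any message broker or queue would work the same way; the topic name and serialization are placeholders.

from pyspark import SparkContext

def push_partition(records):
    # Runs on the workers, so the producer is created inside the function
    from kafka import KafkaProducer
    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    for record in records:
        producer.send("results", str(record).encode("utf-8"))
    producer.flush()

sc = SparkContext(master="local[4]")
sc.parallelize(range(32), 8).foreachPartition(push_partition)

# A separate consumer process reads the "results" topic and writes the output
# as partitions finish, without waiting for the whole job.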
Warning: The following code is just a proof of concept. It hasn't been properly tested and is most likely highly unreliable.
Example AccumulatorParam using RxPY:
# results_param.py
from rx.subjects import Subject
from pyspark import AccumulatorParam, TaskContext

class ResultsParam(AccumulatorParam, Subject):
    """An observable accumulator which collects task results"""
    def zero(self, v):
        return []

    def addInPlace(self, acc1, acc2):
        # This is executed on the workers so we have to
        # merge the results
        if (TaskContext.get() is not None and
                TaskContext.get().partitionId() is not None):
            acc1.extend(acc2)
            return acc1
        else:
            # This is executed on the driver so we discard the results
            # and publish to self instead
            for x in acc2:
                self.on_next(x)
            return []
Simple Spark application (Python 3.x):
# main.py
import time
from pyspark import SparkContext, TaskContext

sc = SparkContext(master="local[4]")
sc.addPyFile("results_param.py")

from results_param import ResultsParam

# Define accumulator
acc = sc.accumulator([], ResultsParam())

# Dummy subscriber
acc.accum_param.subscribe(print)

def process(x):
    """Identity process"""
    result = x
    acc.add([result])

    # Add some delay
    time.sleep(5)
    return result

sc.parallelize(range(32), 8).foreach(process)
This is relatively simple, but there is a risk of overwhelming the driver if multiple tasks finish at the same time, so you have to significantly oversubscribe driver resources (proportionally to the parallelism level and the expected size of the task results).
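One way to soften such bursts (just a sketch, not part of the original proof of concept) is to have the accumulator's subscriber only enqueue incoming results and drain the queue in batches on a separate driver thread; handle_batch below is a hypothetical placeholder for the real downstream logic.

import queue
import threading

results_queue = queue.Queue()

# Replace the print subscriber above with a queue writer
acc.accum_param.subscribe(results_queue.put)

def handle_batch(batch):
    # Hypothetical downstream handler; print stands in for real work
    print("Got %d results: %s" % (len(batch), batch))

def drain():
    while True:
        # Block for the first result, then grab whatever else is already queued
        batch = [results_queue.get()]
        try:
            while True:
                batch.append(results_queue.get_nowait())
        except queue.Empty:
            pass
        handle_batch(batch)

threading.Thread(target=drain, daemon=True).start()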
Use Scala runJob directly (not Python friendly).
Spark actually fetches the results asynchronously, and it is not required to wait for all the data to be processed, as long as you don't care about the order. You can see, for example, the implementation of Scala reduce.
It should be possible to use this mechanism to push partitions to the Python process as they come, but I haven't tried it yet.