Is there a way to stream results to driver without waiting for all partitions to complete execution?

猫巷女王i 2021-01-15 13:45

Is there a way to stream results to the driver without waiting for all partitions to complete execution?

I am new to Spark so please point me in the right direction

1 answer
  • 2021-01-15 14:23

    Generally speaking, this is not something you would normally do in Spark. Typically we try to keep the amount of data passed through the driver to a minimum. There are two main reasons for that:

    • Passing data to the Spark driver can easily become a bottleneck in your application.
    • The driver is effectively a single point of failure in batch applications.

    In the normal case you'd just let the job run, write the results to persistent storage, and apply any further processing steps to those results afterwards.

    If you want to be able to access the results iteratively you have a few options:

    • Use Spark Streaming. Create a simple process which pushes data to the cluster and then collect each batch. It is simple, reliable, tested, and doesn't require any additional infrastructure.
    • Process the data using foreach / foreachPartition, push results to an external messaging system as they are produced, and use another process to consume and write them. This requires an additional component but can be conceptually easier (you can apply back pressure, buffer the results, and move the merging logic out of the driver to minimize the risk of application failure).
    • Hack Spark accumulators. Accumulators are updated when a task finishes, so you can process the accumulated incoming data in discrete batches.

      Warning: The following code is just a proof of concept. It hasn't been properly tested and is most likely highly unreliable.

      Example AccumulatorParam using RxPY:

      # results_param.py
      
      # RxPY 1.x import path; RxPY 3.x moved Subject to the rx.subject module
      from rx.subjects import Subject
      from pyspark import AccumulatorParam, TaskContext
      
      class ResultsParam(AccumulatorParam, Subject):
          """An observable accumulator which collects task results"""
          def zero(self, v):
              return []
      
          def addInPlace(self, acc1, acc2):
              # This is executed on the workers so we have to
              # merge the results
              if (TaskContext.get() is not None and
                      TaskContext.get().partitionId() is not None):
                  acc1.extend(acc2)
                  return acc1
              else:
                  # This is executed on the driver so we discard the results
                  # and publish to self instead
                  for x in acc2:
                      self.on_next(x)
                  return []
      

      Simple Spark application (Python 3.x):

      # main.py
      
      import time
      from pyspark import SparkContext
      
      sc = SparkContext(master="local[4]")
      sc.addPyFile("results_param.py")
      
      from results_param import ResultsParam
      
      # Define accumulator
      acc = sc.accumulator([], ResultsParam())
      
      # Dummy subscriber 
      acc.accum_param.subscribe(print)
      
      def process(x):
          """Identity process"""
          result = x
          acc.add([result])
      
          # Add some delay
          time.sleep(5)
      
          return result
      
      sc.parallelize(range(32), 8).foreach(process)
      

      This is relatively simple, but there is a risk of overwhelming the driver if multiple tasks finish at the same time, so you have to significantly over-provision driver resources (in proportion to the parallelism level and the expected size of each task result).

    • Use Scala runJob directly (not Python friendly).

      Spark actually fetches the results asynchronously, so you don't have to wait for all the data to be processed, as long as you don't care about the order. See, for example, the Scala implementation of reduce.
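
To illustrate the idea only: Scala's SparkContext.runJob has an overload that accepts a result handler, which is called with the partition index and that partition's result as each task finishes. The sketch below is a plain-Python analogy, not Spark code. It uses concurrent.futures in place of the scheduler, and process_partition, stream_partition_results, merge and the sleep delays are hypothetical names and values chosen for the demo.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_partition(index, rows):
    """Stand-in for a Spark task; later partitions are given shorter delays."""
    time.sleep(0.05 * (4 - index))  # artificial, demo-only delay
    return index, sum(rows)


def stream_partition_results(partitions, handler, max_workers=4):
    """Call handler(index, result) as soon as each partition completes."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(process_partition, i, rows)
            for i, rows in enumerate(partitions)
        ]
        # as_completed yields futures in completion order, not submission order.
        for future in as_completed(futures):
            handler(*future.result())


partitions = [range(0, 8), range(8, 16), range(16, 24), range(24, 32)]
arrivals = []  # (partition index, partial sum) in the order results arrived


def merge(index, partial_sum):
    # Runs in the calling ("driver") thread, one result at a time.
    arrivals.append((index, partial_sum))
    print(f"partition {index} finished: {partial_sum}")


stream_partition_results(partitions, merge)
total = sum(partial for _, partial in arrivals)
print("total:", total)
```

Because the merge here (a sum) does not depend on arrival order, each partial result can be folded in as soon as it lands, which is the same property reduce relies on.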

      It should be possible to use this mechanism to push partitions to the Python process as they come, but I haven't tried it yet.
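
The foreach / foreachPartition option above can be sketched as a producer/consumer pair. This is a minimal, thread-based simulation rather than Spark code: queue.Queue stands in for the external messaging system, and process_partition, consume and SENTINEL are hypothetical names. In a real job the partition function runs in executor processes, so an in-process queue would not work there; you would replace broker.put with a client call to whatever message broker you actually use.

```python
import queue
import threading

SENTINEL = object()  # hypothetical end-of-stream marker


def process_partition(rows, broker):
    """Body you would pass to foreachPartition: push each result as produced."""
    for row in rows:
        broker.put(row * row)  # blocks while the queue is full (back pressure)


def consume(broker, sink):
    """Separate consumer: drains the broker and writes results to a sink."""
    while True:
        item = broker.get()
        if item is SENTINEL:
            break
        sink.append(item)


broker = queue.Queue(maxsize=4)  # small bound so back pressure actually kicks in
sink = []
consumer = threading.Thread(target=consume, args=(broker, sink))
consumer.start()

# Three simulated partitions, each handled by its own producer thread.
partitions = [range(0, 4), range(4, 8), range(8, 12)]
producers = [
    threading.Thread(target=process_partition, args=(rows, broker))
    for rows in partitions
]
for producer in producers:
    producer.start()
for producer in producers:
    producer.join()

broker.put(SENTINEL)  # every producer has finished; tell the consumer to stop
consumer.join()
print(sorted(sink))
```

The bounded queue is what gives you back pressure: producers stall instead of flooding the consumer, and the driver never has to hold the results at all.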
