How to compute cumulative sum using Spark

有刺的猬 2020-12-01 08:51

I have an RDD of (String, Int) which is sorted by key. How can I compute the cumulative sum of the values, i.e. produce ("c1",6), ("c2",9), ("c3",13)?

val data = Array((\"c1\",6), (\"c2\",3),(\"c3\",4))
val rdd = sc.parallelize(data).sortByKey


        
5 Answers
  • 2020-12-01 08:58

    You may want to try window functions with rowsBetween. Hope it's still helpful.

    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.expressions.Window

    val data = Array(("c1", 6), ("c2", 3), ("c3", 4))
    val df = sc.parallelize(data).sortByKey().toDF("c", "v")

    // a frame from the first row up to the current row gives the running (cumulative) total
    val w = Window.orderBy("c")
    val r = df.select($"c", sum($"v").over(w.rowsBetween(Window.unboundedPreceding, Window.currentRow)).alias("cs"))
    display(r)
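
    With that frame, the result on the sample data should look roughly like this (a sketch of the expected output; note display() is Databricks-specific, while show() works in any Spark shell):

    r.show()
    // expected (values computed by hand: 6, 6+3=9, 6+3+4=13):
    // +---+---+
    // |  c| cs|
    // +---+---+
    // | c1|  6|
    // | c2|  9|
    // | c3| 13|
    // +---+---+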
    
  • 2020-12-01 09:00

    Here is a solution in PySpark. Internally it's essentially the same as @zero323's Scala solution, but it provides a general-purpose function with a Spark-like API.

    import numpy as np
    def cumsum(rdd, get_summand):
        """Given an ordered rdd of items, computes cumulative sum of
        get_summand(row), where row is an item in the RDD.
        """
        def cumsum_in_partition(iter_rows):
            total = 0
            for row in iter_rows:
                total += get_summand(row)
                yield (total, row)
        rdd = rdd.mapPartitions(cumsum_in_partition)
    
        def last_partition_value(iter_rows):
            final = None
            for cumsum, row in iter_rows:
                final = cumsum
            return (final,)
    
        partition_sums = rdd.mapPartitions(last_partition_value).collect()
        partition_cumsums = list(np.cumsum(partition_sums))
        partition_cumsums = [0] + partition_cumsums
        partition_cumsums = sc.broadcast(partition_cumsums)
    
        def add_sums_of_previous_partitions(idx, iter_rows):
            return ((cumsum + partition_cumsums.value[idx], row)
                for cumsum, row in iter_rows)
        rdd = rdd.mapPartitionsWithIndex(add_sums_of_previous_partitions)
        return rdd
    
    # test for correctness by summing numbers, with and without Spark
    rdd = sc.range(10000,numSlices=10).sortBy(lambda x: x)
    cumsums, values = zip(*cumsum(rdd,lambda x: x).collect())
    assert all(cumsums == np.cumsum(values))
    
  • 2020-12-01 09:04
    1. Compute partial results for each partition:

      val partials = rdd.mapPartitionsWithIndex((i, iter) => {
        val (keys, values) = iter.toSeq.unzip
        val sums  = values.scanLeft(0)(_ + _)
        Iterator((keys.zip(sums.tail), sums.last))
      })
      
    2. Collect partial sums:

      val partialSums = partials.values.collect
      
    3. Compute cumulative sum over partitions and broadcast it:

      val sumMap = sc.broadcast(
        (0 until rdd.partitions.size)
          .zip(partialSums.scanLeft(0)(_ + _))
          .toMap
      )
      
    4. Compute final results (a quick check on the sample data follows the steps):

      val result = partials.keys.mapPartitionsWithIndex((i, iter) => {
        val offset = sumMap.value(i)
        if (iter.isEmpty) Iterator()
        else iter.next.map{case (k, v) => (k, v + offset)}.toIterator
      })
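
    To sanity-check the steps on the rdd from the question (a minimal sketch, reusing the result RDD built in step 4):

      result.collect().foreach(println)
      // sortByKey range-partitions the data, so partitions come back in key order:
      // (c1,6)
      // (c2,9)
      // (c3,13)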
      
  • 2020-12-01 09:10

    Spark has built-in support for Hive ANALYTICS/WINDOWING functions, and the cumulative sum can be computed easily with them.

    See the Hive wiki on ANALYTICS/WINDOWING functions.

    Example:

    Assuming you have a sqlContext object:

    val datardd = sqlContext.sparkContext.parallelize(Seq(("a",1),("b",2), ("c",3),("d",4),("d",5),("d",6)))
    import sqlContext.implicits._
    
    //Register as test table
    datardd.toDF("id","val").createOrReplaceTempView("test")
    
    //Calculate Cumulative sum
    sqlContext.sql("select id,val, " +
      "SUM(val) over (  order by id  rows between unbounded preceding and current row ) cumulative_Sum " +
      "from test").show()
    

    This approach triggers the warning below, because the window has no PARTITION BY clause and Spark moves all data to a single partition. If an executor runs out of memory on a huge dataset, tune the job's memory parameters accordingly.

    WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation

    I hope this helps.

  • 2020-12-01 09:18

    I came across a similar problem and implemented @Paul's solution. I wanted to compute a cumulative sum over an integer frequency table sorted by key (the integer), and ran into a minor problem with np.cumsum(partition_sums): unsupported operand type(s) for +=: 'int' and 'NoneType'.

    If the key range is large enough, every partition is likely to contain something, so there are no None values. However, if the range is much smaller than the count while the number of partitions stays the same, some partitions end up empty and yield None as their last value. Here is the modified solution:

    import numpy as np
    import pandas as pd

    def cumsum(rdd, get_summand):
        """Given an ordered rdd of items, computes cumulative sum of
        get_summand(row), where row is an item in the RDD.
        """
        def cumsum_in_partition(iter_rows):
            total = 0
            for row in iter_rows:
                total += get_summand(row)
                yield (total, row)
        rdd = rdd.mapPartitions(cumsum_in_partition)
        def last_partition_value(iter_rows):
            final = None
            for cumsum, row in iter_rows:
                final = cumsum
            return (final,)
        partition_sums = rdd.mapPartitions(last_partition_value).collect()
        # partition_cumsums = list(np.cumsum(partition_sums))
    
        #----from here are the changed lines
        partition_sums = [x for x in partition_sums if x is not None] 
        temp = np.cumsum(partition_sums)
        partition_cumsums = list(temp)
        #----
    
        partition_cumsums = [0] + partition_cumsums   
        partition_cumsums = sc.broadcast(partition_cumsums)
        def add_sums_of_previous_partitions(idx, iter_rows):
            return ((cumsum + partition_cumsums.value[idx], row)
                for cumsum, row in iter_rows)
        rdd = rdd.mapPartitionsWithIndex(add_sums_of_previous_partitions)
        return rdd
    
    #test on random integer frequency
    x = np.random.randint(10, size=1000)
    D = sqlCtx.createDataFrame(pd.DataFrame(x.tolist(),columns=['D']))
    c = D.groupBy('D').count().orderBy('D')
    c_rdd =  c.rdd.map(lambda x:x['count'])
    cumsums, values = zip(*cumsum(c_rdd,lambda x: x).collect())
    