Spark and SparkSQL: How to imitate window function?

Asked by 野趣味 on 2020-12-06 02:36

Description

Given a dataframe df

id |       date
---------------
 1 | 2015-09-01
 2 | 2015-09-01
 1 | 2015-09-03
 1 | 2015-09-04
 2 | 2015-09-04

I want to create a running counter (an index) within each id group, ordered by date in that group, i.e. something like this:

id |       date | counter
-------------------------
 1 | 2015-09-01 |       1
 1 | 2015-09-03 |       2
 1 | 2015-09-04 |       3
 2 | 2015-09-01 |       1
 2 | 2015-09-04 |       2

This is exactly what a window function such as row_number would produce; how can I imitate it without window functions?

3 Answers
  • 2020-12-06 02:50

    I totally agree that window functions for DataFrames are the way to go if you have Spark (>= 1.5). But if you are really stuck on an older version (e.g. 1.4.1), here is a hacky way to solve this:

    import org.apache.spark.sql.functions.count   // brings count() into scope

    val df = sc.parallelize((1, "2015-09-01") :: (2, "2015-09-01") :: (1, "2015-09-03") :: (1, "2015-09-04") :: (2, "2015-09-04") :: Nil)
               .toDF("id", "date")

    val dfDuplicate = df.selectExpr("id as idDup", "date as dateDup")
    val dfWithCounter = df.join(dfDuplicate, $"id" === $"idDup")
                          .where($"date" >= $"dateDup")          // keep only rows up to and including the current date
                          .groupBy($"id", $"date")
                          .agg(count($"idDup").as("counter"))
                          .select($"id", $"date", $"counter")
    

    Now if you do dfWithCounter.show

    You will get:

    +---+----------+-------+                                                        
    | id|      date|counter|
    +---+----------+-------+
    |  1|2015-09-01|      1|
    |  1|2015-09-04|      3|
    |  1|2015-09-03|      2|
    |  2|2015-09-01|      1|
    |  2|2015-09-04|      2|
    +---+----------+-------+
    

    Note that the dates are not sorted, but the counter is correct. You can also reverse the direction of the counter by changing the >= to <= in the where clause.
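
    If you prefer SQL over the DataFrame DSL, the same self-join trick can be written as a query. A rough sketch (the temp-table name events is just an illustrative choice, and if your SQL parser treats date as a keyword you may need to backtick or rename that column):

    // Hypothetical sketch: the same counter expressed as a SQL self-join ("events" is a made-up name)
    df.registerTempTable("events")

    val dfWithCounterSql = sqlContext.sql("""
      SELECT a.id, a.date, COUNT(b.id) AS counter
      FROM events a
      JOIN events b
        ON a.id = b.id AND a.date >= b.date
      GROUP BY a.id, a.date
    """)
    dfWithCounterSql.show()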

  • 2020-12-06 03:07

    You can do this with RDDs. Personally, I find the RDD API makes a lot more sense; I don't always want my data to be 'flat' like a DataFrame.

    val df = sqlContext.sql("select 1, '2015-09-01'"
        ).unionAll(sqlContext.sql("select 2, '2015-09-01'")
        ).unionAll(sqlContext.sql("select 1, '2015-09-03'")
        ).unionAll(sqlContext.sql("select 1, '2015-09-04'")
        ).unionAll(sqlContext.sql("select 2, '2015-09-04'"))
    
    // dataframe as an RDD (of Row objects)
    df.rdd 
      // grouping by the first column of the row
      .groupBy(r => r(0)) 
      // map each group - an Iterable[Row] - to a list and sort by the second column
      .map(g => g._2.toList.sortBy(row => row(1).toString))     
      .collect()
    

    The above gives a result like the following:

    Array[List[org.apache.spark.sql.Row]] = 
    Array(
      List([1,2015-09-01], [1,2015-09-03], [1,2015-09-04]), 
      List([2,2015-09-01], [2,2015-09-04]))
    

    If you want the position within the 'group' as well, you can use zipWithIndex.

    df.rdd.groupBy(r => r(0)).map(g => 
        g._2.toList.sortBy(row => row(1).toString).zipWithIndex).collect()
    
    Array[List[(org.apache.spark.sql.Row, Int)]] = Array(
      List(([1,2015-09-01],0), ([1,2015-09-03],1), ([1,2015-09-04],2)),
      List(([2,2015-09-01],0), ([2,2015-09-04],1)))
    

    You could flatten this back to a simple list/array of Row objects using flatMap, but if you still need to perform anything on the 'group', that won't be a great idea.
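
    For what it's worth, since the question ultimately wants a flat counter column, here is a rough sketch of that flatten (using the positional accessors r(0)/r(1) as above; the 1-based counter and the tuple layout are just my choice):

    // Sketch: flatten the grouped, sorted, zipped groups back into one (id, date, counter) tuple per record
    val withCounter = df.rdd
      .groupBy(r => r(0))
      .flatMap { case (_, rows) =>
        rows.toList
            .sortBy(row => row(1).toString)
            .zipWithIndex
            .map { case (row, idx) => (row(0), row(1), idx + 1) }   // zipWithIndex is 0-based, so add 1
      }

    withCounter.collect()
    // e.g. Array((1,2015-09-01,1), (1,2015-09-03,2), (1,2015-09-04,3), (2,2015-09-01,1), (2,2015-09-04,2))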

    The downside to using RDDs like this is that it's tedious to convert from a DataFrame to an RDD and back again.

  • 2020-12-06 03:12

    You can use a HiveContext for local DataFrames as well and, unless you have a very good reason not to, it is probably a good idea anyway. It is the default SQLContext available in the spark-shell and pyspark shells (as of now SparkR seems to use a plain SQLContext), and its parser is recommended by the Spark SQL and DataFrame Guide.

    import org.apache.spark.{SparkContext, SparkConf}
    import org.apache.spark.sql.hive.HiveContext
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.rowNumber
    
    object HiveContextTest {
      def main(args: Array[String]) {
        val conf = new SparkConf().setAppName("Hive Context")
        val sc = new SparkContext(conf)
        val sqlContext = new HiveContext(sc)
        import sqlContext.implicits._
    
        val df = sc.parallelize(
            ("foo", 1) :: ("foo", 2) :: ("bar", 1) :: ("bar", 2) :: Nil
        ).toDF("k", "v")
    
        val w = Window.partitionBy($"k").orderBy($"v")
        df.select($"k", $"v", rowNumber.over(w).alias("rn")).show
      }
    }
    
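    If everything is wired up correctly, the final show should print something along these lines (the row order may differ):

    +---+---+---+
    |  k|  v| rn|
    +---+---+---+
    |bar|  1|  1|
    |bar|  2|  2|
    |foo|  1|  1|
    |foo|  2|  2|
    +---+---+---+

    (In later Spark versions rowNumber was renamed to row_number, but the idea is the same.)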