PySpark Numeric Window Group By


I'd like to be able to have Spark group by a step size, as opposed to just single values. Is there anything in Spark similar to PySpark 2.x's window function?

1 Answer
  • 2020-12-19 16:23

    You can reuse the timestamp window function and express the parameters in seconds. A tumbling window:

    from pyspark.sql.functions import col, window

    # Example data inferred from the output below (assumes an active SparkSession `spark`)
    df = spark.createDataFrame([(10,), (11,), (12,), (13,)], ["foo"])

    df.withColumn(
        "window",
        window(
            col("foo").cast("timestamp"),
            windowDuration="2 seconds"
        ).cast("struct<start:bigint,end:bigint>")
    ).show()
    
    # +---+-------+              
    # |foo| window|
    # +---+-------+
    # | 10|[10,12]|
    # | 11|[10,12]|
    # | 12|[12,14]|
    # | 13|[12,14]|
    # +---+-------+
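
    The step size is simply whatever duration you express in seconds, so any bucket width works the same way. A quick sketch (my addition, using the same assumed `df`) with a 3-unit step:

    df.withColumn(
        "window",
        window(
            col("foo").cast("timestamp"),
            windowDuration="3 seconds"  # 3-unit buckets aligned to multiples of 3
        ).cast("struct<start:bigint,end:bigint>")
    ).show()

    # foo values 10-13 should fall into the buckets [9,12] and [12,15]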
    

    A rolling (sliding) window:

    df.withColumn(
        "window",
        window(
            col("foo").cast("timestamp"),
            windowDuration="2 seconds",
            slideDuration="1 seconds"
        ).cast("struct<start:bigint,end:bigint>")
    ).show()
    
    # +---+-------+
    # |foo| window|
    # +---+-------+
    # | 10| [9,11]|
    # | 10|[10,12]|
    # | 11|[10,12]|
    # | 11|[11,13]|
    # | 12|[11,13]|
    # | 12|[12,14]|
    # | 13|[12,14]|
    # | 13|[13,15]|
    # +---+-------+
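
    With the sliding window each row lands in two buckets, so grouping on the start bound counts the overlapping windows. Another sketch of mine, with the same assumed `df`:

    sliding = window(
        col("foo").cast("timestamp"), "2 seconds", "1 seconds"
    ).cast("struct<start:bigint,end:bigint>")

    df.groupBy(sliding.start.alias("start")).count().orderBy("start").show()

    # From the rows above, starts 9 and 13 should count 1, starts 10-12 should count 2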
    

    Using groupBy and start:

    w = window(col("foo").cast("timestamp"), "2 seconds").cast("struct<start:bigint,end:bigint>")
    start = w.start.alias("start")
    df.groupBy(start).count().show()
    
    # +-----+-----+
    # |start|count|
    # +-----+-----+
    # |   10|    2|
    # |   12|    2|
    # +-----+-----+
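
    For completeness, the same counts can be produced without the timestamp cast by bucketing with plain integer arithmetic. This is my addition, not part of the original answer; change the divisor to change the step size:

    from pyspark.sql.functions import floor

    df.groupBy(
        (floor(col("foo") / 2) * 2).alias("start")  # floor to the nearest multiple of the step
    ).count().show()

    # Should match the result above: start 10 -> 2 rows, start 12 -> 2 rows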
    