How to retrieve all columns using the PySpark collect_list function

梦如初夏  2021-01-14 05:42

I have PySpark 2.0.1. I'm trying to groupby my data frame and retrieve the values for all of its fields. I found that

z=data1.groupby('


        
3 Answers
  •  隐瞒了意图╮
    2021-01-14 05:53

    Actually, we can do it in PySpark 2.2.

    First we create a constant column ("Temp"), group by that column, and call agg with the unpacked list of expressions *exprs, where each expression is a collect_list over one of the original columns.

    Below is the code:

    import pyspark.sql.functions as ftions

    def groupColumnData(df, columns):
        # Constant column puts every row into a single group
        df = df.withColumn("Temp", ftions.lit(1))
        # Build one collect_list expression per requested column
        exprs = [ftions.collect_list(colName) for colName in columns]
        df = df.groupby("Temp").agg(*exprs)
        # Drop the helper column and restore the original column names
        df = df.drop("Temp")
        df = df.toDF(*columns)
        return df
    

    Input Data:

    df.show()
    +---+---+---+
    |  a|  b|  c|
    +---+---+---+
    |  0|  1|  2|
    |  0|  4|  5|
    |  1|  7|  8|
    |  1|  8|  7|
    +---+---+---+
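
    For anyone reproducing this, here is a minimal sketch of how the sample frame above could be built (it assumes an existing SparkSession bound to the name spark, which is not part of the original answer):

    # Hypothetical setup: `spark` is assumed to be an existing SparkSession
    df = spark.createDataFrame(
        [(0, 1, 2), (0, 4, 5), (1, 7, 8), (1, 8, 7)],
        ["a", "b", "c"],
    )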
    

    Output Data:

    df.show()

    +------------+------------+------------+
    |           a|           b|           c|
    +------------+------------+------------+
    |[0, 0, 1, 1]|[1, 4, 7, 8]|[2, 5, 8, 7]|
    +------------+------------+------------+
    
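    The output above is what you get after calling the function on the full column list; equivalently, with a fresh name for the result (result is introduced here only for illustration):

    result = groupColumnData(df, df.columns)  # df.columns == ["a", "b", "c"]
    result.show()

    Note that collect_list does not guarantee the order of collected elements, so the values inside each list may come back in a different order depending on partitioning.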
