Pivot String column on PySpark DataFrame

無奈伤痛 asked 2020-12-01 02:31

I have a simple dataframe like this:

rdd = sc.parallelize(
    [
        (0, "A", 223, "201603", "PORT"),
        (0, "A", 22, "201602", "PORT"),
        (0, "A", 422, "201601", "DOCK"),
        (1, "B", 3213, "201602", "DOCK"),
        (1, "B", 3213, "201601", "PORT"),
        (2, "C", 2321, "201601", "DOCK")
    ]
)
df_data = sqlContext.createDataFrame(rdd, ["id", "type", "cost", "date", "ship"])

I need to pivot it on date, so that each unique date becomes its own column holding the ship values.

1 Answer
  • 2020-12-01 02:51

    Assuming that (id, type, date) combinations are unique and your only goal is pivoting and not aggregation, you can use first (or any other function not restricted to numeric values):

    from pyspark.sql.functions import first

    (df_data
        .groupby(df_data.id, df_data.type)
        .pivot("date")          # one output column per distinct date
        .agg(first("ship"))     # first works on strings, unlike sum/avg
        .show())
    
    ## +---+----+------+------+------+
    ## | id|type|201601|201602|201603|
    ## +---+----+------+------+------+
    ## |  2|   C|  DOCK|  null|  null|
    ## |  0|   A|  DOCK|  PORT|  PORT|
    ## |  1|   B|  PORT|  DOCK|  null|
    ## +---+----+------+------+------+
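
    If you already know the distinct date values, pivot also accepts them as an explicit list, which spares Spark the extra pass it otherwise needs to collect them. A minimal variant of the query above, assuming the three dates from the sample data:

    from pyspark.sql.functions import first

    (df_data
        .groupby("id", "type")
        .pivot("date", ["201601", "201602", "201603"])  # explicit values: no distinct scan
        .agg(first("ship"))
        .show())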
    

    If this assumption is not correct you'll have to pre-aggregate your data, for example by taking the most common ship value. Because structs are compared field by field, max(struct("count", "ship")) keeps the row with the highest count:

    from pyspark.sql.functions import max, struct

    (df_data
        .groupby("id", "type", "date", "ship")
        .count()                            # how often each combination occurs
        .groupby("id", "type")
        .pivot("date")
        .agg(max(struct("count", "ship")))  # the struct with the highest count wins
        .show())
    
    ## +---+----+--------+--------+--------+
    ## | id|type|  201601|  201602|  201603|
    ## +---+----+--------+--------+--------+
    ## |  2|   C|[1,DOCK]|    null|    null|
    ## |  0|   A|[1,DOCK]|[1,PORT]|[1,PORT]|
    ## |  1|   B|[1,PORT]|[1,DOCK]|    null|
    ## +---+----+--------+--------+--------+
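
    Note that the pivoted cells are now count/ship structs rather than plain strings. One possible follow-up, sketched with plain DataFrame column selection, to unwrap each struct back to just the ship value after the pivot:

    from pyspark.sql.functions import col, max, struct

    pivoted = (df_data
        .groupby("id", "type", "date", "ship")
        .count()
        .groupby("id", "type")
        .pivot("date")
        .agg(max(struct("count", "ship"))))

    # Pull the ship field out of every pivoted struct column; null cells stay null
    pivoted.select(
        "id", "type",
        *[col(c)["ship"].alias(c) for c in pivoted.columns if c not in ("id", "type")]
    ).show()

    This yields the same layout as the first table, but built from the most common ship value per (id, type, date).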
    