PySpark: pivot a DataFrame based on a condition

太阳男子 2021-01-27 22:56

I have a PySpark DataFrame like the one below.

df.show()

+---+-------+----+
| id|   type|s_id|
+---+-------+----+
|  1|    ios|  11|
|  1|    ios|  12|         


        
1 answer
  • 2021-01-27 23:09

    The following logic should produce the desired result.

    A window function generates a row number within each (id, type) group, ordered by s_id. The row number is used to keep only the first three rows per group and is then appended to type to form the pivot column names. Finally, grouping by id and pivoting on the new type values gives the desired output.

    from pyspark.sql import Window
    from pyspark.sql import functions as f

    # Number the rows within each (id, type) group, ordered by s_id
    windowSpec = Window.partitionBy("id", "type").orderBy("s_id")

    # Keep ranks 1 to 3, build names such as "ios1", then pivot on them
    df.withColumn("ranks", f.row_number().over(windowSpec))\
        .filter(f.col("ranks") < 4)\
        .withColumn("type", f.concat(f.col("type"), f.col("ranks")))\
        .drop("ranks")\
        .groupBy("id")\
        .pivot("type")\
        .agg(f.first("s_id"))\
        .show(truncate=False)
    

    which should give you

    +---+--------+--------+--------+----+----+----+
    |id |android1|android2|android3|ios1|ios2|ios3|
    +---+--------+--------+--------+----+----+----+
    |1  |15      |16      |17      |11  |12  |13  |
    |2  |18      |null    |null    |21  |null|null|
    +---+--------+--------+--------+----+----+----+
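To make the mechanics easier to follow, here is a minimal plain-Python sketch of the same steps (row number per group, keep the first three, append the rank to the type, pivot by id). It is not Spark code, and `pivot_top_n` is a hypothetical helper name. The sample rows are an assumption chosen to match the output shown above, since the input table in the question is cut off.

```python
from collections import defaultdict


def pivot_top_n(rows, n=3):
    """Pivot (id, type, s_id) rows into {id: {type + rank: s_id}}.

    Mirrors the window/filter/pivot chain: rank rows inside each
    (id, type) group by s_id, keep ranks 1..n, then spread them
    into one mapping per id.
    """
    # Collect the s_id values of every (id, type) group
    groups = defaultdict(list)
    for id_, type_, s_id in rows:
        groups[(id_, type_)].append(s_id)

    pivoted = defaultdict(dict)
    for (id_, type_), s_ids in groups.items():
        # sorted(...)[:n] plays the role of row_number() plus the rank filter
        for rank, s_id in enumerate(sorted(s_ids)[:n], start=1):
            pivoted[id_][f"{type_}{rank}"] = s_id
    return dict(pivoted)


# Assumed sample rows (hypothetical, not taken from the question)
rows = [
    (1, "ios", 11), (1, "ios", 12), (1, "ios", 13), (1, "ios", 14),
    (1, "android", 15), (1, "android", 16), (1, "android", 17),
    (2, "ios", 21), (2, "android", 18),
]
result = pivot_top_n(rows)
```

Keys that are absent for an id in `result` correspond to the null cells in the Spark output.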
    

    Answer for the edited part

    You just need one additional filter that drops the rows with an empty type:

    df.withColumn("ranks", f.row_number().over(windowSpec)) \
        .filter(f.col("ranks") < 4) \
        .filter(f.col("type") != "") \
        .withColumn("type", f.concat(f.col("type"), f.col("ranks"))) \
        .drop("ranks") \
        .groupBy("id") \
        .pivot("type") \
        .agg(f.first("s_id")) \
        .show(truncate=False)
    

    which would give you

    +---+--------+----+----+
    |id |andriod1|ios1|ios2|
    +---+--------+----+----+
    |1  |15      |11  |12  |
    |2  |18      |21  |null|
    +---+--------+----+----+
    

    This DataFrame now lacks the android2, android3 and ios3 columns because those values are not present in your updated input data. You can add them with the withColumn API and populate them with null values.
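A minimal sketch of adding the missing columns with `withColumn`, assuming the six column names from the first output are the ones you expect. `EXPECTED`, `missing_columns` and `add_missing_columns` are hypothetical names, and the `"int"` cast is an assumption about the type of `s_id`; adjust it to your schema.

```python
# Assumed full set of pivot columns (taken from the first output above)
EXPECTED = ["android1", "android2", "android3", "ios1", "ios2", "ios3"]


def missing_columns(existing, expected):
    """Return the expected column names that are absent, in expected order."""
    present = set(existing)
    return [name for name in expected if name not in present]


def add_missing_columns(df, expected=EXPECTED, dtype="int"):
    """Append every missing expected column as an all-null column."""
    # Imported here so the pure helper above can be used without Spark installed
    from pyspark.sql import functions as f

    for name in missing_columns(df.columns, expected):
        # lit(None) has no useful type by itself, so cast it explicitly
        df = df.withColumn(name, f.lit(None).cast(dtype))
    return df
```

Call `add_missing_columns(pivoted_df)` on the pivoted DataFrame, that is, on the result of `.agg(...)` before `.show()`. As an alternative, `pivot` accepts an explicit list of values as its second argument (for example `.pivot("type", EXPECTED)`), which should produce all-null columns for values that do not occur in the data.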
