I have a data frame in PySpark like the one below.
df.show()
+---+-------+----+
| id|   type|s_id|
+---+-------+----+
|  1|    ios|  11|
|  1|    ios|  12|
...
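Only the first two rows are shown above; for a runnable example, here is a minimal sketch that rebuilds a dataframe consistent with the expected output further below. Every row beyond the two displayed is an assumption inferred from that output.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Only the first two rows appear in the question; the remaining rows
# are assumptions inferred from the expected pivoted output below.
df = spark.createDataFrame(
    [(1, "ios", 11), (1, "ios", 12), (1, "ios", 13),
     (1, "android", 15), (1, "android", 16), (1, "android", 17),
     (2, "android", 18), (2, "ios", 21)],
    ["id", "type", "s_id"],
)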
The following logic should get you your desired result. A Window function is used to generate a row number for each group of id and type, ordered by s_id. The generated row number is used to filter to the first three rows per group and is concatenated with type to form the new column names. Finally, grouping and pivoting gives the desired output.
from pyspark.sql import Window
from pyspark.sql import functions as f

# Rank rows within each (id, type) group, ordered by s_id.
windowSpec = Window.partitionBy("id", "type").orderBy("s_id")

# Keep the first three rows per group, turn type + rank into
# column names (e.g. ios1), then pivot one column per name.
df.withColumn("ranks", f.row_number().over(windowSpec)) \
    .filter(f.col("ranks") < 4) \
    .withColumn("type", f.concat(f.col("type"), f.col("ranks"))) \
    .drop("ranks") \
    .groupBy("id") \
    .pivot("type") \
    .agg(f.first("s_id")) \
    .show(truncate=False)
which should give you
+---+--------+--------+--------+----+----+----+
|id |android1|android2|android3|ios1|ios2|ios3|
+---+--------+--------+--------+----+----+----+
|1 |15 |16 |17 |11 |12 |13 |
|2 |18 |null |null |21 |null|null|
+---+--------+--------+--------+----+----+----+
For your updated input, you just need an additional filter to drop rows with an empty type:
df.withColumn("ranks", f.row_number().over(windowSpec)) \
.filter(f.col("ranks") < 4) \
.filter(f.col("type") != "") \
.withColumn("type", f.concat(f.col("type"), f.col("ranks"))) \
.drop("ranks") \
.groupBy("id") \
.pivot("type") \
.agg(f.first("s_id")) \
.show(truncate=False)
which would give you
+---+--------+----+----+
|id |android1|ios1|ios2|
+---+--------+----+----+
|1 |15 |11 |12 |
|2 |18 |21 |null|
+---+--------+----+----+
This dataframe now lacks the android2, android3 and ios3 columns, because they are not present in your updated input data. You can add them using the withColumn API and populate them with null values, as sketched below.
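A minimal sketch, assuming the pivoted result from the chain above is bound to a variable named pivoted (that name and the long cast are assumptions, not part of the original code):

from pyspark.sql import functions as f

# 'pivoted' is assumed to hold the dataframe produced by the
# groupBy/pivot chain above.
for missing in ["android2", "android3", "ios3"]:
    if missing not in pivoted.columns:
        # Add the absent column filled with nulls; the long cast
        # matches the integer s_id values (an assumption).
        pivoted = pivoted.withColumn(missing, f.lit(None).cast("long"))

pivoted.show(truncate=False)

Alternatively, if you know the full set of pivoted column names up front, you can pass it explicitly as the second argument to pivot, e.g. .pivot("type", ["ios1", "ios2", "ios3", "android1", "android2", "android3"]). That guarantees every listed column appears (filled with nulls where there is no data) and also spares Spark an extra pass to compute the distinct pivot values.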