PySpark: reshape data without aggregation

情话喂你 2021-01-16 17:40

I want to reshape my data from 4x3 to 2x2 in PySpark without aggregating. My current output is the following:

    +------+----------+-----+
    |FAULTY|value_HIGH|count|
    +------+----------+-----+
    |     1|         0|  141|
    |     1|         1|   21|
    |     0|         0|  140|
    |     0|         1|   12|
    +------+----------+-----+

2 Answers
  • Using groupby and pivot is the natural way to do this, but if you want to avoid any aggregation you can achieve it with a filter and a join:

    import pyspark.sql.functions as f
    
    df.where("value_HIGH = 1").select("FAULTY", f.col("count").alias("value_HIGH_1"))\
        .join(
            df.where("value_HIGH = 0").select("FAULTY", f.col("count").alias("value_HIGH_0")),
            on="FAULTY"
        )\
        .show()
    #+------+------------+------------+
    #|FAULTY|value_HIGH_1|value_HIGH_0|
    #+------+------------+------------+
    #|     0|          12|         140|
    #|     1|          21|         141|
    #+------+------------+------------+
    
  • 2021-01-16 18:22

    You can use pivot with a dummy max aggregation (since each group contains exactly one element):

    import pyspark.sql.functions as F
    df.groupBy('FAULTY').pivot('value_HIGH').agg(F.max('count')).selectExpr(
        'FAULTY', '`1` as value_high_1', '`0` as value_high_0'
    ).show()
    +------+------------+------------+
    |FAULTY|value_high_1|value_high_0|
    +------+------------+------------+
    |     0|          12|         140|
    |     1|          21|         141|
    +------+------------+------------+
    