I have a data frame in PySpark like the one below.
df.show()
+---+-------+----+
| id|   type|s_id|
+---+-------+----+
|  1|    ios|  11|
|  1|    ios|  12|
...
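Only the first two rows are shown above; for a runnable example, here is a minimal sketch that rebuilds a dataframe consistent with the expected output further below. Every row beyond the two displayed is an assumption inferred from that output.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Only the first two rows appear in the question; the remaining rows
# are assumptions inferred from the expected pivoted output below.
df = spark.createDataFrame(
    [(1, "ios", 11), (1, "ios", 12), (1, "ios", 13),
     (1, "android", 15), (1, "android", 16), (1, "android", 17),
     (2, "android", 18), (2, "ios", 21)],
    ["id", "type", "s_id"],
)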
The following logic should get you your desired result. A Window function is used to generate a row number for each group of id and type, ordered by s_id. The generated row number is used to filter to the first three rows per group and is concatenated with type to form the new column names. Finally, grouping and pivoting gives the desired output.
from pyspark.sql import Window
from pyspark.sql import functions as f

# Rank rows within each (id, type) group, ordered by s_id.
windowSpec = Window.partitionBy("id", "type").orderBy("s_id")

# Keep the first three rows per group, turn type + rank into
# column names (e.g. ios1), then pivot one column per name.
df.withColumn("ranks", f.row_number().over(windowSpec)) \
    .filter(f.col("ranks") < 4) \
    .withColumn("type", f.concat(f.col("type"), f.col("ranks"))) \
    .drop("ranks") \
    .groupBy("id") \
    .pivot("type") \
    .agg(f.first("s_id")) \
    .show(truncate=False)
which should give you
+---+--------+--------+--------+----+----+----+
|id |android1|android2|android3|ios1|ios2|ios3|
+---+--------+--------+--------+----+----+----+
|1 |15 |16 |17 |11 |12 |13 |
|2 |18 |null |null |21 |null|null|
+---+--------+--------+--------+----+----+----+
For your updated input, you just need an additional filter to drop rows with an empty type:
df.withColumn("ranks", f.row_number().over(windowSpec)) \
.filter(f.col("ranks") < 4) \
.filter(f.col("type") != "") \
.withColumn("type", f.concat(f.col("type"), f.col("ranks"))) \
.drop("ranks") \
.groupBy("id") \
.pivot("type") \
.agg(f.first("s_id")) \
.show(truncate=False)
which would give you
+---+--------+----+----+
|id |android1|ios1|ios2|
+---+--------+----+----+
|1 |15 |11 |12 |
|2 |18 |21 |null|
+---+--------+----+----+
This dataframe now lacks the android2, android3 and ios3 columns, because they are not present in your updated input data. You can add them using the withColumn API and populate them with null values, as sketched below.
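A minimal sketch, assuming the pivoted result from the chain above is bound to a variable named pivoted (that name and the long cast are assumptions, not part of the original code):

from pyspark.sql import functions as f

# 'pivoted' is assumed to hold the dataframe produced by the
# groupBy/pivot chain above.
for missing in ["android2", "android3", "ios3"]:
    if missing not in pivoted.columns:
        # Add the absent column filled with nulls; the long cast
        # matches the integer s_id values (an assumption).
        pivoted = pivoted.withColumn(missing, f.lit(None).cast("long"))

pivoted.show(truncate=False)

Alternatively, if you know the full set of pivoted column names up front, you can pass it explicitly as the second argument to pivot, e.g. .pivot("type", ["ios1", "ios2", "ios3", "android1", "android2", "android3"]). That guarantees every listed column appears (filled with nulls where there is no data) and also spares Spark an extra pass to compute the distinct pivot values.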