Pyspark: reshape data without aggregation

后端未结

关注

 2  397

I want to reshape my data from 4x3 to 2x2 in pyspark without aggregating. My current output is the following:

co


                      
              相关标签:


      
      
        
          2条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  不要未来只要你来        
                
              
                            
                2021-01-16 18:16
              
            
            
                                                                       

Using groupby and pivot is the natural way to do this, but if you want to avoid any aggregation you can achieve this with a filter and join

import pyspark.sql.functions as f

df.where("value_HIGH = 1").select("FAULTY", f.col("count").alias("value_HIGH_1"))\
    .join(
        df.where("value_HIGH = 0").select("FAULTY", f.col("count").alias("value_HIGH_1")),
        on="FAULTY"
    )\
    .show()
#+------+------------+------------+
#|FAULTY|value_HIGH_1|value_HIGH_1|
#+------+------------+------------+
#|     0|          12|         140|
#|     1|          21|         141|
#+------+------------+------------+

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  情书的邮戳        
                
              
                            
                2021-01-16 18:22
              
            
            
                                                                       
You can use pivot with a fake maximum aggregation (since you have only one element for each group):

import pyspark.sql.functions as F
df.groupBy('FAULTY').pivot('value_HIGH').agg(F.max('count')).selectExpr(
    'FAULTY', '`1` as value_high_1', '`0` as value_high_0'
).show()
+------+------------+------------+
|FAULTY|value_high_1|value_high_0|
+------+------------+------------+
|     0|          12|         140|
|     1|          21|         141|
+------+------------+------------+

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
                             
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复