How to use collect_set and collect_list functions in windowed aggregation in Spark 1.6?

前端未结
关注
 1  1665
庸人自扰 2020-12-24 15:21
In Spark 1.6.0 / Scala, is there an opportunity to get collect_list(\"colC\") or collect_set(\"colC\").over(Window.partitionBy(\"colA\").orderBy(\"colB\")

      
      
        
          1条回答        

        
                    
            
            
                         
                
              
              
                
                   有刺的猬
                                             
                
                
                (楼主)
            
              
              
                2020-12-24 16:06
              

            
            
                        
Given that you have dataframe as

+----+----+----+
|colA|colB|colC|
+----+----+----+
|1   |1   |23  |
|1   |2   |63  |
|1   |3   |31  |
|2   |1   |32  |
|2   |2   |56  |
+----+----+----+


You can Window functions by doing the following

import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._
df.withColumn("colD", collect_list("colC").over(Window.partitionBy("colA").orderBy("colB"))).show(false)


Result: 

+----+----+----+------------+
|colA|colB|colC|colD        |
+----+----+----+------------+
|1   |1   |23  |[23]        |
|1   |2   |63  |[23, 63]    |
|1   |3   |31  |[23, 63, 31]|
|2   |1   |32  |[32]        |
|2   |2   |56  |[32, 56]    |
+----+----+----+------------+


Similar is the result for collect_set as well. But the order of elements in the final set will not be in order as with collect_list

df.withColumn("colD", collect_set("colC").over(Window.partitionBy("colA").orderBy("colB"))).show(false)
+----+----+----+------------+
|colA|colB|colC|colD        |
+----+----+----+------------+
|1   |1   |23  |[23]        |
|1   |2   |63  |[63, 23]    |
|1   |3   |31  |[63, 31, 23]|
|2   |1   |32  |[32]        |
|2   |2   |56  |[56, 32]    |
+----+----+----+------------+


If you remove orderBy as below

df.withColumn("colD", collect_list("colC").over(Window.partitionBy("colA"))).show(false)


result would be 

+----+----+----+------------+
|colA|colB|colC|colD        |
+----+----+----+------------+
|1   |1   |23  |[23, 63, 31]|
|1   |2   |63  |[23, 63, 31]|
|1   |3   |31  |[23, 63, 31]|
|2   |1   |32  |[32, 56]    |
|2   |2   |56  |[32, 56]    |
+----+----+----+------------+


I hope the answer is helpful
    
             
                                                        
            
            
              
                
                0
              
                   
                
               讨论(0)
              
                                                  
              
              
                          
             
       
          
              
                                    
                         
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
                              			
        
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复