How to join two DataFrames and change column for missing values?


余生分开走 2021-01-26 08:46

val df1 = sc.parallelize(Seq(
   ("a1",10,"ACTIVE","ds1"),
   ("a1",20,"ACTIVE","ds1"),
   ("a2",50,"ACTIVE","ds1"),
   ("a3",60,"ACTIVE","ds1"))).toDF("c1","c2","c3","c4")


      
      
        
3 answers

南笙 (original poster) 2021-01-26 09:02

First, a small thing. I use different names for the columns in df2:

val df2 = sc.parallelize(...).toDF("d1","d2","d3","d4")


No big deal, but this made things easier for me to reason about.

Now for the fun stuff. I am going to be a bit verbose for the sake of clarity:

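// Needed for lit and the $"col" syntax; assumes a SparkSession named spark
import org.apache.spark.sql.functions.lit
import spark.implicits._

val join = df1
  .join(df2, df1("c1") === df2("d1"), "inner")
  .select($"d1", $"d2", $"d3", lit("ds1").as("d4"))
  .dropDuplicates()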


Here I do the following:


1. Inner join between df1 and df2 on the c1 and d1 columns
2. Select the df2 columns and simply "hardcode" ds1 in the last column to replace ds2
3. Drop duplicates


In effect, this keeps only the rows of df2 whose key (d1) also appears in the c1 column of df1, relabelled as ds1.
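
Since the join above is only used as a filter, the same step could probably also be written as a left semi join, which keeps the matching df2 rows without multiplying them, so no dropDuplicates is needed (unless df2 itself holds duplicates). A minimal sketch, assuming a local SparkSession named spark; the df2 rows are made up for illustration because the original df2 data is not shown here:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

// Assumption: a throwaway local session; in spark-shell `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("semi-join-sketch")
  .getOrCreate()
import spark.implicits._

val df1 = Seq(
  ("a1", 10, "ACTIVE", "ds1"),
  ("a1", 20, "ACTIVE", "ds1"),
  ("a2", 50, "ACTIVE", "ds1"),
  ("a3", 60, "ACTIVE", "ds1")).toDF("c1", "c2", "c3", "c4")

// Hypothetical df2 contents, not taken from the question.
val df2 = Seq(
  ("a1", 10, "ACTIVE", "ds2"),
  ("a1", 30, "ACTIVE", "ds2"),
  ("a1", 40, "ACTIVE", "ds2"),
  ("a4", 70, "ACTIVE", "ds2")).toDF("d1", "d2", "d3", "d4")

// left_semi keeps each df2 row at most once if its key exists in df1,
// and returns only df2's columns.
val matched = df2
  .join(df1, df2("d1") === df1("c1"), "left_semi")
  .select($"d1", $"d2", $"d3", lit("ds1").as("d4"))

matched.show()
```

With this sample data the a4 row is dropped because a4 never appears in c1.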

Next I diff:

val diff = join
  .except(df1)
  .select($"d1", $"d2", lit("INACTIVE").as("d3"), $"d4")


This is a basic set operation that finds everything in join that is not in df1. These are the items to deactivate, so I select all the columns but replace the third with a hardcoded INACTIVE value.
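
To trace why the set difference leaves exactly the rows to deactivate, the join and diff steps can be modelled on plain Scala collections, with no Spark involved. This is only an illustrative sketch: rows are modelled as tuples, the names df1Rows, df2Rows, joined and diffRows are made up, and the df2 rows are hypothetical since the original df2 data is not shown here.

```scala
// A row is (key, value, status, source).
type Row = (String, Int, String, String)

val df1Rows: Seq[Row] = Seq(
  ("a1", 10, "ACTIVE", "ds1"),
  ("a1", 20, "ACTIVE", "ds1"),
  ("a2", 50, "ACTIVE", "ds1"),
  ("a3", 60, "ACTIVE", "ds1"))

// Hypothetical df2 contents, not taken from the question.
val df2Rows: Seq[Row] = Seq(
  ("a1", 10, "ACTIVE", "ds2"),
  ("a1", 20, "ACTIVE", "ds2"),
  ("a1", 30, "ACTIVE", "ds2"),
  ("a1", 40, "ACTIVE", "ds2"))

val df1Keys: Set[String] = df1Rows.map(_._1).toSet
val df1Set: Set[Row] = df1Rows.toSet

// "join" step: keep df2 rows whose key exists in df1, relabel the source, dedupe.
val joined: Seq[Row] = df2Rows
  .filter(r => df1Keys.contains(r._1))
  .map { case (k, v, s, _) => (k, v, s, "ds1") }
  .distinct

// "except" step: whole-row comparison, then mark the survivors INACTIVE.
val diffRows: Seq[Row] = joined
  .filterNot(df1Set.contains)
  .map { case (k, v, _, src) => (k, v, "INACTIVE", src) }

diffRows.foreach(println)
```

Note that the comparison is on whole rows: a df2 row that shares a key and value with df1 but carries a different status would not be removed by the difference.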

All that's left is to put them all together:

df1.union(diff)


This simply combines df1 with the table of deactivated values we calculated earlier to produce the final result:

+---+---+--------+---+
| c1| c2|      c3| c4|
+---+---+--------+---+
| a1| 10|  ACTIVE|ds1|
| a1| 20|  ACTIVE|ds1|
| a2| 50|  ACTIVE|ds1|
| a3| 60|  ACTIVE|ds1|
| a1| 30|INACTIVE|ds1|
| a1| 40|INACTIVE|ds1|
+---+---+--------+---+


And again, you don't need all these intermediate values. I was just being verbose to help trace through the process.
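
For reference, the three steps can be chained into one expression. A minimal sketch, assuming a local SparkSession named spark; the df2 rows are hypothetical (the original df2 data is not shown here) and were chosen so that two rows end up deactivated:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

// Assumption: a throwaway local session; in spark-shell `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("deactivate-sketch")
  .getOrCreate()
import spark.implicits._

val df1 = Seq(
  ("a1", 10, "ACTIVE", "ds1"),
  ("a1", 20, "ACTIVE", "ds1"),
  ("a2", 50, "ACTIVE", "ds1"),
  ("a3", 60, "ACTIVE", "ds1")).toDF("c1", "c2", "c3", "c4")

// Hypothetical df2 contents, not taken from the question.
val df2 = Seq(
  ("a1", 10, "ACTIVE", "ds2"),
  ("a1", 20, "ACTIVE", "ds2"),
  ("a1", 30, "ACTIVE", "ds2"),
  ("a1", 40, "ACTIVE", "ds2")).toDF("d1", "d2", "d3", "d4")

val result = df1.union(
  df1
    .join(df2, df1("c1") === df2("d1"), "inner")       // keep df2 rows with a key in df1
    .select($"d1", $"d2", $"d3", lit("ds1").as("d4"))  // relabel the source as ds1
    .dropDuplicates()                                  // the join repeats rows per matching key
    .except(df1)                                       // rows not already present in df1
    .select($"d1", $"d2", lit("INACTIVE").as("d3"), $"d4"))

result.show()
```

Both except and union match columns by position rather than by name, which is why the d1..d4 columns line up with c1..c4; the result takes its column names from df1.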
    
             
                                                        
            
            
              
                