How to reference a DataFrame from inside a UDF on another DataFrame?


How do you reference a PySpark DataFrame from inside a UDF that is executing on another DataFrame?

Here's a dummy example. I am creating two dataframes scores


                      
2 answers

温柔的废话, 2021-01-06 03:14

Convert the pairs into a dictionary so that names are easy to look up:

# Map each student ID to the matching last name.
data2 = dict(zip(student_ids, last_name))


Instead of creating an RDD and turning it into a DataFrame, create a broadcast variable:

# rdd = sc.parallelize(data2)
# lastnames = sqlCtx.createDataFrame(rdd, schema)
lastnames = sc.broadcast(data2)  # read-only copy of the dict, available on the executors


Now access it inside the UDF through the value attribute of the broadcast variable (lastnames).

from pyspark.sql.functions import udf
def getLastName(sid):
    return lastnames.value[sid]

                                                                        
                                                        
            
            
              
                
佛祖请我去吃肉, 2021-01-06 03:15

You can't directly reference a DataFrame (or an RDD) from inside a UDF. The DataFrame object is a handle on your driver that Spark uses to represent the data and the actions that will happen out on the cluster. The code inside your UDFs runs out on the cluster at a time of Spark's choosing. Spark does this by serializing that code, making copies of any variables included in the closure, and sending them out to each worker.
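
The copy semantics described above can be imitated without a cluster. The sketch below is only an analogy: it assumes Python's standard pickle module as a stand-in for Spark's closure serialization, and ship_to_worker and the sample dictionary are hypothetical names. It shows that the "worker" side gets its own copy, so changes made there never reach the driver's object.

```python
import pickle


def ship_to_worker(obj):
    """Simulate sending a closure variable to a worker: serialize, then deserialize."""
    return pickle.loads(pickle.dumps(obj))


# Object that lives on the "driver".
driver_lookup = {"student1": "Granger"}

# The "worker" receives an independent copy, not a reference.
worker_lookup = ship_to_worker(driver_lookup)

# A change made on the worker side stays on the worker side.
worker_lookup["student2"] = "Weasley"

driver_keys = sorted(driver_lookup)
worker_keys = sorted(worker_lookup)
```

A plain dictionary survives this round trip because it is just data. A DataFrame is different: as described above, it is a handle that lives on the driver, not the data itself.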

What you want to do instead is use the constructs Spark provides in its API to join or combine the two DataFrames. If one of the data sets is small, you can manually send out the data in a broadcast variable and then access it from your UDF. Otherwise, you can just create the two DataFrames like you did, then use the join operation to combine them. Something like this should work:

joined = scores.withColumnRenamed("student_id", "join_id")
joined = joined.join(lastnames, joined.join_id == lastnames.student_id)\
               .drop("join_id")
joined.show()

+---------+-----+----------+---------+
|  subject|score|student_id|last_name|
+---------+-----+----------+---------+
|     Math|   13|  student1|  Granger|
|  Biology|   85|  student1|  Granger|
|Chemistry|   77|  student1|  Granger|
|  Physics|   25|  student1|  Granger|
|     Math|   50|  student2|  Weasley|
|  Biology|   45|  student2|  Weasley|
|Chemistry|   65|  student2|  Weasley|
|  Physics|   79|  student2|  Weasley|
|     Math|    9|  student3|   Potter|
|  Biology|    2|  student3|   Potter|
|Chemistry|   84|  student3|   Potter|
|  Physics|   43|  student3|   Potter|
+---------+-----+----------+---------+


It's also worth noting that, under the hood, Spark DataFrames have an optimization where a DataFrame that is part of a join can be converted to a broadcast variable to avoid a shuffle if it is small enough (the cutoff is the spark.sql.autoBroadcastJoinThreshold setting, 10 MB by default). So if you use the join method listed above, you should get good performance on small lookup tables without sacrificing the ability to handle larger data sets.
                                                                        
                                                        
            
            
              
                