Pyspark - Calculate RMSE between actuals and predictions for a groupby - AssertionError: all exprs should be Column


I have a function that calculates RMSE for the preds and actuals of an entire dataframe:

def calculate_rmse(df, actual_column, prediction_column):
    RMSE = F.u


                      
2 Answers

        
                         				            
            
           
            
                              
                
              
              
                
半阙折子戏, 2021-01-23 12:07
              
            
            
                                                                       
I don't think you need a UDF for this. You can take the difference between the two columns (df.withColumn('difference', col('true') - col('predicted'))), square that column (df.withColumn('squared_difference', pow(col('difference'), lit(2)))), and then aggregate it with the average (df.select(avg('squared_difference'))). Note that the average is an aggregation, so it goes in a select (or agg) rather than a withColumn. Putting it all together with an example:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Small example with actual ('true') and predicted values
df = spark.createDataFrame([(0.0, 1.0),
                            (1.0, 2.0),
                            (3.0, 5.0),
                            (1.0, 8.0)], schema=['true', 'predicted'])

# Per-row error and squared error
df = df.withColumn('difference', F.col('true') - F.col('predicted'))
df = df.withColumn('squared_difference', F.pow(F.col('difference'), F.lit(2)))
# Average the squared errors over the whole DataFrame
rmse = df.select(F.avg(F.col('squared_difference')).alias('rmse'))

# show() prints the table itself and returns None, so no print() is needed
df.show()
rmse.show()


Output:

+----+---------+----------+------------------+
|true|predicted|difference|squared_difference|
+----+---------+----------+------------------+
| 0.0|      1.0|      -1.0|               1.0|
| 1.0|      2.0|      -1.0|               1.0|
| 3.0|      5.0|      -2.0|               4.0|
| 1.0|      8.0|      -7.0|              49.0|
+----+---------+----------+------------------+

+-----+
| rmse|
+-----+
|13.75|
+-----+


Hope this helps!

Edit

Sorry, I forgot to take the square root of the result - the last line becomes:

rmse = df.select(F.sqrt(F.avg(F.col('squared_difference'))).alias('rmse'))


and the output becomes:

+------------------+
|              rmse|
+------------------+
|3.7080992435478315|
+------------------+
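
As a quick sanity check on the numbers in the two outputs above, the same arithmetic can be reproduced in plain Python without Spark. This is only a sketch for verifying the example by hand; the variable names are illustrative and not part of the original answer.

```python
import math

# (true, predicted) pairs copied from the example DataFrame above
pairs = [(0.0, 1.0), (1.0, 2.0), (3.0, 5.0), (1.0, 8.0)]

# Square each per-row difference, average them, then take the square root
squared_differences = [(t - p) ** 2 for t, p in pairs]
mse = sum(squared_differences) / len(squared_differences)
rmse = math.sqrt(mse)

print(squared_differences)  # [1.0, 1.0, 4.0, 49.0]
print(mse)                  # 13.75
print(rmse)
```

The mean squared error is 13.75, which is the value the first version of the code printed under the 'rmse' label, and its square root agrees with the corrected output.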

                                                                        
                                                        
            
            
              
                
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
名媛妹妹, 2021-01-23 12:13
              
            
            
                                                                       
If you want to calculate RMSE by group, here is a slight adaptation of the solution I proposed to your question:

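import pyspark.sql.functions as psf

def compute_RMSE(df, expected_col, actual_col):
    # Squared error per row, then mean per group, then square root.
    # The chain is wrapped in parentheses so it can span several lines.
    rmse = (
        df.withColumn("squarederror",
                      psf.pow(psf.col(actual_col) - psf.col(expected_col),
                              psf.lit(2)))
        .groupby('start_month', 'start_week')
        .agg(psf.avg(psf.col("squarederror")).alias("mse"))
        .withColumn("rmse", psf.sqrt(psf.col("mse")))
    )
    return rmse


# old_df must contain 'start_month', 'start_week', 'col1' and 'col2'
compute_RMSE(old_df, "col1", "col2")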
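
To cross-check what the grouped version computes, a plain-Python reference can be run on a handful of rows and compared against the Spark result. The sketch below is an assumption-laden illustration: the helper name grouped_rmse and the sample rows are hypothetical, and only the column names come from the answer above.

```python
import math
from collections import defaultdict

def grouped_rmse(rows, key_fields, expected_field, actual_field):
    """Return {group key tuple: RMSE} for a list of dict rows (hypothetical helper)."""
    squared_errors = defaultdict(list)
    for row in rows:
        key = tuple(row[f] for f in key_fields)
        squared_errors[key].append((row[actual_field] - row[expected_field]) ** 2)
    # Mean of the squared errors per group, then the square root
    return {key: math.sqrt(sum(errs) / len(errs))
            for key, errs in squared_errors.items()}

# Hypothetical sample rows, made up for illustration
rows = [
    {"start_month": 1, "start_week": 1, "col1": 0.0, "col2": 1.0},
    {"start_month": 1, "start_week": 1, "col1": 1.0, "col2": 2.0},
    {"start_month": 1, "start_week": 2, "col1": 3.0, "col2": 5.0},
    {"start_month": 1, "start_week": 2, "col1": 1.0, "col2": 8.0},
]

result = grouped_rmse(rows, ["start_month", "start_week"], "col1", "col2")
print(result)
```

Each group gets its own mean squared error before the square root is taken, which mirrors the groupby, agg, and sqrt steps in compute_RMSE.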

                                                                        
                                                        
            
            
              
                