Pyspark - Calculate RMSE between actuals and predictions for a groupby - AssertionError: all exprs should be Column

自闭症患者 2021-01-23 11:26

I have a function that calculates RMSE between the predictions and actuals of an entire dataframe:

def calculate_rmse(df, actual_column, prediction_column):
    RMSE = F.u
2 Answers
  • 2021-01-23 12:07

    I don't think you need a UDF for this - you should be able to take the difference between the two columns (df.withColumn('difference', col('true') - col('pred'))), then compute the square of that column (df.withColumn('squared_difference', pow(col('difference'), lit(2).astype(IntegerType())))), and finally compute the average of that column (df.select(avg('squared_difference').alias('rmse'))). Putting it all together with an example:

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F
    from pyspark.sql.types import IntegerType
    
    spark = SparkSession.builder.getOrCreate()
    
    df = spark.createDataFrame([(0.0, 1.0),
                                (1.0, 2.0),
                                (3.0, 5.0),
                                (1.0, 8.0)], schema=['true', 'predicted'])
    
    df = df.withColumn('difference', F.col('true') - F.col('predicted'))
    df = df.withColumn('squared_difference', F.pow(F.col('difference'), F.lit(2).astype(IntegerType())))
    rmse = df.select(F.avg(F.col('squared_difference')).alias('rmse'))
    
    df.show()
    rmse.show()
    

    Output:

    +----+---------+----------+------------------+
    |true|predicted|difference|squared_difference|
    +----+---------+----------+------------------+
    | 0.0|      1.0|      -1.0|               1.0|
    | 1.0|      2.0|      -1.0|               1.0|
    | 3.0|      5.0|      -2.0|               4.0|
    | 1.0|      8.0|      -7.0|              49.0|
    +----+---------+----------+------------------+
    
    +-----+
    | rmse|
    +-----+
    |13.75|
    +-----+
    

    Hope this helps!

    Edit

    Sorry, I forgot to take the square root of the result - the last line becomes:

    rmse = df.select(F.sqrt(F.avg(F.col('squared_difference'))).alias('rmse'))
    

    and the output becomes:

    +------------------+
    |              rmse|
    +------------------+
    |3.7080992435478315|
    +------------------+
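    As a sanity check, the same numbers fall out of a plain-Python computation over the four example rows, without Spark:

    ```python
    import math

    # the four (true, predicted) pairs from the example dataframe above
    true = [0.0, 1.0, 3.0, 1.0]
    pred = [1.0, 2.0, 5.0, 8.0]

    # mean of the squared differences, then its square root
    mse = sum((t - p) ** 2 for t, p in zip(true, pred)) / len(true)
    rmse = math.sqrt(mse)
    print(mse, rmse)  # 13.75 3.7080992435478315
    ```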
    
  • 2021-01-23 12:13

    If you want to calculate RMSE by group, here is a slight adaptation of the solution I proposed to your earlier question:

    import pyspark.sql.functions as psf
    
    def compute_RMSE(expected_col, actual_col):
    
      # wrap the chained calls in parentheses so the expression
      # can span multiple lines
      rmse = (old_df.withColumn("squarederror",
                                psf.pow(psf.col(actual_col) - psf.col(expected_col),
                                        psf.lit(2)))
              .groupby('start_month', 'start_week')
              .agg(psf.avg(psf.col("squarederror")).alias("mse"))
              .withColumn("rmse", psf.sqrt(psf.col("mse"))))
    
      return rmse
    
    
    compute_RMSE("col1", "col2")
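
    The grouped version boils down to: collect the squared errors within each group, average them, then take the square root. A minimal plain-Python sketch of that idea, using made-up rows keyed by a hypothetical start_month column:

    ```python
    import math
    from collections import defaultdict

    # hypothetical (start_month, expected, actual) rows
    rows = [(1, 0.0, 1.0), (1, 1.0, 2.0), (2, 3.0, 5.0), (2, 1.0, 8.0)]

    # gather squared errors per group
    squared_errors = defaultdict(list)
    for month, expected, actual in rows:
        squared_errors[month].append((actual - expected) ** 2)

    # per-group RMSE: square root of the mean squared error
    rmse_by_month = {m: math.sqrt(sum(v) / len(v))
                     for m, v in squared_errors.items()}
    print(rmse_by_month)  # group 1 -> 1.0, group 2 -> sqrt(26.5)
    ```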
    