Manipulating a dataframe within a Spark UDF

别跟我提以往 2021-01-21 10:18

I have a UDF that filters and selects values from a dataframe, but it runs into an "object not serializable" error. Details below.

Suppose I have a dataframe df1 that has

3 Answers
  • 2021-01-21 10:32

    You can't use Dataset operations inside a UDF. A UDF can only operate on the values of existing columns and produce a single result column. It can't filter a Dataset or perform aggregations, although a UDF can itself be used inside filter. To aggregate values, use a UDAF.

    Instead, you can use .as[SomeCaseClass] to make Dataset from DataFrame and use normal, strongly typed functions inside filter, map, reduce.
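
    As a rough illustration of both points, here is a minimal sketch. The Payment case class, the sample data, and the local SparkSession are assumptions made up for this example; they are not from the question.

    ```scala
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.udf

    // Hypothetical record type and data, for illustration only.
    case class Payment(id: Int, amount: Int)

    // Assumes a local session; in spark-shell `spark` already exists.
    val spark = SparkSession.builder().master("local[*]").appName("udf-vs-dataset").getOrCreate()
    import spark.implicits._

    val df = Seq((1, 10), (2, 0), (3, 25)).toDF("id", "amount")

    // A UDF only sees the column values of one row; here it is used inside filter.
    val isLarge = udf((amount: Int) => amount > 5)
    val viaUdf = df.filter(isLarge($"amount"))

    // Typed alternative: convert to Dataset[Payment] and pass ordinary Scala functions.
    val ds = df.as[Payment]
    val viaDataset = ds.filter(_.amount > 5).map(p => p.copy(amount = p.amount * 2))
    val total = ds.map(_.amount).reduce(_ + _)
    ```

    Neither version touches another DataFrame from inside the function, which is what triggers the serialization error in the question.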

    Edit: If you want to join your bigDF with each small DataFrame in the smallDFs list, you can do:

    import org.apache.spark.sql.functions._
    val bigDF = // some processing
    val smallDFs = Seq(someSmallDF1, someSmallDF2)
    val joined = smallDFs.foldLeft(bigDF)((acc, df) => acc.join(broadcast(df), "join_column"))
    

    broadcast is a function that adds a broadcast hint to a small DataFrame, so that Spark uses the more efficient broadcast join for it instead of a sort-merge join.
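
    A self-contained variant of the snippet above that can be pasted into a Scala REPL, as a sketch only. The table contents, the column names other than join_column, and the local SparkSession are assumptions for illustration.

    ```scala
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.broadcast

    // Assumes a local session; in spark-shell `spark` already exists.
    val spark = SparkSession.builder().master("local[*]").appName("broadcast-fold").getOrCreate()
    import spark.implicits._

    // Hypothetical data: one larger table and two small lookup tables.
    val bigDF = Seq((1, "a"), (2, "b"), (3, "c")).toDF("join_column", "payload")
    val someSmallDF1 = Seq((1, "x"), (2, "y")).toDF("join_column", "label1")
    val someSmallDF2 = Seq((1, true), (3, false)).toDF("join_column", "flag2")

    // Start from bigDF and join each small table in turn, hinting a broadcast each time.
    val smallDFs = Seq(someSmallDF1, someSmallDF2)
    val joined = smallDFs.foldLeft(bigDF)((acc, df) => acc.join(broadcast(df), "join_column"))

    joined.explain() // inspect the physical plan for the join strategy Spark chose
    joined.show()
    ```

    Note that join with a single column name is an inner join by default, so only keys present in bigDF and in every small DataFrame survive.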

  • 2021-01-21 10:48
    import org.apache.spark.sql.functions._
    val events = Seq (
    (1,1,2,3,4),
    (2,1,2,3,4),
    (3,1,2,3,4),
    (4,1,2,3,4),
    (5,1,2,3,4)).toDF("ID","amt1","amt2","amt3","amt4")
    
    var prev_amt5=0
    var i=1
    def getamt5value(ID:Int,amt1:Int,amt2:Int,amt3:Int,amt4:Int) : Int = {  
      if(i==1){
        i=i+1
        prev_amt5=0
      }else{
        i=i+1
      }
      if (ID == 0)
      {
        if(amt1==0)
        {
          val cur_amt5= 1
          prev_amt5=cur_amt5
          cur_amt5
        }else{
          val cur_amt5=1*(amt2+amt3)
          prev_amt5=cur_amt5
          cur_amt5
        }
      }else if (amt4==0 || (prev_amt5==0 && amt1==0)){
        val cur_amt5=0
        prev_amt5=cur_amt5
        cur_amt5
      }else{
        val cur_amt5=prev_amt5 +  amt2 + amt3 + amt4
        prev_amt5=cur_amt5
        cur_amt5
      }
    }
    
    val getamt5 = udf {(ID:Int,amt1:Int,amt2:Int,amt3:Int,amt4:Int) =>            
       getamt5value(ID,amt1,amt2,amt3,amt4)    
    }
    events.withColumn("amnt5", getamt5(events.col("ID"),events.col("amt1"),events.col("amt2"),events.col("amt3"),events.col("amt4"))).show()
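
    One caveat: the code above depends on prev_amt5 and i being updated in row order, and Spark does not guarantee the order in which a UDF is evaluated across partitions or executors. A possible alternative, sketched here under the assumption that the data is small enough to process on the driver, is to express the same recurrence as a pure function and thread the previous value through explicitly. Row5 and nextAmt5 are hypothetical names introduced for this example.

    ```scala
    // Hypothetical row type mirroring the columns of `events`.
    case class Row5(id: Int, amt1: Int, amt2: Int, amt3: Int, amt4: Int)

    // Same branching as getamt5value, but the previous result is an explicit argument.
    def nextAmt5(prev: Int, r: Row5): Int =
      if (r.id == 0) {
        if (r.amt1 == 0) 1 else r.amt2 + r.amt3
      } else if (r.amt4 == 0 || (prev == 0 && r.amt1 == 0)) 0
      else prev + r.amt2 + r.amt3 + r.amt4

    // Same values as `events`; in Spark these rows would be collected after an explicit sort.
    val rows = (1 to 5).map(id => Row5(id, 1, 2, 3, 4))

    // scanLeft carries the previous amt5 from row to row; tail drops the initial seed 0.
    val sorted = rows.sortBy(_.id)
    val amt5 = sorted.scanLeft(0)(nextAmt5).tail
    val result = sorted.zip(amt5) // amt5 is 9, 18, 27, 36, 45 for this input
    ```

    Because nextAmt5 holds no state, the result depends only on the sort order of the rows, not on how the computation is scheduled.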
    
  • 2021-01-21 10:57

    1) No, you can only use plain Scala code within UDFs.

    2) If I interpreted your code correctly, you can achieve your goal with:

    df2
      .join(
        df1.select($"ID",y_list.foldLeft(lit(0))(_ + _).as("Result")),Seq("ID")
      )
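
    To make the foldLeft step concrete, here is a minimal sketch with made-up inputs. The question's df1, df2 and y_list are not fully shown, so the columns y1, y2 and name, the sample rows, and the local SparkSession are assumptions.

    ```scala
    import org.apache.spark.sql.{Column, SparkSession}
    import org.apache.spark.sql.functions.lit

    // Assumes a local session; in spark-shell `spark` already exists.
    val spark = SparkSession.builder().master("local[*]").appName("sum-columns").getOrCreate()
    import spark.implicits._

    // Hypothetical stand-ins for the question's df1, df2 and y_list.
    val df1 = Seq((1, 10, 20), (2, 5, 7)).toDF("ID", "y1", "y2")
    val df2 = Seq((1, "a"), (2, "b")).toDF("ID", "name")
    val y_list: Seq[Column] = Seq($"y1", $"y2")

    // lit(0) is the starting value; each step adds one more column to the expression,
    // so the fold builds the single Column expression 0 + y1 + y2.
    val summed = df1.select($"ID", y_list.foldLeft(lit(0))(_ + _).as("Result"))
    val result = df2.join(summed, Seq("ID"))
    result.show()
    ```

    The fold runs on the driver and only builds a column expression; the actual addition is done by Spark per row, so no UDF is needed.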
    