Manipulating a dataframe within a Spark UDF

I have a UDF that filters and selects values from a dataframe, but it runs into an "object not serializable" error. Details below.

Suppose I have a dataframe df1 that has

3 Answers

孤独总比滥情好

2021-01-21 10:32
You can't use Dataset operations inside UDFs. A UDF can only operate on existing columns and produce one result column. It can't filter a Dataset or perform aggregations, but it can be used inside a filter. A UDAF can also aggregate values.

Instead, you can use .as[SomeCaseClass] to make a Dataset from the DataFrame and use normal, strongly typed functions inside filter, map, and reduce.
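A minimal sketch of that typed approach, assuming a local SparkSession. The case class `Event`, its fields, and the sample rows are made up for illustration and are not taken from the question:

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical row type; the field names are assumptions for illustration.
// It is declared at the top level so Spark can derive an encoder for it.
case class Event(id: Int, amount: Int)

object TypedDatasetSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("typed-dataset-sketch")
      .getOrCreate()
    import spark.implicits._

    // Sample data standing in for the real DataFrame.
    val df = Seq((1, 10), (2, 0), (3, 25)).toDF("id", "amount")

    // Convert the untyped DataFrame into a Dataset[Event].
    val events: Dataset[Event] = df.as[Event]

    // Ordinary Scala functions run per row; no UDF is involved.
    val total = events
      .filter(e => e.amount > 0) // keep rows with a positive amount
      .map(e => e.amount * 2)    // transform each remaining row
      .reduce(_ + _)             // aggregate to a single value

    println(total) // 70 for the sample rows above
    spark.stop()
  }
}
```

The column names of the DataFrame have to match the case class field names for `.as[Event]` to resolve.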

Edit: If you want to join your bigDF with every small DF in the smallDFs list, you can do:
```
import org.apache.spark.sql.functions._
val bigDF = // some processing
val smallDFs = Seq(someSmallDF1, someSmallDF2)
val joined = smallDFs.foldLeft(bigDF)((acc, df) => acc.join(broadcast(df), "join_column"))
```
broadcast is a function that adds a broadcast hint to the small DF, so that Spark can use a more efficient broadcast join instead of a sort-merge join.
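For a version that runs end to end, here is a sketch with tiny in-memory DataFrames in place of the real ones. The helper `joinAll`, the sample rows, and the column names `payload`, `left_tag` and `right_tag` are assumptions for illustration:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.broadcast

object BroadcastFoldSketch {
  // Hypothetical helper: joins `big` with each small DataFrame in turn
  // on a shared key column, hinting that every small side be broadcast.
  def joinAll(big: DataFrame, smalls: Seq[DataFrame], key: String): DataFrame =
    smalls.foldLeft(big)((acc, df) => acc.join(broadcast(df), key))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("broadcast-fold-sketch")
      .getOrCreate()
    import spark.implicits._

    // Sample data standing in for bigDF and the small DataFrames.
    val bigDF  = Seq((1, "a"), (2, "b"), (3, "c")).toDF("join_column", "payload")
    val small1 = Seq((1, "x"), (2, "y")).toDF("join_column", "left_tag")
    val small2 = Seq((2, "p"), (3, "q")).toDF("join_column", "right_tag")

    val joined = joinAll(bigDF, Seq(small1, small2), "join_column")
    joined.show()    // prints the joined rows
    joined.explain() // prints the physical plan, where the join strategy is visible
    spark.stop()
  }
}
```

Because join with a column name defaults to an inner join, only keys present in bigDF and in every small DF survive; with the sample rows above that is just the key 2. Pass a join type explicitly if rows without a match should be kept.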

囚心锁ツ

2021-01-21 10:48

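```
import org.apache.spark.sql.functions._
// toDF needs the session implicits; assumes a SparkSession named `spark`
// (already in scope in spark-shell).
import spark.implicits._

val events = Seq(
  (1, 1, 2, 3, 4),
  (2, 1, 2, 3, 4),
  (3, 1, 2, 3, 4),
  (4, 1, 2, 3, 4),
  (5, 1, 2, 3, 4)).toDF("ID", "amt1", "amt2", "amt3", "amt4")

// Running state carried from one row to the next.
// Caveat: Spark does not guarantee row order or share this state across
// partitions, so this only behaves as intended when all rows are processed
// in order in a single partition.
var prev_amt5 = 0
var i = 1

def getamt5value(ID: Int, amt1: Int, amt2: Int, amt3: Int, amt4: Int): Int = {
  // Reset the running value on the first call, then count the calls.
  if (i == 1) {
    i = i + 1
    prev_amt5 = 0
  } else {
    i = i + 1
  }
  if (ID == 0) {
    if (amt1 == 0) {
      val cur_amt5 = 1
      prev_amt5 = cur_amt5
      cur_amt5
    } else {
      val cur_amt5 = 1 * (amt2 + amt3)
      prev_amt5 = cur_amt5
      cur_amt5
    }
  } else if (amt4 == 0 || (prev_amt5 == 0 && amt1 == 0)) {
    val cur_amt5 = 0
    prev_amt5 = cur_amt5
    cur_amt5
  } else {
    // Accumulate on top of the value computed for the previous row.
    val cur_amt5 = prev_amt5 + amt2 + amt3 + amt4
    prev_amt5 = cur_amt5
    cur_amt5
  }
}

val getamt5 = udf { (ID: Int, amt1: Int, amt2: Int, amt3: Int, amt4: Int) =>
  getamt5value(ID, amt1, amt2, amt3, amt4)
}

// The original referenced an undefined `myDF`; `events` is the DataFrame built above.
events.withColumn("amnt5", getamt5(events.col("ID"), events.col("amt1"), events.col("amt2"), events.col("amt3"), events.col("amt4"))).show()
```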


感情败类

2021-01-21 10:57
1) No, you can only use plain Scala code within UDFs.

2) If I interpreted your code correctly, you can achieve your goal with:
```
df2
  .join(
    df1.select($"ID",y_list.foldLeft(lit(0))(_ + _).as("Result")),Seq("ID")
  )
```