I need to create a UDF for use in PySpark (Python) that uses a Java object for its internal calculations.
If it were plain Python, I would do something like the following:
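A minimal sketch of such a plain-Python UDF (the add_five name and the +5 logic here are illustrative assumptions, not from the original question):

from pyspark.sql import functions as F
from pyspark.sql.types import LongType

# Pure-Python UDF: the function runs row by row in a Python worker process.
add_five = F.udf(lambda x: int(x) + 5, LongType())

df = spark.createDataFrame([(float(i),) for i in range(100)], ["a"])
df.withColumn("b", add_five("a")).show(5)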
In line with https://dzone.com/articles/pyspark-java-udf-integration-1, you can define a UDF1 in Java:
package com.example.spark;

import org.apache.spark.sql.api.java.UDF1;

public class AddNumber implements UDF1<Long, Long> {
    // Spark calls this once per input row.
    @Override
    public Long call(Long num) throws Exception {
        return num + 5;
    }
}
After packaging the class into a jar and adding it to your PySpark session with --jars (for a local jar, e.g. pyspark --jars /path/to/your-udf.jar) or --packages (for Maven coordinates), you can register and use it in PySpark:
>>> from pyspark.sql import functions as F
>>> from pyspark.sql.types import FloatType, LongType
>>> df = spark.createDataFrame([float(i) for i in range(100)], FloatType()).toDF("a")
>>> spark.udf.registerJavaFunction("addNumber", "com.example.spark.AddNumber", LongType())
>>> df.withColumn("b", F.expr("addNumber(a)")).show(5)
+---+---+
| a| b|
+---+---+
|0.0| 5|
|1.0| 6|
|2.0| 7|
|3.0| 8|
|4.0|  9|
+---+---+
only showing top 5 rows
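Because registerJavaFunction registers the Java class under a SQL function name, you can also invoke it through plain Spark SQL (a small usage sketch; the view name t is arbitrary):

>>> df.createOrReplaceTempView("t")
>>> spark.sql("SELECT a, addNumber(a) AS b FROM t").show(5)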