I need to create a UDF for use in PySpark (Python) that uses a Java object for its internal calculations.
If it were plain Python, I would do something like the following:
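A minimal sketch of such a plain-Python UDF (the add_five name and the +5 logic here are illustrative assumptions, not from the original question):

from pyspark.sql import functions as F
from pyspark.sql.types import LongType

# Pure-Python UDF: the function runs row by row in a Python worker process.
add_five = F.udf(lambda x: int(x) + 5, LongType())

df = spark.createDataFrame([(float(i),) for i in range(100)], ["a"])
df.withColumn("b", add_five("a")).show(5)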
In line with https://dzone.com/articles/pyspark-java-udf-integration-1, you can define a UDF1 in Java:
package com.example.spark;

import org.apache.spark.sql.api.java.UDF1;

public class AddNumber implements UDF1<Long, Long> {
    // Spark calls this once per input row.
    @Override
    public Long call(Long num) throws Exception {
        return num + 5;
    }
}
After packaging the class into a jar and adding it to your PySpark session with --jars (for a local jar, e.g. pyspark --jars /path/to/your-udf.jar) or --packages (for Maven coordinates), you can register and use it in PySpark:
>>> from pyspark.sql import functions as F
>>> from pyspark.sql.types import FloatType, LongType
>>> df = spark.createDataFrame([float(i) for i in range(100)], FloatType()).toDF("a")
>>> spark.udf.registerJavaFunction("addNumber", "com.example.spark.AddNumber", LongType())
>>> df.withColumn("b", F.expr("addNumber(a)")).show(5)
+---+---+
| a| b|
+---+---+
|0.0| 5|
|1.0| 6|
|2.0| 7|
|3.0| 8|
|4.0|  9|
+---+---+
only showing top 5 rows
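Because registerJavaFunction registers the Java class under a SQL function name, you can also invoke it through plain Spark SQL (a small usage sketch; the view name t is arbitrary):

>>> df.createOrReplaceTempView("t")
>>> spark.sql("SELECT a, addNumber(a) AS b FROM t").show(5)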