I need to create a UDF to be used in PySpark (Python) that uses a Java object for its internal calculations.
If it were plain Python, I would do something like:
In line with https://dzone.com/articles/pyspark-java-udf-integration-1, you could define a UDF1 in Java:
package com.example.spark;

import org.apache.spark.sql.api.java.UDF1;

public class AddNumber implements UDF1<Long, Long> {
    @Override
    public Long call(Long num) throws Exception {
        return num + 5;
    }
}
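Once that class is compiled into a jar, it has to be visible to the driver and executors. One way to do that is a spark.jars config entry when building the session; the path below is just a placeholder:
from pyspark.sql import SparkSession

# Equivalent to passing --jars on the pyspark command line;
# replace the path with wherever your compiled UDF jar lives.
spark = SparkSession.builder \
    .config("spark.jars", "/path/to/your-udfs.jar") \
    .getOrCreate()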
Then, after adding the jar to your PySpark session (for example with --jars <your-jar>, or the builder config above), you can use it in PySpark:
from pyspark.sql import functions as F
from pyspark.sql.types import FloatType, LongType
>>> df = spark.createDataFrame([float(i) for i in range(100)], FloatType()).toDF("a")
>>> spark.udf.registerJavaFunction("addNumber", "com.example.spark.AddNumber", LongType())
>>> df.withColumn("b", F.expr("addNumber(a)")).show(5)
+---+---+
| a| b|
+---+---+
|0.0| 5|
|1.0| 6|
|2.0| 7|
|3.0| 8|
|4.0| 9|
+---+---+
only showing top 5 rows
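Since registerJavaFunction puts the function into Spark's SQL function registry, the same UDF can also be called from a SQL query. A minimal sketch, reusing df from above (the view name is just an example):
>>> df.createOrReplaceTempView("numbers")
>>> spark.sql("SELECT a, addNumber(a) AS b FROM numbers").show(5)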
I got this working with the help of another question (and answer) of your own about UDAFs.
Spark provides a udf() method for wrapping Scala FunctionN, so we can wrap the Java function in Scala and use that from PySpark. Your Java method needs to be static or on a class that implements Serializable.
package com.example

import org.apache.spark.sql.UserDefinedFunction
import org.apache.spark.sql.functions.udf

// Scala wrapper exposing the Java call as a Spark UserDefinedFunction
class MyUdf extends Serializable {
  def getUdf: UserDefinedFunction = udf(() => MyJavaClass.MyJavaMethod())
}
Usage in PySpark:
def my_udf():
    from pyspark.sql.column import Column, _to_java_column, _to_seq
    pcls = "com.example.MyUdf"
    # Load the Scala wrapper on the JVM side, instantiate it, and grab the
    # underlying UserDefinedFunction's apply method
    jc = sc._jvm.java.lang.Thread.currentThread() \
        .getContextClassLoader().loadClass(pcls).newInstance().getUdf().apply
    return Column(jc(_to_seq(sc, [], _to_java_column)))
rdd1 = sc.parallelize([{'c1': 'a'}, {'c1': 'b'}, {'c1': 'c'}])
df1 = rdd1.toDF()
df2 = df1.withColumn('mycol', my_udf())
As with the UDAF in your other question and answer, we can pass columns into the UDF by listing them in the sequence, i.e. return Column(jc(_to_seq(sc, ["col1", "col2"], _to_java_column))); see the sketch below.
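For completeness, a minimal sketch of the column-taking variant; it assumes the Scala side wraps a one-argument function (e.g. udf((s: String) => MyJavaClass.MyJavaMethod(s))), and my_udf_on/df3 are just hypothetical names:
def my_udf_on(col):
    from pyspark.sql.column import Column, _to_java_column, _to_seq
    pcls = "com.example.MyUdf"
    jc = sc._jvm.java.lang.Thread.currentThread() \
        .getContextClassLoader().loadClass(pcls).newInstance().getUdf().apply
    # Forward the given DataFrame column to the JVM-side UDF
    return Column(jc(_to_seq(sc, [col], _to_java_column)))

df3 = df1.withColumn('mycol', my_udf_on('c1'))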