Implement a Java UDF and call it from PySpark

时光取名叫无心 2021-02-15 10:27

I need to create a UDF for use in PySpark (Python) that relies on a Java object for its internal calculations.

If it were plain Python, I would do something like:

2 answers
  •  慢半拍i (original poster)
    2021-02-15 11:06

    I got this working with the help of another question (and answer) of your own about UDAFs.

    Spark provides a udf() method for wrapping a Scala FunctionN, so we can wrap the Java function in Scala and use that. The Java method must be static, or belong to a class that implements Serializable.

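    package com.example

    // Spark 2.x and later; in Spark 1.x this class lives in org.apache.spark.sql
    import org.apache.spark.sql.expressions.UserDefinedFunction
    import org.apache.spark.sql.functions.udf

    // Serializable so that Spark can ship the wrapper to the executors
    class MyUdf extends Serializable {
      // Wrap the static Java method in a zero-argument Scala function
      def getUdf: UserDefinedFunction = udf(() => MyJavaClass.MyJavaMethod())
    }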
    

    Usage in PySpark:

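    def my_udf():
        # Private PySpark helpers for converting Python columns to JVM columns
        from pyspark.sql.column import Column, _to_java_column, _to_seq
        pcls = "com.example.MyUdf"
        # Load the Scala wrapper through the JVM context class loader,
        # instantiate it, and grab the UDF's apply method
        jc = sc._jvm.java.lang.Thread.currentThread() \
            .getContextClassLoader().loadClass(pcls).newInstance().getUdf().apply
        # The UDF takes no arguments, so pass an empty column sequence
        return Column(jc(_to_seq(sc, [], _to_java_column)))

    # sc is the active SparkContext (predefined in the pyspark shell)
    rdd1 = sc.parallelize([{'c1': 'a'}, {'c1': 'b'}, {'c1': 'c'}])
    df1 = rdd1.toDF()
    df2 = df1.withColumn('mycol', my_udf())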
    

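    As with the UDAF in your other question and answer, we can pass columns into it by listing their names: return Column(jc(_to_seq(sc, ["col1", "col2"], _to_java_column))). The Scala function wrapped by udf() must then accept the same number of arguments.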
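
    The column-passing variant can be folded into a small reusable helper. This is only a sketch under a few assumptions: the names load_udf_apply and java_udf_column are made up for illustration, the wrapper class is assumed to have a no-argument constructor and a getUdf method like com.example.MyUdf above, and the private helpers _to_java_column and _to_seq are assumed to be importable from pyspark.sql.column as in the answer (their location may differ between PySpark versions).

```python
def load_udf_apply(sc, class_name):
    """Return the JVM-side apply method of the UDF exposed by class_name.

    Hypothetical helper: assumes the class has a no-argument constructor
    and a getUdf method, like com.example.MyUdf in the answer.
    """
    # Use the context class loader, as the answer does, to look up the class
    loader = sc._jvm.java.lang.Thread.currentThread().getContextClassLoader()
    return loader.loadClass(class_name).newInstance().getUdf().apply


def java_udf_column(sc, class_name, *col_names):
    """Build a PySpark Column that applies the JVM UDF to the named columns.

    Hypothetical helper: pass no column names for a zero-argument UDF.
    """
    # Private PySpark helpers, imported lazily as in the answer; this import
    # path is an assumption and may not hold for every PySpark version.
    from pyspark.sql.column import Column, _to_java_column, _to_seq

    apply_udf = load_udf_apply(sc, class_name)
    # Convert the column names to a JVM sequence of columns and call apply
    return Column(apply_udf(_to_seq(sc, list(col_names), _to_java_column)))
```

    With such a helper, the zero-argument case would read df1.withColumn('mycol', java_udf_column(sc, "com.example.MyUdf")), and a two-column call would read java_udf_column(sc, "com.example.MyUdf", "col1", "col2"), provided getUdf wraps a two-argument function.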
