Implement a Java UDF and call it from PySpark

时光取名叫无心 2021-02-15 10:27

I need to create a UDF to be used in PySpark that uses a Java object for its internal calculations.

If it were plain Python, I would do something like this:

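A minimal sketch of that pure-Python pattern (addNumber is an illustrative name):

    from pyspark.sql.types import LongType

    # An ordinary Python function, registered directly as a UDF.
    def add_number(num):
        return num + 5

    spark.udf.register("addNumber", add_number, LongType())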
2 Answers
  • 2021-02-15 10:51

    In line with https://dzone.com/articles/pyspark-java-udf-integration-1, you could define a UDF1 in Java:

    package com.example.spark;

    import org.apache.spark.sql.api.java.UDF1;

    public class AddNumber implements UDF1<Long, Long> {

        @Override
        public Long call(Long num) throws Exception {
            return num + 5;
        }
    }
    
    

    And then, after adding the jar to your PySpark session with --jars <your-jar> (or --packages if it is published as a Maven coordinate), you can use it in PySpark as:

    from pyspark.sql import functions as F
    from pyspark.sql.types import FloatType, LongType

    >>> df = spark.createDataFrame([float(i) for i in range(100)], FloatType()).toDF("a")
    >>> spark.udf.registerJavaFunction("addNumber", "com.example.spark.AddNumber", LongType())
    >>> df.withColumn("b", F.expr("addNumber(a)")).show(5)
    +---+---+
    |  a|  b|
    +---+---+
    |0.0|  5|
    |1.0|  6|
    |2.0|  7|
    |3.0|  8|
    |4.0|  9|
    +---+---+
    only showing top 5 rows
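    If you build the SparkSession yourself instead of passing a flag to the shell, the jar can also be attached through configuration. A minimal sketch, assuming an illustrative jar path:

    from pyspark.sql import SparkSession
    from pyspark.sql.types import LongType

    # spark.jars ships the listed jars to the driver and executors.
    spark = SparkSession.builder \
        .appName("java-udf-demo") \
        .config("spark.jars", "/path/to/your-udfs.jar") \
        .getOrCreate()

    spark.udf.registerJavaFunction("addNumber", "com.example.spark.AddNumber", LongType())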
    
  • 2021-02-15 11:06

    I got this working with the help of another question (and answer) of your own about UDAFs.

    Spark provides a udf() function for wrapping a Scala FunctionN, so we can wrap the Java method in a Scala function and use that from PySpark. The Java method needs to be static, or live on a class that implements Serializable.

    package com.example

    import org.apache.spark.sql.expressions.UserDefinedFunction
    import org.apache.spark.sql.functions.udf

    class MyUdf extends Serializable {
      // MyJavaClass.MyJavaMethod stands in for your own static Java method.
      def getUdf: UserDefinedFunction = udf(() => MyJavaClass.MyJavaMethod())
    }
    

    Usage in PySpark:

    def my_udf():
        from pyspark.sql.column import Column, _to_java_column, _to_seq
        pcls = "com.example.MyUdf"
        # Load the Scala wrapper via the JVM gateway, instantiate it, and
        # take the apply method of the UserDefinedFunction it returns.
        jc = sc._jvm.java.lang.Thread.currentThread() \
            .getContextClassLoader().loadClass(pcls).newInstance().getUdf().apply
        # Zero-argument UDF, so the input column sequence is empty.
        return Column(jc(_to_seq(sc, [], _to_java_column)))

    rdd1 = sc.parallelize([{'c1': 'a'}, {'c1': 'b'}, {'c1': 'c'}])
    df1 = rdd1.toDF()
    df2 = df1.withColumn('mycol', my_udf())
    

    As with the UDAF in your other question and answer, we can pass columns into it with return Column(jc(_to_seq(sc, ["col1", "col2"], _to_java_column))); for example:
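    A sketch of that column-taking variant, assuming the wrapped Scala/Java function has been changed to accept two arguments (all names illustrative):

    def my_udf_cols(col1, col2):
        from pyspark.sql.column import Column, _to_java_column, _to_seq
        pcls = "com.example.MyUdf"
        jc = sc._jvm.java.lang.Thread.currentThread() \
            .getContextClassLoader().loadClass(pcls).newInstance().getUdf().apply
        # Forward the named input columns to the JVM-side UDF.
        return Column(jc(_to_seq(sc, [col1, col2], _to_java_column)))

    df3 = df1.withColumn('mycol2', my_udf_cols('c1', 'c1'))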
