Implement a Java UDF and call it from PySpark

时光取名叫无心 2021-02-15 10:27

I need to create a UDF to be used from PySpark (Python), where the UDF uses a Java object for its internal calculations.

If it were plain Python, I would do something like:

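A minimal illustrative sketch (names like add_number are hypothetical; assume an existing DataFrame df with a numeric column a):

    from pyspark.sql import functions as F
    from pyspark.sql.types import LongType

    # plain-Python logic wrapped as a UDF; add_number is a made-up example
    def add_number(num):
        return num + 5

    add_number_udf = F.udf(add_number, LongType())
    df = df.withColumn("b", add_number_udf("a"))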
2 answers
  •  猫巷女王i
    2021-02-15 10:51

    In line with https://dzone.com/articles/pyspark-java-udf-integration-1, you can define a UDF1 in Java like this:

    package com.example.spark;

    import org.apache.spark.sql.api.java.UDF1;

    public class AddNumber implements UDF1<Long, Long> {

        @Override
        public Long call(Long num) throws Exception {
            return num + 5;
        }
    }
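
    UDF1 is the one-argument variant of Spark's Java UDF interfaces; UDF2, UDF3, and so on exist for functions taking more arguments.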
    
    

    Then compile the class, package it into a jar, and make the jar available to PySpark, e.g. by launching with --jars (for a local jar) or --packages (for Maven coordinates).
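
    Alternatively, a minimal sketch of attaching the jar when building the session in Python (the jar path is a placeholder, not from the original post):

    from pyspark.sql import SparkSession

    # "spark.jars" takes a comma-separated list of local jar paths;
    # replace /path/to/add-number.jar with wherever the compiled UDF jar lives
    spark = (SparkSession.builder
             .config("spark.jars", "/path/to/add-number.jar")
             .getOrCreate())

    With the jar on the classpath, you can register and call the Java function from PySpark: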

    from pyspark.sql import functions as F
    from pyspark.sql.types import LongType, FloatType
    
    
    >>> df = spark.createDataFrame([float(i) for i in range(100)], FloatType()).toDF("a")
    >>> spark.udf.registerJavaFunction("addNumber", "com.example.spark.AddNumber", LongType())
    >>> df.withColumn("b", F.expr("addNumber(a)")).show(5)
    +---+---+
    |  a|  b|
    +---+---+
    |0.0|  5|
    |1.0|  6|
    |2.0|  7|
    |3.0|  8|
    |4.0|  9|
    +---+---+
    only showing top 5 rows
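
    If Spark cannot infer the Java function's input type (for example, when the class implements the raw UDF1 interface without generic type parameters), an explicit cast from the float column to the bigint the Java method expects may help; this line is an illustration, not from the original answer:

    >>> df.withColumn("b", F.expr("addNumber(CAST(a AS BIGINT))")).show(5)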
    
