I'm currently working with Spark 2.1 and have a main script that calls a helper module containing all my transformation methods. In other words:
main.py
helper.py
Prior to Spark 2.2.0, UserDefinedFunction eagerly creates a UserDefinedPythonFunction object, which represents the Python UDF on the JVM. This process requires access to a SparkContext and SparkSession. If there are no active instances when UserDefinedFunction.__init__ is called, Spark will automatically initialize the contexts for you.
When you call SparkSession.builder.getOrCreate after importing the UserDefinedFunction object, it returns the existing SparkSession instance, and only some configuration changes can be applied (enableHiveSupport is not among them).
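To see why the flag is silently lost, here is a minimal pure-Python sketch of the getOrCreate pattern. It is an analogy, not the real pyspark API: a module-level singleton is created with default configuration the first time anything asks for it, and later configuration requests cannot change the already-existing instance.

```python
# Hypothetical analogy for Spark's getOrCreate behaviour (not real pyspark code).
_active_session = None

def get_or_create(hive_support=False):
    """Return the active session, creating one with the given config only if
    none exists yet; config flags passed later are silently ignored."""
    global _active_session
    if _active_session is None:
        _active_session = {"hive_support": hive_support}
    return _active_session

# Importing a pre-2.2.0 helper module that defines a UDF at module level
# effectively does this first: the UDF machinery grabs (or creates) a
# session with default configuration.
session_made_by_import = get_or_create()

# The main script now asks for Hive support, but a session already exists,
# so the flag is ignored -- mirroring why enableHiveSupport() has no effect.
session_in_main = get_or_create(hive_support=True)

print(session_in_main is session_made_by_import)   # True: same instance
print(session_in_main["hive_support"])             # False: the early default won
```

The same instance comes back both times, which is exactly why the import order in the fix below matters: whoever calls getOrCreate first decides the configuration.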
To address this problem, initialize the SparkSession before you import the UDF:
from pyspark.sql.session import SparkSession
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
from helper import reformat_udf
This behavior is described in SPARK-19163 and fixed in Spark 2.2.0. Other API improvements include decorator syntax (SPARK-19160) and improved docstring handling (SPARK-19161).