I have a timestamp dataset which is in the format of
I have written a UDF in PySpark to process this dataset and return it as a map of key-value pairs, but I am getting the error below
The error message says that on line 27 of your UDF you are calling some pyspark.sql function; it is the line with abs(). So I suppose that somewhere above you call
from pyspark.sql.functions import *
and it overrides Python's builtin abs() function.
Just to be clear, the problem a lot of people are having stems from a single bad programming style:
from blah import *
When you do
from pyspark.sql.functions import *
you overwrite a lot of Python builtin functions. I strongly recommend importing the module under an alias instead, for example:
import pyspark.sql.functions as f
# or
import pyspark.sql.functions as pyf
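To see the shadowing concretely, here is a minimal sketch (the values are just illustrative):

import builtins
import pyspark.sql.functions as pyf

# After `from pyspark.sql.functions import *`, names like abs, min, max and sum
# become the Spark Column versions, so abs(-3) would raise a TypeError, because
# pyspark.sql.functions.abs expects a Column or a column name, not an int.
print(abs(-3))           # 3, still the Python builtin, since nothing shadowed it
print(builtins.abs(-3))  # 3, the builtin stays reachable even after a star import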
Make sure that you are initializing the SparkSession before running anything that depends on it. For example:
from pyspark.sql import SparkSession, SQLContext

spark = SparkSession \
    .builder \
    .appName("myApp") \
    .config("...") \
    .getOrCreate()
sqlContext = SQLContext(spark.sparkContext)  # SQLContext wraps the SparkContext
productData = sqlContext.read.format("com.mongodb.spark.sql").load()
Or, as in this CSV example:
from pyspark.sql import SparkSession, SQLContext

spark = SparkSession.builder.appName('company').getOrCreate()
sqlContext = SQLContext(spark.sparkContext)
productData = sqlContext.read.format("csv").option("delimiter", ",") \
    .option("quote", "\"").option("escape", "\"") \
    .option("header", "true").option("inferSchema", "true") \
    .load("/path/thecsv.csv")
Mariusz's answer didn't really help me. So if, like me, you found this because it's the only result on Google and you're new to PySpark (and Spark in general), here's what worked for me.
In my case I was getting that error because I was trying to execute PySpark code before the PySpark environment had been set up. Making sure that PySpark was available and set up before making calls that depend on pyspark.sql.functions fixed the issue for me.
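For instance, when running a plain Python script (rather than spark-submit or a preconfigured notebook), one common way to set the environment up first is findspark; this is my assumption about a typical setup, so adapt it to your deployment:

import findspark
findspark.init()  # locates SPARK_HOME and puts pyspark on sys.path

# only import and use pyspark after the environment is in place
from pyspark.sql import SparkSession
import pyspark.sql.functions as f

spark = SparkSession.builder.appName("myApp").getOrCreate()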
This exception also arises when the UDF cannot handle None values. For example, the following code results in the same exception (note that to_timestamp here has to be a plain Python helper; the pyspark.sql.functions version returns a Column and does not work inside a UDF):
from pyspark.sql.functions import udf
from pyspark.sql.types import DateType
get_datetime = udf(lambda ts: to_timestamp(ts), DateType())
df = df.withColumn("datetime", get_datetime("ts"))
However, this one does not:
get_datetime = udf(lambda ts: to_timestamp(ts) if ts is not None else None, DateType())
df = df.withColumn("datetime", get_datetime("ts"))
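Here is a self-contained sketch of the difference, assuming to_timestamp is a small Python helper and the timestamps are ISO date strings (both assumptions made just for this example):

from datetime import datetime
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import DateType

spark = SparkSession.builder.appName("noneSafeUdf").getOrCreate()

def to_timestamp(ts):
    # hypothetical helper: parse an ISO date string into a date
    return datetime.strptime(ts, "%Y-%m-%d").date()

# the None-safe wrapper passes nulls through instead of crashing in strptime
get_datetime = udf(lambda ts: to_timestamp(ts) if ts is not None else None, DateType())

df = spark.createDataFrame([("2021-06-01",), (None,)], ["ts"])
df.withColumn("datetime", get_datetime("ts")).show()
# the None row comes through as null instead of raising an exception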