I have a UDF:
    def create_users_array(val):
        """ Takes column of ints, returns column of arrays containing ints. """
        return [val for _ in range(val)]
I call it like so:
    df.withColumn("myArray", create_users_array(df["myNumber"]))
I pass it a DataFrame column of integers, and it returns a column of arrays, each holding the integer repeated that many times:
4 --> [4,4,4,4]
It was working until we upgraded from Python 2.7 and moved to a newer EMR version (which I believe uses PySpark 2.3).
Does anyone know what is causing this?
It looks like this had something to do with the improvements made to UDFs in the newer version (or rather, the deprecation of the old syntax). Changing the return type passed to the udf decorator worked for me:
    @F.udf("array<int>") --> @F.udf(ArrayType(IntegerType()))
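For anyone landing here later, this is a minimal sketch of how the whole thing might look with the return type given as an ArrayType object. The names repeat_value and make_users_array_udf are mine, not from the original code, and the None check for null rows is an added assumption. I have only sketched this, not tested it against every EMR release.

```python
def repeat_value(val):
    """Return a list holding `val` repeated `val` times, e.g. 4 -> [4, 4, 4, 4]."""
    # Assumption: null rows should stay null instead of raising a TypeError.
    if val is None:
        return None
    return [val for _ in range(val)]


def make_users_array_udf():
    """Wrap repeat_value as a Spark UDF that returns array<int>."""
    # pyspark is imported here so the plain helper above can be used
    # (and unit tested) without a Spark installation.
    from pyspark.sql import functions as F
    from pyspark.sql.types import ArrayType, IntegerType

    # Pass the return type as a DataType object rather than a string.
    return F.udf(repeat_value, ArrayType(IntegerType()))


# Usage, assuming an existing DataFrame `df` with an integer column "myNumber":
#   create_users_array = make_users_array_udf()
#   df = df.withColumn("myArray", create_users_array(df["myNumber"]))
```

Keeping the plain Python function separate from the UDF wrapper also makes it easy to check the list-building logic on its own before involving Spark.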