Question
I want to have a UUID column in a pyspark dataframe that is calculated only once, so that I can select the column in a different dataframe and have the UUIDs be the same. However, the UDF for the UUID column is recalculated when I select the column.
Here's what I'm trying to do:
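>>> import uuid
>>> from pyspark.sql.functions import udf
>>> from pyspark.sql.types import StringType
>>> uuid_udf = udf(lambda: str(uuid.uuid4()), StringType())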
>>> a = spark.createDataFrame([[1, 2]], ['col1', 'col2'])
>>> a = a.withColumn('id', uuid_udf())
>>> a.collect()
[Row(col1=1, col2=2, id='5ac8f818-e2d8-4c50-bae2-0ced7d72ef4f')]
>>> b = a.select('id')
>>> b.collect()
[Row(id='12ec9913-21e1-47bd-9c59-6ddbe2365247')] # Wanted this to be the same ID as above
Possible workaround: rand()
A possible workaround might be to use pyspark.sql.functions.rand() as my source of randomness. However, there are two problems:
1) I'd like to have letters, not just numbers, in the UUID, so that it doesn't need to be quite as long
2) Though it technically works, it produces ugly UUIDs:
>>> from pyspark.sql.functions import rand, round
>>> a = a.withColumn('id', round(rand() * 10e16))
>>> a.collect()
[Row(col1=1, col2=2, id=7.34745165108606e+16)]
Answer 1:
Use Spark's built-in uuid function instead:
>>> from pyspark.sql.functions import expr
>>> a = a.withColumn('id', expr("uuid()"))
>>> b = a.select('id')
>>> b.collect()
[Row(id='da301bea-4927-4b6b-a1cf-518dea8705c4')]
>>> a.collect()
[Row(col1=1, col2=2, id='da301bea-4927-4b6b-a1cf-518dea8705c4')]
Answer 2:
Your UUID keeps changing because the dataframe is recomputed from scratch for each action, so the UDF runs again every time.
To stabilize your result, you can use persist or cache (depending on the size of your dataframe).
df.persist()
df.show()
+---+--------------------+
| id|                uuid|
+---+--------------------+
|  0|e3db115b-6b6a-424...|
+---+--------------------+
b = df.select("uuid")
b.show()
+--------------------+
|                uuid|
+--------------------+
|e3db115b-6b6a-424...|
+--------------------+
Source: https://stackoverflow.com/questions/59843216/calculate-udf-once