I use Spark to perform data transformations whose results I load into Redshift. Redshift does not support NaN values, so I need to replace every occurrence of NaN with NULL.
Here is a minimal example that reproduces the situation, followed by the fix:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(1, float('nan')), (None, 1.0)], ("a", "b"))
df.show()
+----+---+
| a| b|
+----+---+
| 1|NaN|
|null|1.0|
+----+---+
# replace every NaN with null; only floating-point columns can contain NaN
df = df.replace(float('nan'), None)
df.show()
+----+----+
| a| b|
+----+----+
| 1|null|
|null| 1.0|
+----+----+
You can use the .replace function to change NaN values to null in a single line of code.
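If you only want to touch certain columns, replace also accepts a subset argument. Below is a sketch of that, plus an equivalent column-by-column version built with when and isnan, assuming (as in the example above) that only float/double columns can hold NaN; the variable names are just illustrative:

from pyspark.sql import functions as F

# restrict the replacement to column "b" only
df_b_only = df.replace(float('nan'), None, subset=['b'])

# equivalent, more explicit version: rewrite each float/double column,
# turning NaN into null and leaving every other column untouched
float_cols = [c for c, t in df.dtypes if t in ('float', 'double')]
df_clean = df.select([
    F.when(F.isnan(F.col(c)), None).otherwise(F.col(c)).alias(c)
    if c in float_cols else F.col(c)
    for c in df.columns
])
df_clean.show()

The subset route is the simplest when you already know which columns can contain NaN; the when/isnan version gives you explicit control over which columns get rewritten.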