PySpark: replace NaN with NULL

Submitted by 前提是你 on 2019-12-30 11:31:28

Question


I use Spark to perform data transformations, and I load the results into Redshift. Redshift does not support NaN values, so I need to replace all occurrences of NaN with NULL.

I tried something like this:

some_table = sql('SELECT * FROM some_table')
some_table = some_table.na.fill(None)

But I got the following error:

ValueError: value should be a float, int, long, string, bool or dict

So it seems that na.fill() doesn't support None. I specifically need to replace NaN with NULL, not with some other value such as 0.


Answer 1:


df = spark.createDataFrame([(1, float('nan')), (None, 1.0)], ("a", "b"))
df.show()

+----+---+
|   a|  b|
+----+---+
|   1|NaN|
|null|1.0|
+----+---+

df = df.replace(float('nan'), None)
df.show()

+----+----+
|   a|   b|
+----+----+
|   1|null|
|null| 1.0|
+----+----+

You can use the .replace function to change NaN values to null in a single line of code.
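
If only certain columns can legitimately contain NaN, the same call accepts a subset parameter to limit the replacement. A minimal sketch, assuming "b" is the only floating-point column in the example DataFrame above:

# Replace NaN with null only in column "b"; column "a" is left untouched.
df = df.replace(float('nan'), None, subset=["b"])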




Answer 2:


I finally found the answer after Googling around a bit.

df = spark.createDataFrame([(1, float('nan')), (None, 1.0)], ("a", "b"))
df.show()

+----+---+
|   a|  b|
+----+---+
|   1|NaN|
|null|1.0|
+----+---+

import pyspark.sql.functions as F

# Rewrite every column, mapping NaN to null and leaving all other values as-is.
for column in df.columns:
    df = df.withColumn(column, F.when(F.isnan(F.col(column)), None).otherwise(F.col(column)))

# Register the result as a temporary table so it can be queried with SQL.
sqlContext.registerDataFrameAsTable(df, "df2")
sql('select * from df2').show()

+----+----+
|   a|   b|
+----+----+
|   1|null|
|null| 1.0|
+----+----+

It doesn't use na.fill(), but it accomplishes the same result, so I'm happy.
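
Two refinements, offered as a sketch rather than a drop-in replacement: F.isnan is only meaningful on float/double columns, so you can restrict the loop to those and avoid casting non-numeric columns; and on Spark 2.x+ the SQLContext call can be replaced with createOrReplaceTempView. Both assume the same df as above.

import pyspark.sql.functions as F

# Only rewrite columns whose type can actually hold NaN (float/double).
float_cols = [name for name, dtype in df.dtypes if dtype in ("float", "double")]
for column in float_cols:
    df = df.withColumn(column, F.when(F.isnan(F.col(column)), None).otherwise(F.col(column)))

# Spark 2.x+ equivalent of sqlContext.registerDataFrameAsTable:
df.createOrReplaceTempView("df2")
spark.sql('select * from df2').show()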



Source: https://stackoverflow.com/questions/50992713/pyspark-replace-nan-with-null
