Pyspark replace NaN with NULL

时光取名叫无心 2021-01-05 08:30

I use Spark to perform data transformations that I load into Redshift. Redshift does not support NaN values, so I need to replace all occurrences of NaN with NULL.

2 Answers
  • 2021-01-05 09:19
    df = spark.createDataFrame([(1, float('nan')), (None, 1.0)], ("a", "b"))
    df.show()
    
    +----+---+
    |   a|  b|
    +----+---+
    |   1|NaN|
    |null|1.0|
    +----+---+
    
    df = df.replace(float('nan'), None)
    df.show()
    
    +----+----+
    |   a|   b|
    +----+----+
    |   1|null|
    |null| 1.0|
    +----+----+
    

    You can use the .replace function to swap NaN for null in a single line of code.
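
    If only some columns need the substitution, DataFrame.replace also takes a subset parameter; a minimal sketch, assuming the same df as above:

    df = df.replace(float('nan'), None, subset=['b'])  # scan only column "b"
    df.show()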

  • 2021-01-05 09:29

    I finally found the answer after Googling around a bit.

    df = spark.createDataFrame([(1, float('nan')), (None, 1.0)], ("a", "b"))
    df.show()
    
    +----+---+
    |   a|  b|
    +----+---+
    |   1|NaN|
    |null|1.0|
    +----+---+
    
    import pyspark.sql.functions as F

    # rewrite each column, turning NaN into null and leaving every other value intact
    for column in df.columns:
        df = df.withColumn(column, F.when(F.isnan(F.col(column)), None).otherwise(F.col(column)))

    # register the result as a temp view and query it back through Spark SQL
    df.createOrReplaceTempView("df2")
    spark.sql('select * from df2').show()
    
    +----+----+
    |   a|   b|
    +----+----+
    |   1|null|
    |null| 1.0|
    +----+----+
    

    It doesn't use na.fill(), but it accomplishes the same result, so I'm happy.
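
    If you'd rather skip the withColumn loop on a wide table, the same idea can be written as a single select; a minimal sketch under the same setup (fixed is just an illustrative name), with a dtype guard so isnan is only applied where NaN can actually occur:

    # rebuild every column in one pass; non-float columns pass through unchanged
    fixed = df.select([
        F.when(F.isnan(c), None).otherwise(F.col(c)).alias(c)
        if t in ('float', 'double') else F.col(c)
        for c, t in df.dtypes
    ])
    fixed.show()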
