Question
Spark 1.6.1, Scala API.
For a dataframe, I need to replace all null values in a certain column with 0. I have two ways to do this. 1.
myDF.withColumn("pipConfidence", when($"mycol".isNull, 0).otherwise($"mycol"))
2.
myDF.na.fill(0, Seq("mycol"))
Are they essentially the same, or is one way preferred?
Thank you!
Answer 1:
They are not the same, but performance should be similar. na.fill uses coalesce, but it replaces both NaNs and NULLs, not only NULLs.
// x = 0 -> 0.0, x = 1 -> null, x = 2 -> NaN ("NaN" cast to double is Double.NaN)
val y = when($"x" === 0, $"x".cast("double")).when($"x" === 1, lit(null)).otherwise(lit("NaN").cast("double"))
val df = spark.range(0, 3).toDF("x").withColumn("y", y)
// Replaces only the NULL row; the NaN row is left untouched
// (note: Column.isNull is a parameterless method, so no parentheses):
df.withColumn("y", when($"y".isNull, 0.0).otherwise($"y")).show()
// Replaces both the NULL and the NaN:
df.na.fill(0.0, Seq("y")).show()
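To make the difference concrete, here is a self-contained sketch. It assumes a local Spark 2.x+ SparkSession (on Spark 1.6.1 you would build the DataFrame through sqlContext instead); the object name NullVsNaNFill is just for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object NullVsNaNFill {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("fill-demo").getOrCreate()
    import spark.implicits._

    // x = 0 -> y = 0.0, x = 1 -> y = null, x = 2 -> y = NaN
    val y = when($"x" === 0, $"x".cast("double"))
      .when($"x" === 1, lit(null))
      .otherwise(lit(Double.NaN))
    val df = spark.range(0, 3).toDF("x").withColumn("y", y)

    // Option 1: only the NULL row is replaced; the NaN row survives.
    val viaWhen = df.withColumn("y", when($"y".isNull, 0.0).otherwise($"y"))
    // Option 2: na.fill replaces both the NULL and the NaN with 0.0.
    val viaFill = df.na.fill(0.0, Seq("y"))

    // The NaN is still present after option 1, but gone after option 2.
    assert(viaWhen.collect().map(_.getDouble(1)).exists(_.isNaN))
    assert(!viaFill.collect().map(_.getDouble(1)).exists(_.isNaN))

    spark.stop()
  }
}
```

Running `viaFill.explain()` shows the coalesce-based expression that na.fill generates, which is why the answer expects the two approaches to perform similarly.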
Source: https://stackoverflow.com/questions/40247045/spark-replacing-null-with-0-performance-comparison