I am using Spark SQL 2.3.1 and have the below scenario:
Given a dataset:
val ds = Seq(
  (1, "x1", "y1", "0.1992019"),
  (2, null, "y2", "2
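(The definition above is cut off. A dataset along these lines is consistent with the outputs shown in the answers below; row 2's value is an assumption, since only the positions of the nulls matter here.)
// minimal sketch, assuming spark.implicits._ is in scope (it is in spark-shell);
// row 2's value is assumed
val ds = Seq(
  (1, "x1", "y1", "0.1992019"),
  (2, null, "y2", "2.2"),
  (3, "x3", null, "15.34567"),
  (5, "x4", "y4", "0")
).toDF("id", "col_x", "col_y", "value")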
You can also use filter to filter out rows with null values.
scala> val operationCol = "col_x" // for one column
operationCol: String = col_x
scala> ds.filter(col(operationCol).isNotNull).show(false)
+---+-----+-----+---------+
|id |col_x|col_y|value |
+---+-----+-----+---------+
|1 |x1 |y1 |0.1992019|
|3 |x3 |null |15.34567 |
|5 |x4 |y4 |0 |
+---+-----+-----+---------+
scala> val operationCol = Seq("col_x","col_y") // For multiple Columns
operationCol: Seq[String] = List(col_x, col_y)
scala> ds.filter(operationCol.map(col(_).isNotNull).reduce(_ && _)).show
+---+-----+-----+---------+
| id|col_x|col_y| value|
+---+-----+-----+---------+
| 1| x1| y1|0.1992019|
| 5| x4| y4| 0|
+---+-----+-----+---------+
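One caveat with the reduce(_ && _) approach: it throws if the column list is empty. If that can happen, a fold over a literal true is a safe variant (a sketch, not from the original answer):
import org.apache.spark.sql.functions.{col, lit}
val cond = operationCol.foldLeft(lit(true))((acc, c) => acc && col(c).isNotNull)
ds.filter(cond).show()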
You can use df.na.drop() to drop rows that contain nulls. The drop function takes a Seq of the column names to consider, so for a single column name you can write it as follows:
val newDf = df.na.drop(Seq(operationCol))
This will create a new dataframe newDf in which every row with a null in operationCol has been removed.
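If operationCol is the Seq of columns from the second example above, pass it directly instead of wrapping it again; na.drop also accepts a "how" argument to control whether any or all of the listed columns must be null. A short sketch of both forms:
val droppedAny = ds.na.drop(operationCol)        // drop rows with a null in ANY of the listed columns
val droppedAll = ds.na.drop("all", operationCol) // drop rows only when ALL listed columns are null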