Question
I am using Spark SQL 2.3.1 and have the following scenario:
Given a dataset:
val ds = Seq(
  (1, "x1", "y1", "0.1992019"),
  (2, null, "y2", "2.2500000"),
  (3, "x3", null, "15.34567"),
  (4, null, "y4", null),
  (5, "x4", "y4", "0")
).toDF("id", "col_x", "col_y", "value")
i.e.
+---+-----+-----+---------+
| id|col_x|col_y| value|
+---+-----+-----+---------+
| 1| x1| y1|0.1992019|
| 2| null| y2|2.2500000|
| 3| x3| null| 15.34567|
| 4| null| y4| null|
| 5| x4| y4| 0|
+---+-----+-----+---------+
Requirement:
I receive an operational column name (operationCol) from outside, on which I need to perform some calculation.
When the operation targets column "col_x", I need to create a new dataset by filtering out all records that have a null value in "col_x" and return that new dataset.
Likewise, when the operation targets column "col_y", I need to filter out all records that have a null value in "col_y" and return the resulting dataset.
Example:
val operationCol = "col_x"
if (operationCol == "col_x") {
  // filter out all rows where "col_x" is null and return that new dataset
}
if (operationCol == "col_y") {
  // filter out all rows where "col_y" is null and return that new dataset
}
When operationCol == "col_x", expected output:
+---+-----+-----+---------+
| id|col_x|col_y| value|
+---+-----+-----+---------+
| 1| x1| y1|0.1992019|
| 3| x3| null| 15.34567|
| 5| x4| y4| 0|
+---+-----+-----+---------+
When operationCol == "col_y", expected output:
+---+-----+-----+---------+
| id|col_x|col_y| value|
+---+-----+-----+---------+
| 1| x1| y1|0.1992019|
| 2| null| y2|2.2500000|
| 4| null| y4| null|
| 5| x4| y4| 0|
+---+-----+-----+---------+
How can I achieve this expected output? In other words, how can a dataframe be branched, i.e., how do I create a new dataframe/dataset in the middle of the flow?
Answer 1:
You can use df.na.drop() to drop rows that contain nulls. The drop function takes a list of the columns you want to consider, so in this case you can write:
val newDf = df.na.drop(Seq(operationCol))
This creates a new dataframe newDf in which all rows with a null in operationCol have been removed.
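The branching from the question then collapses into a single helper; a minimal sketch (the function name getNonNullDs is illustrative, not from the original post):

```scala
import org.apache.spark.sql.DataFrame

// Returns a new DataFrame with the rows dropped where the given
// operational column is null; nulls in other columns are kept.
def getNonNullDs(df: DataFrame, operationCol: String): DataFrame =
  df.na.drop(Seq(operationCol))

val dsX = getNonNullDs(ds, "col_x") // keeps ids 1, 3, 5
val dsY = getNonNullDs(ds, "col_y") // keeps ids 1, 2, 4, 5
```

Because the column name is an ordinary String parameter, no if/else branching is needed at all.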
Answer 2:
You can also use filter with isNotNull to drop the null values.
scala> val operationCol = "col_x" // for one column
operationCol: String = col_x
scala> ds.filter(col(operationCol).isNotNull).show(false)
+---+-----+-----+---------+
|id |col_x|col_y|value |
+---+-----+-----+---------+
|1 |x1 |y1 |0.1992019|
|3 |x3 |null |15.34567 |
|5 |x4 |y4 |0 |
+---+-----+-----+---------+
scala> val operationCol = Seq("col_x","col_y") // For multiple Columns
operationCol: Seq[String] = List(col_x, col_y)
scala> ds.filter(operationCol.map(col(_).isNotNull).reduce(_ && _)).show
+---+-----+-----+---------+
| id|col_x|col_y| value|
+---+-----+-----+---------+
| 1| x1| y1|0.1992019|
| 5| x4| y4| 0|
+---+-----+-----+---------+
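For the multi-column case, this filter is equivalent to na.drop over the same column list, since na.drop (with its default "any" mode) removes a row if any of the listed columns is null; a quick sketch assuming the ds from the question:

```scala
import org.apache.spark.sql.functions.col

// Both lines keep only the rows where every listed column is non-null
// (ids 1 and 5 for the sample data).
ds.na.drop(Seq("col_x", "col_y")).show()
ds.filter(col("col_x").isNotNull && col("col_y").isNotNull).show()
```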
Source: https://stackoverflow.com/questions/61966806/create-a-new-dataset-based-given-operation-column