Create a new dataset based on a given operation column

梦毁少年i 2020-12-02 03:30

I am using spark-sql 2.3.1 and have the following scenario:

Given a dataset:

val ds = Seq(
  (1, "x1", "y1", "0.1992019"),
  (2, null, "y2", "2.2"),       // the value literal was cut off in the original post; "2.2" is a placeholder
  (3, "x3", null, "15.34567"),  // rows 3 and 5 reconstructed from the output shown in the answers below
  (5, "x4", "y4", "0")
).toDF("id", "col_x", "col_y", "value")

Given an operation column name that is decided at runtime (e.g. "col_x"), how can I create a new dataset that keeps only the rows where that column is not null?

2 Answers
  • 2020-12-02 03:40

    You can also use filter with isNotNull to filter out the rows that contain null values.

    
    scala> import org.apache.spark.sql.functions.col
    import org.apache.spark.sql.functions.col
    
    scala> val operationCol = "col_x" // for a single column
    operationCol: String = col_x
    
    scala> ds.filter(col(operationCol).isNotNull).show(false)
    +---+-----+-----+---------+
    |id |col_x|col_y|value    |
    +---+-----+-----+---------+
    |1  |x1   |y1   |0.1992019|
    |3  |x3   |null |15.34567 |
    |5  |x4   |y4   |0        |
    +---+-----+-----+---------+
    
    
    scala> val operationCol = Seq("col_x","col_y") // for multiple columns
    operationCol: Seq[String] = List(col_x, col_y)
    
    scala> ds.filter(operationCol.map(col(_).isNotNull).reduce(_ && _)).show
    +---+-----+-----+---------+
    | id|col_x|col_y|    value|
    +---+-----+-----+---------+
    |  1|   x1|   y1|0.1992019|
    |  5|   x4|   y4|        0|
    +---+-----+-----+---------+
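
    If the operation columns are decided at runtime, the same predicate can be wrapped in a small reusable helper. A minimal sketch (the name dropNullRows is mine, not from Spark, and it assumes at least one column name is passed):

    import org.apache.spark.sql.{Column, DataFrame}
    import org.apache.spark.sql.functions.col

    // Keep only the rows where every given column is non-null.
    def dropNullRows(df: DataFrame, operationCols: Seq[String]): DataFrame = {
      val keep: Column = operationCols
        .map(col(_).isNotNull) // one predicate per column
        .reduce(_ && _)        // AND them together; requires a non-empty list
      df.filter(keep)
    }

    // usage: dropNullRows(ds, Seq("col_x")) or dropNullRows(ds, Seq("col_x", "col_y"))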
    
    
  • 2020-12-02 03:44

    You can use df.na.drop() to drop rows that contain nulls. The drop function can take a list of the columns you want to consider as input, so in this case you can write it as follows:

    val newDf = df.na.drop(Seq(operationCol))
    

    This will create a new dataframe newDf in which all rows with a null in operationCol have been removed.
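
    If you have several operation columns, na.drop also takes a "how" argument: "any" drops a row when at least one of the listed columns is null, "all" only when every listed column is null. For example, with the same column names as above:

    // drop rows where col_x OR col_y is null
    val strict = df.na.drop("any", Seq("col_x", "col_y"))

    // drop rows only where BOTH col_x AND col_y are null
    val lenient = df.na.drop("all", Seq("col_x", "col_y"))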
