How to return rows with Null values in pyspark dataframe?

梦谈多话 2021-01-20 00:19

I am trying to get the rows with null values from a PySpark dataframe. In pandas, I can achieve this using isnull() on the dataframe:

df = df[df.isnull().any(axis=1)]

How can I do the same with a PySpark dataframe?
2 Answers
  • 2021-01-20 00:48

    You can filter the rows with where, reduce and a generator expression. For example, given the following dataframe:

    df = sc.parallelize([
        (0.4, 0.3),
        (None, 0.11),
        (9.7, None), 
        (None, None)
    ]).toDF(["A", "B"])
    
    df.show()
    +----+----+
    |   A|   B|
    +----+----+
    | 0.4| 0.3|
    |null|0.11|
    | 9.7|null|
    |null|null|
    +----+----+
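    
    If you prefer to build the test data through a SparkSession rather than a SparkContext, an equivalent sketch (assuming a SparkSession named spark is already available) is:
    
    # same sample data, created via the SparkSession entry point
    df = spark.createDataFrame([
        (0.4, 0.3),
        (None, 0.11),
        (9.7, None),
        (None, None)
    ], ["A", "B"])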
    

    Filtering the rows that contain at least one null value can then be done with:

    import pyspark.sql.functions as f
    from functools import reduce
    
    # OR together an isNull() check for every column of the dataframe
    df.where(reduce(lambda x, y: x | y, (f.col(x).isNull() for x in df.columns))).show()
    

    Which gives:

    +----+----+
    |   A|   B|
    +----+----+
    |null|0.11|
    | 9.7|null|
    |null|null|
    +----+----+
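    
    Since where / filter also accepts a SQL expression string, a rough alternative (a sketch reusing the same df) is to build the condition as text:
    
    # build "A IS NULL OR B IS NULL" from the column names and pass it as a SQL expression
    cond = " OR ".join("{} IS NULL".format(c) for c in df.columns)
    df.where(cond).show()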
    

    In the condition you choose how the per-column checks are combined: use | (or) to keep rows where any column is null, & (and) to keep rows where all columns are null, and so on.
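    
    For example, to keep only the rows where every column is null (a sketch reusing the same df), swap | for &:
    
    # AND together the per-column null checks instead of OR-ing them
    df.where(reduce(lambda x, y: x & y, (f.col(c).isNull() for c in df.columns))).show()
    # with the sample data above this returns only the (null, null) row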

  • 2021-01-20 01:06

    This is how you can do it in Scala:

    import org.apache.spark.sql.functions._
    import spark.implicits._  // required for .toDF() on a Seq of case classes
    
    case class Test(id: Int, weight: Option[Int], age: Int, gender: Option[String])
    
    val df1 = Seq(Test(1, Some(100), 23, Some("Male")), Test(2, None, 25, None), Test(3, None, 33, Some("Female"))).toDF()
    
    // OR together an isNull check for every column; display() is a Databricks notebook
    // helper, so outside Databricks use df1.filter(...).show() instead
    display(df1.filter(df1.columns.map(c => col(c).isNull).reduce((a, b) => a || b)))
    