Spark 2.2 Scala DataFrame select from string array, catching errors

Asked by 执念已碎 on 2021-01-16 00:46

I'm new to Spark SQL/Scala and I'm struggling with a couple of seemingly simple tasks.

I'm trying to build some dynamic SQL from a Scala String Array: I want to select columns (with casts) from a DataFrame using expressions held in the array, and then separate the rows that fail a cast from the rows that succeed.

1 Answer

Answered by 执笔经年 on 2021-01-16 00:59

    You can just pass the array as variadic arguments using Scala's : _* ascription:

    // in spark-shell these implicits are already in scope
    import spark.implicits._

    val df = Seq(("a", "1", "c"), ("foo", "bar", "baz")).toDF("a", "b", "c")
    val typedCols = Array("a", "cast(b as int) b", "c")
    df.selectExpr(typedCols: _*).show
    
    +---+----+---+
    |  a|   b|  c|
    +---+----+---+
    |  a|   1|  c|
    |foo|null|baz|
    +---+----+---+
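
    If the expressions themselves must be assembled at runtime (the dynamic part of the question), you can build the string array first. A minimal sketch, where cols and castTypes are hypothetical inputs, not from the original answer:

    // Hypothetical inputs: raw column names and the casts you want applied
    val cols = Array("a", "b", "c")
    val castTypes = Map("b" -> "int")

    // Build "cast(b as int) b"-style expressions for the mapped columns,
    // passing the rest through unchanged
    val typedCols = cols.map { name =>
      castTypes.get(name) match {
        case Some(t) => s"cast($name as $t) $name"
        case None    => name
      }
    }
    df.selectExpr(typedCols: _*).show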
    

    but personally I prefer columns:

    // $"name" is the column interpolator provided by spark.implicits._
    val typedCols = Array($"a", $"b" cast "int", $"c")
    df.select(typedCols: _*).show
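
    The same runtime construction works with Columns; a sketch reusing the hypothetical cols and castTypes from above:

    import org.apache.spark.sql.functions.col

    val typedCols = cols.map { name =>
      castTypes.get(name) match {
        case Some(t) => col(name).cast(t).as(name)
        case None    => col(name)
      }
    }
    df.select(typedCols: _*).show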
    

    The asker followed up: How would I get a DataFrame result with all the good records that passed the typing, and throw all the bad records into some kind of error bucket?

    Data that failed to cast becomes NULL. To find the good records, use na.drop:

    // result has NULL wherever a cast failed (typedCols are the string expressions)
    val result = df.selectExpr(typedCols: _*)
    val good = result.na.drop()
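
    With the sample df above, only the row whose cast succeeded survives:

    good.show

    +---+---+---+
    |  a|  b|  c|
    +---+---+---+
    |  a|  1|  c|
    +---+---+---+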
    

    To find the bad records, check whether any column is NULL:

    import org.apache.spark.sql.functions.col

    // a row is bad if any of its columns ended up NULL after the casts
    val bad = result.where(result.columns.map(col(_).isNull).reduce(_ || _))
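
    On the sample data this is exactly the row whose cast produced NULL:

    bad.show

    +---+----+---+
    |  a|   b|  c|
    +---+----+---+
    |foo|null|baz|
    +---+----+---+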
    

    To get the unmatched source data, filter df directly (a combined helper is sketched after this list):

    • If typedCols is a Seq[Column], you can:

      df.where(typedCols.map(_.isNull).reduce(_ || _))  
      
    • If typedCols is a Seq[String], you can:

      import org.apache.spark.sql.functions.expr
      
      df.where(typedCols.map(expr(_).isNull).reduce(_ || _))  
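
    Putting it together, a minimal sketch of a reusable splitter; splitByCast is a hypothetical helper name, not part of the original answer, and caching result avoids computing the select twice:

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.col

    // Hypothetical helper: apply string cast expressions, then split rows into
    // (good, bad) depending on whether any column came out NULL
    def splitByCast(df: DataFrame, typedCols: Seq[String]): (DataFrame, DataFrame) = {
      val result = df.selectExpr(typedCols: _*).cache()
      val anyNull = result.columns.map(col(_).isNull).reduce(_ || _)
      (result.where(!anyNull), result.where(anyNull))
    }

    val (goodRows, badRows) = splitByCast(df, Seq("a", "cast(b as int) b", "c"))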
      
