How to select a subset of fields from an array column in Spark?


Let's say I have a DataFrame as follows:

import spark.implicits._ // assumes a SparkSession named spark

case class SubClass(id: String, size: Int, useless: String)
case class MotherClass(subClasss: Array[SubClass])

val df = Seq(MotherClass(Array(
  SubClass("1", 1, "foo"),
  SubClass("2", 2, "bar")
))).toDF  // example rows

How can I select only the id and size fields of each element of subClasss, dropping useless?
1 Answer
  • 2021-01-03 00:17

    Spark >= 2.4:

    It is possible to use arrays_zip with cast:

    import org.apache.spark.sql.functions.arrays_zip
    import spark.implicits._ // for the $"..." column syntax; assumes a SparkSession named spark

    df.select(arrays_zip(
      $"subClasss.id", $"subClasss.size"
    ).cast("array<struct<id:string,size:int>>"))
    

    The cast is required to rename the nested fields; without it, Spark uses automatically generated names (0, 1, ..., n).
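
    For instance, a quick schema check confirms the renaming. This is a minimal sketch, assuming the df from the question; the alias trimmed is an illustrative name, not part of the original answer:

    // Sketch only: assumes the example df and a SparkSession in scope.
    // "trimmed" is an illustrative alias.
    df.select(
      arrays_zip($"subClasss.id", $"subClasss.size")
        .cast("array<struct<id:string,size:int>>")
        .alias("trimmed")
    ).printSchema()
    // root
    //  |-- trimmed: array
    //  |    |-- element: struct
    //  |    |    |-- id: string
    //  |    |    |-- size: integer
    // (nullability flags omitted)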

    Spark < 2.4:

    You can use a UDF like this:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.functions.udf

    case class Record(id: String, size: Int)

    // Map each struct in the array to a Record, dropping the useless field
    val dropUseless = udf((xs: Seq[Row]) => xs.map {
      case Row(id: String, size: Int, _) => Record(id, size)
    })
    
    df.select(dropUseless($"subClasss"))
    
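    Because Record is a case class, Spark derives the struct field names (id, size) from it by reflection, so no cast is needed here. A quick usage sketch, again with an illustrative trimmed alias:

    // Sketch only: "trimmed" is an illustrative alias.
    df.select(dropUseless($"subClasss").alias("trimmed")).printSchema()
    // root
    //  |-- trimmed: array
    //  |    |-- element: struct
    //  |    |    |-- id: string
    //  |    |    |-- size: integer
    // (nullability flags omitted)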