Let's say I have a DataFrame like the following (the sample contents are just for illustration):

case class SubClass(id: String, size: Int, useless: String)
case class MotherClass(subClasss: Array[SubClass])

import spark.implicits._ // assumes a SparkSession named spark is in scope

val df = Seq(
  MotherClass(Array(
    SubClass("1", 1, "foo"),
    SubClass("2", 2, "bar")
  ))
).toDF()

I want to drop the useless field from every element of the subClasss array, keeping only id and size.
Spark >= 2.4:
It is possible to use arrays_zip with cast:
import org.apache.spark.sql.functions.arrays_zip

// zip the parallel id and size arrays element-wise into an array of structs
df.select(arrays_zip(
  $"subClasss.id", $"subClasss.size"
).cast("array<struct<id:string,size:int>>"))
where cast is required to rename nested fields; without it Spark uses automatically generated names 0, 1, ..., n.
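To see the effect of the cast, here is a quick schema check (zipped and the subClasss alias are just names I chose for readability; the comments paraphrase the expected schema rather than verbatim printSchema output):

import org.apache.spark.sql.functions.arrays_zip

val zipped = arrays_zip($"subClasss.id", $"subClasss.size")

// without the cast the element struct fields get positional names 0 and 1
df.select(zipped.as("subClasss")).printSchema()

// with the cast the element struct fields are named id and size
df.select(zipped.cast("array<struct<id:string,size:int>>").as("subClasss"))
  .printSchema()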
Spark < 2.4:
You can use a UDF like this:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf

case class Record(id: String, size: Int)

// pattern match on each struct Row, keeping only id and size
val dropUseless = udf((xs: Seq[Row]) => xs.map {
  case Row(id: String, size: Int, _) => Record(id, size)
})

df.select(dropUseless($"subClasss"))
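Returning a case class from the UDF lets Spark derive the struct<id:string,size:int> element type from Record, so no explicit schema is needed. For a quick sanity check (the subClasss alias is again just an illustrative name):

// alias the UDF output and inspect the cleaned structs
df.select(dropUseless($"subClasss").as("subClasss")).show(false)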