A Spark DataFrame contains a column of type Array[Double]. It throws a ClassCastException when I try to get it back in a map() function.
This approach can also be considered:
import spark.implicits._  // needed for toDF on a local Seq (spark being the SparkSession)

case class StudentInfo(fName: String, lName: String, subjects: Seq[String])

val tuples = Seq(("Abhishek", "Sengupta", Seq("MATH", "PHYSICS")))
val dF = tuples.toDF("firstName", "lastName", "subjects")

// collect() brings all rows to the driver; each Row is then read by position
val students = dF
  .collect()
  .map(row => StudentInfo(row.getString(0), row.getString(1), row.getSeq[String](2)))

students.foreach(println)
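Note that collect() pulls every row to the driver, so this approach only works when the result comfortably fits in driver memory.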
ArrayType is represented in a Row as a scala.collection.mutable.WrappedArray. You can extract it using, for example:
val arr: Seq[Double] = r.getAs[Seq[Double]]("x")
or
val i: Int = ???  // the column's index
val arr = r.getSeq[Double](i)
or even:
import scala.collection.mutable.WrappedArray
val arr: WrappedArray[Double] = r.getAs[WrappedArray[Double]]("x")
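Putting it together, here is a minimal sketch that reproduces the question's setup and applies the extraction. It assumes a spark-shell session (so spark.implicits._ is in scope); the column name x and the sample values are made up for illustration:

import org.apache.spark.sql.Row
import spark.implicits._

// hypothetical DataFrame with an ArrayType(DoubleType) column named "x"
val df = Seq(Tuple1(Array(1.0, 2.0, 3.0))).toDF("x")

// getAs[Array[Double]] would throw ClassCastException here, because the
// runtime value is a WrappedArray; asking for Seq[Double] works
val sums = df.rdd.map { r =>
  val arr: Seq[Double] = r.getAs[Seq[Double]]("x")
  arr.sum
}
sums.collect()  // Array(6.0)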
If the DataFrame is relatively thin, pattern matching can be a better approach:
import org.apache.spark.sql.Row

df.rdd.map { case Row(x: Seq[Double]) => (x.toArray, x.sum) }
although you have to keep in mind that the type of the sequence is unchecked: Double is erased at runtime, so the pattern matches any Seq regardless of its element type.
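If you want to silence the compiler's unchecked warning while keeping the same behaviour, one option (a sketch, using the same df as above) is to annotate the erased type argument:

df.rdd.map { case Row(x: Seq[Double @unchecked]) => (x.toArray, x.sum) }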
In Spark >= 1.6 you can also use Dataset as follows:
df.select("x").as[Seq[Double]].rdd