Access Array column in Spark

2020-12-01 15:16

A Spark DataFrame contains a column of type Array[Double]. It throws a ClassCastException when I try to get it back in a map() function. The following Scala code generates the exception.
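
The snippet itself was cut off here; a minimal sketch of the failing pattern, assuming a column named "x" and an active SparkSession called spark (neither is from the original post), might look like this:

    // Hypothetical reproduction, not the code from the original question.
    // Spark stores an ArrayType column inside the Row as a WrappedArray,
    // so casting it to Array[Double] fails at runtime.
    case class Dummy(x: Array[Double])
    val df = spark.createDataFrame(Seq(Dummy(Array(1.0, 2.0, 3.0))))

    df.rdd.map { r =>
      val arr: Array[Double] = r.getAs[Array[Double]]("x") // ClassCastException here
      arr.sum
    }.collect()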

2 Answers
  • 2020-12-01 15:32

    This approach can also be considered:

      // Assumes an active SparkSession named spark; toDF needs its implicits.
      import spark.implicits._

      case class StudentInfo(fName: String, lName: String, subjects: Seq[String])

      val tuples = Seq(("Abhishek", "Sengupta", Seq("MATH", "PHYSICS")))
      val dF = tuples.toDF("firstName", "lastName", "subjects")

      // Collect the rows on the driver and map each Row into the case class;
      // getSeq[String] reads the array column back as a Seq[String].
      val students = dF
        .collect()
        .map(row => StudentInfo(row.getString(0), row.getString(1), row.getSeq[String](2)))

      students.foreach(println)
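
    Keep in mind that collect() materializes every row on the driver, so this only suits small DataFrames. Each printed line should look roughly like StudentInfo(Abhishek,Sengupta,WrappedArray(MATH, PHYSICS)); the exact Seq implementation shown depends on the Scala version.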
    
  • 2020-12-01 15:46

    ArrayType is represented in a Row as a scala.collection.mutable.WrappedArray. You can extract it using, for example:

    val arr: Seq[Double] = r.getAs[Seq[Double]]("x")
    

    or

    val i: Int = ???
    val arr = r.getSeq[Double](i)
    

    or even:

    import scala.collection.mutable.WrappedArray
    
    val arr: WrappedArray[Double] = r.getAs[WrappedArray[Double]]("x")
    

    If the DataFrame is relatively thin, then pattern matching could be a better approach:

    import org.apache.spark.sql.Row
    
    df.rdd.map { case Row(x: Seq[Double]) => (x.toArray, x.sum) }
    

    although you have to keep in mind that the type of the sequence is unchecked.
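
    If you want the compiler to stop warning about the erased element type, the same logic can be written with the type argument annotated (a minor variation, not required):

    df.rdd.map { case Row(x: Seq[Double @unchecked]) => (x.toArray, x.sum) }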

    In Spark >= 1.6 you can also use Dataset as follows:

    df.select("x").as[Seq[Double]].rdd
    