How to convert Row of a Scala DataFrame into case class most efficiently?

情书的邮戳 2020-12-23 09:49

Once I have got some Row in Spark, either from a DataFrame or from Catalyst, I want to convert it to a case class in my code. This can be done by matching:

    someRow match { case Row(a: Long, b: String, c: Double) => myCaseClass(a, b, c) }

4 Answers
  • 2020-12-23 09:56

    As far as I know you cannot cast a Row to a case class, but I sometimes choose to access the row fields directly, like

    map(row => myCaseClass(row.getLong(0), row.getString(1), row.getDouble(2)))
    

    I find this to be easier, especially if the case class constructor only needs some of the fields from the row.
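
    If the positional getters feel brittle, the same idea works with column names via Row.getAs. A minimal sketch, assuming Spark 2.x with spark.implicits._ in scope and an existing DataFrame called dataframe; the case class and column names are only illustrative:

    import spark.implicits._   // provides the encoder needed by Dataset.map

    // Illustrative case class and column names, not taken from the question.
    case class MyCaseClass(id: Long, name: String, score: Double)

    val ds = dataframe.map { row =>
      MyCaseClass(
        id = row.getAs[Long]("id"),        // look fields up by column name
        name = row.getAs[String]("name"),
        score = row.getAs[Double]("score"))
    }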

  • 2020-12-23 10:11

    DataFrame is simply a type alias of Dataset[Row]. These operations are also referred to as "untyped transformations", in contrast to the "typed transformations" that come with strongly typed Scala/Java Datasets.
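
    As a quick illustration of that distinction (a sketch assuming a SparkSession named spark with spark.implicits._ imported):

    import spark.implicits._

    val df = Seq((1, "james"), (2, "tony")).toDF("id", "name")   // DataFrame = Dataset[Row]

    // Untyped transformation: the column is referenced by name, checked only at runtime.
    val untypedNames = df.select("name")                          // still a DataFrame

    // Typed transformation: the element type is known at compile time.
    val typedNames = df.as[(Int, String)].map(_._2)               // Dataset[String]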

    The conversion from Dataset[Row] to Dataset[Person] is very simple in Spark:

    val DFtoProcess = sqlContext.sql("SELECT * FROM peoples WHERE name='test'")

    At this point, Spark converts your data into a DataFrame = Dataset[Row], a collection of generic Row objects, since it does not know the exact type.

    // Create an Encoder for the Java class (in my example Person is a Java bean)
    // For a Scala case class you can use an implicit or product encoder instead (see the sketch below)
    val personEncoder = Encoders.bean(classOf[Person])
    
    val DStoProcess = DFtoProcess.as[Person](personEncoder)
    

    Now Spark converts the Dataset[Row] into a Dataset[Person] of type-specific Scala/Java JVM objects, as dictated by the class Person.
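
    If Person is a Scala case class rather than a Java bean, a minimal sketch of the same conversion (assuming a SparkSession named spark) can rely on the implicit product encoder, or pass one explicitly:

    import org.apache.spark.sql.Encoders
    import spark.implicits._                      // implicit encoders for case classes

    case class Person(name: String, age: Long)    // illustrative fields

    // Implicit encoder picked up from spark.implicits._ ...
    val DStoProcess = DFtoProcess.as[Person]

    // ... or an explicit one, mirroring the Encoders.bean example above.
    val DStoProcessExplicit = DFtoProcess.as[Person](Encoders.product[Person])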

    Please refer to the link below, provided by Databricks, for further details:

    https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html

  • 2020-12-23 10:12

    Of course you can pattern match a Row object into a case class. Let's suppose your schema has many fields and you want to map only a few of them into your case class. If you don't have null fields you can simply do:

    case class MyClass(a: Long, b: String, c: Int, d: String, e: String)
    
    dataframe.map {
      case Row(a: java.math.BigDecimal,
        b: String,
        c: Int,
        _: String,
        d: java.sql.Date,
        e: java.sql.Date,
        _: java.sql.Timestamp,
        _: java.sql.Timestamp,
        _: java.math.BigDecimal,
        _: String) => MyClass(a = a.longValue(), b = b, c = c, d = d.toString, e = e.toString)
    }
    

    This approach will fail in the presence of null values and also requires you to explicitly define the type of every single field. If you have to handle null values you need to either discard all the rows containing null values by doing

    dataframe.na.drop()
    

    That will drop records even if the null fields are not the ones used in the pattern matching for your case class (see the sketch after the next snippet for restricting the drop to specific columns). Or, if you want to handle nulls, you could turn the Row object into a List and then use the Option pattern:

    case class MyClass(a: Long, b: String, c: Option[Int], d: String, e: String)
    
    dataframe.map(_.toSeq.toList match {
      case List(a: java.math.BigDecimal,
        b: String,
        c: Int,
        _: String,
        d: java.sql.Date,
        e: java.sql.Date,
        _: java.sql.Timestamp,
        _: java.sql.Timestamp,
        _: java.math.BigDecimal,
        _: String) => MyClass(
          a = a.longValue(), b = b, c = Option(c), d = d.toString, e = e.toString)
    })
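
    As a side note on the na.drop caveat above, the drop can be limited to the columns that actually feed the case class, so nulls elsewhere do not discard the row (a sketch; the column names here are hypothetical):

    // Drop a row only when one of these specific columns is null;
    // nulls in any other column are left untouched.
    val cleaned = dataframe.na.drop(Seq("a", "b", "c", "d", "e"))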
    

    Check out the GitHub project Sparkz, which will soon introduce a lot of libraries for simplifying the Spark and DataFrame APIs and making them more functional-programming oriented.

  • 2020-12-23 10:17
    scala> import spark.implicits._    
    scala> val df = Seq((1, "james"), (2, "tony")).toDF("id", "name")
    df: org.apache.spark.sql.DataFrame = [id: int, name: string]
    
    scala> case class Student(id: Int, name: String)
    defined class Student
    
    scala> df.as[Student].collectAsList
    res6: java.util.List[Student] = [Student(1,james), Student(2,tony)]
    

    Here the spark in spark.implicits._ is your SparkSession. If you are inside the REPL the session is already defined as spark; otherwise you need to adjust the name to match your SparkSession variable.
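
    Outside the REPL, a minimal sketch for obtaining such a session (the names here are just placeholders):

    import org.apache.spark.sql.SparkSession

    // Build (or reuse) a session; `spark` is only a conventional variable name.
    val spark = SparkSession.builder()
      .appName("row-to-case-class")   // placeholder application name
      .master("local[*]")             // assumption: local run; configure differently on a cluster
      .getOrCreate()

    import spark.implicits._          // after this, df.as[Student] works as shown above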
