I am aware of the method to add a new column to a Spark Dataset using .withColumn() and a UDF, which returns a DataFrame. I am also aware that we can convert the resulting DataFrame back to a Dataset.
In the type-safe world of Datasets you'd map one structure into another. That is, for each transformation we need a schema representation of the data (just as we do for RDDs). To access 'c' above, we need to create a new schema that provides access to it.
case class A(a: String)
case class BC(b: String, c: String)

val f: A => BC = a => BC(a.a, "c") // Transforms an A into a BC

val data = (1 to 10).map(i => A(i.toString))
val dsa = spark.createDataset(data)
// dsa: org.apache.spark.sql.Dataset[A] = [a: string]

val dsb = dsa.map(f)
// dsb: org.apache.spark.sql.Dataset[BC] = [b: string, c: string]
Just to add to @maasg's excellent answer...
How does Dataset's type safety come into play here, if we are still following the traditional DF approach (i.e. passing column names as a string for the UDF's input)?
Let me answer this with another question: who is "we" in "we are still following..."? If you mean me, I disagree: I only use DataFrames when I'm too lazy to create a case class to describe the data set to work with.
My answer regarding UDFs is to stay away from them unless they are very simple and there is nothing the Spark Optimizer could optimize anyway. Yes, I do believe UDFs are so easy to define and use that I myself have been carried away into (over)using them far too many times. There are around 239 functions available in Spark SQL 2.0, so you can usually think hard(er) and find a solution with standard functions rather than UDFs.
scala> spark.version
res0: String = 2.1.0-SNAPSHOT
scala> spark.catalog.listFunctions.count
res1: Long = 240
(240 above is because I registered one UDF).
You should always prefer standard functions since they can be optimized: Spark understands what you're doing and can hence optimize your queries.
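For instance, here's a minimal sketch contrasting a UDF with the equivalent standard function, reusing dsa from @maasg's example and assuming spark.implicits._ is in scope (as in spark-shell); the column name a_upper is made up for illustration:

import org.apache.spark.sql.functions.{udf, upper}

// A UDF is a black box to the optimizer.
val upperUdf = udf { s: String => s.toUpperCase }
val withUdf = dsa.withColumn("a_upper", upperUdf($"a"))

// The standard function upper does the same thing, but Catalyst can optimize it.
val withStd = dsa.withColumn("a_upper", upper($"a"))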
You should also use Datasets (not Dataset[Row], i.e. DataFrame) because they give you type-safe access to fields.
(Yet some of the Dataset "goodies" can't be optimized either, since Dataset programming is all about custom Scala code that Spark can't optimize as much as DataFrame-based code.)
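To make the type-safety point concrete, a small sketch reusing dsa from above (again assuming spark.implicits._ is in scope):

import org.apache.spark.sql.functions.length

// Typed access: a typo in the field name is a compile-time error.
dsa.map(_.a.length)

// Untyped, DataFrame-style access: a typo in "a" only fails at runtime.
dsa.select(length($"a"))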
Is there an "Object Oriented Way" of accessing columns(without passing column names as a string) like we used to do with RDD, for appending a new column.
Yes. Of course. Use case classes to define the schema of your Datasets and use their fields, both to access and to add columns (that's what @maasg responded to nicely, so I'm not gonna repeat his words here).
How to access the new column in normal operations like map, filter etc?
Easy...again. Use the case class that describes (the schema of) your data sets. How do you add a new "something" to an existing object? You can't, unless its class has already agreed to carry that attribute, can you?
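As a sketch, with dsb: Dataset[BC] from @maasg's example, the new column c is just another field of the case class (spark.implicits._ provides the encoders):

// filter and map see "c" as a plain Scala field, checked at compile time.
val onlyCs = dsb.filter(bc => bc.c == "c")
val labels = dsb.map(bc => s"${bc.b}/${bc.c}") // Dataset[String]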
In ""Object Oriented Way" of accessing columns or appending a new column." if your column is an attribute of a case class, you can't say "This is a class that describes the data and at the same time say this is a class that may have a new attribute". It's not possible in OOP/FP, is it?
That's why adding a new column boils down to use another case class or use withColumn
. What's wrong with that? I think there is...simply...nothing wrong with that.
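And if you do reach for withColumn, here's a hedged sketch of getting back to the typed world afterwards (reusing A, BC and dsa from @maasg's example, with spark.implicits._ in scope):

import org.apache.spark.sql.functions.lit

// withColumn returns a DataFrame; .as[BC] re-attaches the case class.
val dsb2 = dsa
  .withColumnRenamed("a", "b") // line the existing column up with BC.b
  .withColumn("c", lit("c"))   // append the new column
  .as[BC]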