I'm new to Spark 2.0 and using Datasets in our code base. I've noticed that I need to import spark.implicits._ practically everywhere in our code. For example:
File A
class A {
  def job(spark: SparkSession) = {
    import spark.implicits._
    //create dataset ds
    val b = new B(spark)
    b.doSomething(ds)
    doSomething(ds, spark)
  }

  private def doSomething(ds: Dataset[Foo], spark: SparkSession) = {
    import spark.implicits._
    ds.map(e => 1)
  }
}
File B
class B(spark: SparkSession) {
  def doSomething(ds: Dataset[Foo]) = {
    import spark.implicits._
    ds.map(e => "SomeString")
  }
}
What I wanted to ask is whether there's a cleaner way to do
ds.map(e => "SomeString")
without importing implicits in every function where I call map. If I don't import it, I get the following error:
Error:(53, 13) Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
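For reference, Foo in the snippets above is assumed to be an ordinary case class (a Product type), so an Encoder can be derived for it once the implicits are imported, e.g.:
case class Foo(id: Long, name: String) // hypothetical fields, for illustration only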
Something that would help a bit would be to do the import inside the class or object instead of inside each function. For your "File A" and "File B" examples:
File A
class A {
  val spark = SparkSession.builder.getOrCreate()
  import spark.implicits._

  def job() = {
    //create dataset ds
    val b = new B(spark)
    b.doSomething(ds)
    doSomething(ds)
  }

  private def doSomething(ds: Dataset[Foo]) = {
    ds.map(e => 1)
  }
}
File B
class B(spark: SparkSession) {
  import spark.implicits._

  def doSomething(ds: Dataset[Foo]) = {
    ds.map(e => "SomeString")
  }
}
In this way, you get a manageable amount of imports.
Unfortunately, to my knowledge there is no other way to reduce the number of imports further. This is because the SparkSession object is needed in scope when doing the actual import, so this is the best that can be done.
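For illustration, a caller of the refactored classes might look like the sketch below; the Main object, the sample data, and the Foo fields are purely hypothetical:
import org.apache.spark.sql.SparkSession

object Main {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.getOrCreate()
    import spark.implicits._ // still needed here for toDS()

    // hypothetical sample data; toDS() works because Foo is a case class
    val ds = Seq(Foo(1L, "a"), Foo(2L, "b")).toDS()

    new B(spark).doSomething(ds)
    new A().job()

    spark.stop()
  }
}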
Update:
An even more convenient method is to create a Scala trait and combine it with an empty object. This allows you to import the implicits at the top of each file, while classes that need the SparkSession object can simply extend the trait.
Example:
trait SparkJob {
  val spark: SparkSession = SparkSession.builder
    .master(...)
    .config(..., ...) // Any settings to be applied
    .getOrCreate()
}

object SparkJob extends SparkJob {}
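The empty object matters: it gives spark a stable path, which is what makes import SparkJob.spark.implicits._ legal at the top of a file (you cannot import implicits from the trait alone).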
With this we can do the following for Files A and B:
File A:
import SparkJob.spark.implicits._

class A extends SparkJob {
  spark.sql(...) // Allows for usage of the SparkSession inside the class
  ...
}
File B:
import SparkJob.spark.implicits._

class B extends SparkJob {
  ...
}
Note that it's only necessary to extend SparkJob for the classes or objects that use the spark object itself.
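To make that distinction concrete, here is a minimal sketch (assuming the SparkJob trait/object above with real settings filled in, and the hypothetical Foo case class): B only maps over a Dataset, so importing the implicits is enough, while A also uses spark directly and therefore extends the trait:
import org.apache.spark.sql.Dataset
import SparkJob.spark.implicits._

case class Foo(id: Long, name: String) // hypothetical, as above

class B {
  // only needs the Encoder implicits, so it does not extend SparkJob
  def doSomething(ds: Dataset[Foo]): Dataset[String] =
    ds.map(e => "SomeString")
}

class A extends SparkJob {
  // uses the spark object directly, so it extends the trait
  def job(): Unit = {
    val ds = Seq(Foo(1L, "a"), Foo(2L, "b")).toDS()
    new B().doSomething(ds).show()
    spark.sql("SELECT 1").show()
  }
}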
Source: https://stackoverflow.com/questions/45724290/workaround-for-importing-spark-implicits-everywhere