Spark - Task not serializable: How to work with complex map closures that call outside classes/objects?

伪装坚强ぢ 2021-01-31 19:44

Take a look at this question: Scala + Spark - Task not serializable: java.io.NotSerializableException when calling function outside closure only on classes not objects.
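
To make the linked problem concrete, here is a minimal sketch that reproduces it without Spark. It assumes Scala 2.12 or later and plain Java serialization, which is what Spark uses for closures. `Helper`, `HelperObj` and `ClosureCheck` are hypothetical names invented for this illustration.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Hypothetical helper class, deliberately NOT Serializable.
class Helper {
  def addOne(x: Int): Int = x + 1
  // The lambda calls an instance method, so it captures `this`.
  def closure: Int => Int = x => addOne(x)
}

// Same logic in an object: the lambda has no instance to capture.
object HelperObj {
  def addOne(x: Int): Int = x + 1
  def closure: Int => Int = x => addOne(x)
}

object ClosureCheck {
  // True if Java serialization accepts `f`, false if it hits a
  // NotSerializableException somewhere in the captured object graph.
  def serializes(f: AnyRef): Boolean =
    try {
      new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(f)
      true
    } catch {
      case _: NotSerializableException => false
    }
}
```

Under those assumptions, `ClosureCheck.serializes(new Helper().closure)` is false because the closure drags the non-serializable `Helper` instance along, while `ClosureCheck.serializes(HelperObj.closure)` is true.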

2 Answers
  • 2021-01-31 20:17

    I figured out how to do this myself!

    You simply need to serialize the objects before passing them through the closure and deserialize them afterwards. This works even if your classes aren't Serializable, because it uses Kryo behind the scenes. All you need is some curry. ;)

    Here's an example of how I did it:

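    // Curried: the first parameter list takes the Kryo-wrapped function,
    // the second takes the element the closure is applied to.
    def genMapper(kryoWrapper: KryoSerializationWrapper[(Foo => Bar)])
                 (foo: Foo): Bar = {
        kryoWrapper.value.apply(foo)
    }
    val mapper = genMapper(KryoSerializationWrapper(new Blah(abc))) _
    rdd.flatMap(mapper).collectAsMap()

    // Must be a class, not an object, because it takes a constructor argument.
    class Blah(abc: ABC) extends (Foo => Bar) {
        def apply(foo: Foo): Bar = {
            ??? // This is the real function
        }
    }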
    

    Feel free to make Blah as complicated as you want: a class, a companion object, nested classes, references to multiple third-party libraries.

    KryoSerializationWrapper refers to: https://github.com/amplab/shark/blob/master/src/main/scala/shark/execution/serialization/KryoSerializationWrapper.scala
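
    For readers who cannot open the link, here is a rough sketch of how such a wrapper might work. It is not the Shark source. It assumes Spark's `org.apache.spark.serializer.KryoSerializer` is on the classpath, and `KryoWrapperSketch` is a name made up for this illustration.

```scala
import java.io.{ObjectInputStream, ObjectOutputStream}
import java.nio.ByteBuffer

import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoSerializer

// Hypothetical stand-in for the linked wrapper (assumption, not Shark code).
// Java serialization only ever sees a byte array; Kryo encodes the payload.
class KryoWrapperSketch[T](@transient private var payload: T) extends Serializable {
  private var bytes: Array[Byte] = _

  def value: T = payload

  // Hook called by Java serialization when the closure is shipped:
  // encode the payload with Kryo, then write the resulting bytes.
  private def writeObject(out: ObjectOutputStream): Unit = {
    val buf = KryoWrapperSketch.kryo().serialize[Any](payload)
    bytes = new Array[Byte](buf.remaining())
    buf.get(bytes)
    out.defaultWriteObject()
  }

  // Hook called by Java serialization on the receiving side:
  // read the bytes back, then decode the payload with Kryo.
  private def readObject(in: ObjectInputStream): Unit = {
    in.defaultReadObject()
    payload = KryoWrapperSketch.kryo()
      .deserialize[Any](ByteBuffer.wrap(bytes))
      .asInstanceOf[T]
  }
}

object KryoWrapperSketch {
  // A fresh serializer instance with default settings (an assumption;
  // a real job might reuse its own SparkConf and Kryo registrations).
  private def kryo() = new KryoSerializer(new SparkConf()).newInstance()

  def apply[T](value: T): KryoWrapperSketch[T] = new KryoWrapperSketch(value)
}
```

    The wrapper itself is Serializable, so the closure that captures it passes Spark's check, and the wrapped object never has to implement Serializable.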

  • 2021-01-31 20:37

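    If you are using the Java API, avoid passing an anonymous class as the mapping function. An anonymous inner class created in an instance method keeps an implicit reference to the enclosing object, which then has to be serializable too. Instead of calling map(new Function), define a named class that implements the function interface and pass an instance of it to map(..). See: https://yanago.wordpress.com/2015/03/21/apache-spark/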
