How to override setup and cleanup methods in a Spark map function


Question


Suppose there is the following MapReduce job:

Mapper:

setup() initializes some state

map() adds data to the state, no output

cleanup() outputs the state to the context

Reducer:

aggregate all states into one output

How could such a job be implemented in Spark?

Additional question: how could such a job be implemented in Scalding? I'm looking for an example which somehow reproduces the method overrides...


Answer 1:


Spark's map doesn't provide an equivalent of Hadoop's setup and cleanup. It assumes that each call is independent and side-effect free.

The closest equivalent you can get is to put the required logic inside mapPartitions or mapPartitionsWithIndex, with a simplified template:

rdd.mapPartitions { iter =>
  ... // initialize state
  val result = ??? // compute result for iter
  ... // perform cleanup
  ... // return results as an Iterator[U]
}
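
Filled in, a minimal sketch could look like the following; it assumes rdd is an existing RDD[String] and uses a simple per-partition count map as the state purely for illustration, so the state type and the final reduce are placeholders for your own logic:

val perPartitionState = rdd.mapPartitions { iter =>
  // setup(): initialize mutable per-partition state
  val state = scala.collection.mutable.Map.empty[String, Long]
  // map(): fold every record into the state, emitting nothing per record
  iter.foreach { record =>
    state(record) = state.getOrElse(record, 0L) + 1L
  }
  // cleanup(): emit the accumulated state exactly once per partition
  Iterator(state.toMap)
}

// "Reducer": aggregate the per-partition states into a single result
val merged = perPartitionState.reduce { (a, b) =>
  (a.keySet ++ b.keySet).map(k => k -> (a.getOrElse(k, 0L) + b.getOrElse(k, 0L))).toMap
}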



Answer 2:


A standard approach to setup in Scala would be to use a lazy val:

lazy val someSetupState = { ... }
data.map { x =>
  useState(someSetupState, x)
  ...
}

The above works as long as someSetupState can be instantiated on the tasks (i.e. it does not rely on the local disk of the submitting node). This does not address cleanup. For cleanup, Scalding has a method:

    TypedPipe[T]#onComplete(fn: () => Unit): TypedPipe[T]

which is run on each task at the end. Similar to the mapping example, you can do a shutdown:

    data.map { x =>
      useState(someSetupState, x)
    }
    .onComplete { () =>
      someSetupState.shutdown()
    }
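
Putting the two snippets together, a minimal sketch might look like this; SomeExpensiveClient, process and shutdown are hypothetical placeholders for your own setup state, and data is assumed to be an existing TypedPipe[String]:

    import com.twitter.scalding.TypedPipe

    // setup(): lazy, so the client is only built when first used on a task
    lazy val someSetupState = new SomeExpensiveClient() // hypothetical

    val result: TypedPipe[String] =
      data
        .map { x => someSetupState.process(x) }          // use the per-task state
        .onComplete { () => someSetupState.shutdown() }  // cleanup(): runs once per task at the end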

I don't know the equivalent for Spark.



Source: https://stackoverflow.com/questions/39947677/how-to-override-setup-and-cleanup-methods-in-spark-map-function
