Question
Suppose there is the following MapReduce job:
Mapper:
setup() initializes some state
map() adds data to the state, produces no output
cleanup() outputs the state to the context
Reducer:
aggregates all states into one output
How could such a job be implemented in Spark?
Additional question: how could such a job be implemented in Scalding? I'm looking for an example which somehow makes the method overloadings...
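For reference, here is a minimal sketch (in Scala, against the Hadoop Mapper API) of the pattern described above; the word-count state and the key/value types are illustrative assumptions, not part of the question:

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.Mapper

// Accumulates per-task state in map() and emits it once in cleanup().
class StatefulMapper extends Mapper[LongWritable, Text, Text, LongWritable] {
  private val counts = scala.collection.mutable.Map.empty[String, Long].withDefaultValue(0L)

  // setup(): initialize some state
  override def setup(ctx: Mapper[LongWritable, Text, Text, LongWritable]#Context): Unit =
    counts.clear()

  // map(): add data to the state, produce no per-record output
  override def map(key: LongWritable, value: Text,
                   ctx: Mapper[LongWritable, Text, Text, LongWritable]#Context): Unit =
    counts(value.toString) += 1L

  // cleanup(): output the accumulated state to the context
  override def cleanup(ctx: Mapper[LongWritable, Text, Text, LongWritable]#Context): Unit =
    counts.foreach { case (k, v) => ctx.write(new Text(k), new LongWritable(v)) }
}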
Answer 1:
Spark's map doesn't provide an equivalent of Hadoop's setup and cleanup. It assumes that each call is independent and side-effect free. The closest equivalent you can get is to put the required logic inside mapPartitions or mapPartitionsWithIndex, with a simplified template:
rdd.mapPartitions { iter =>
  ... // initialize state
  val result = ??? // compute the result for iter
  ... // perform cleanup
  ... // return the results as an Iterator[U]
}
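A more concrete, self-contained sketch of that template, assuming a word-count-style state (the sample data, the local[*] master, and the final reduceByKey are illustrative additions; that last step plays the role of the question's reducer):

import org.apache.spark.{SparkConf, SparkContext}

object MapPartitionsSetupCleanup {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("setup-cleanup").setMaster("local[*]"))
    val rdd = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"))

    val perPartitionCounts = rdd.mapPartitions { iter =>
      // "setup": per-partition mutable state, created once per task
      val counts = scala.collection.mutable.Map.empty[String, Long].withDefaultValue(0L)
      // "map": fold every record of the partition into the state, emitting nothing per record
      iter.foreach(word => counts(word) += 1L)
      // "cleanup": any teardown (e.g. closing a connection) would go here,
      // then the accumulated state is emitted once, like Hadoop's cleanup()
      counts.iterator
    }

    // the reducer's "aggregate all states" step becomes a shuffle-side aggregation
    perPartitionCounts.reduceByKey(_ + _).collect().foreach(println)
    sc.stop()
  }
}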
Answer 2:
A standard approach to setup in Scala would be to use a lazy val:
lazy val someSetupState = { .... }
data.map { x =>
  useState(someSetupState, x)
  ...
}
The above works as long as someSetupState can be instantiated on the tasks (i.e. it does not use some local disk of the submitting node). This does not address cleanup. For cleanup, Scalding has a method:
TypedPipe[T]#onComplete(fn: () => Unit): TypedPipe[T]
which is run on each task at the end. Similar to the mapping example, you can do a shutdown:
data.map { x =>
  useState(someSetupState, x)
}
.onComplete { () =>
  someSetupState.shutdown()
}
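Putting both pieces together, a hypothetical Scalding job could look like the sketch below; the TextLine input, TypedTsv output, and the SetupState object are placeholders, not part of the original answer:

import com.twitter.scalding._

// Per-task state held in an object, so the lazy val is (re)initialized on each task JVM.
object SetupState extends Serializable {
  lazy val prefix: String = "prefix-" // stand-in for an expensive resource built in "setup"
  def shutdown(): Unit = ()           // stand-in cleanup hook
}

class SetupCleanupJob(args: Args) extends Job(args) {
  TypedPipe.from(TextLine(args("input")))
    .map { line => SetupState.prefix + line }   // setup happens lazily on first use, per task
    .onComplete { () => SetupState.shutdown() } // cleanup runs on each task at the end
    .write(TypedTsv[String](args("output")))
}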
I don't know the equivalent for Spark.
Source: https://stackoverflow.com/questions/39947677/how-to-override-setup-and-cleanup-methods-in-spark-map-function