Question
Suppose there is the following MapReduce job:
Mapper:
setup() initializes some state
map() adds data to the state, produces no output
cleanup() outputs the state to the context
Reducer:
aggregates all states into one output
How could such a job be implemented in Spark?
Additional question: how could such a job be implemented in Scalding? I'm looking for an example which somehow makes the method overloadings...
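For reference, here is a minimal sketch (in Scala, against the Hadoop Mapper API) of the pattern described above; the word-count state and the key/value types are illustrative assumptions, not part of the question:

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.Mapper

// Accumulates per-task state in map() and emits it once in cleanup().
class StatefulMapper extends Mapper[LongWritable, Text, Text, LongWritable] {
  private val counts = scala.collection.mutable.Map.empty[String, Long].withDefaultValue(0L)

  // setup(): initialize some state
  override def setup(ctx: Mapper[LongWritable, Text, Text, LongWritable]#Context): Unit =
    counts.clear()

  // map(): add data to the state, produce no per-record output
  override def map(key: LongWritable, value: Text,
                   ctx: Mapper[LongWritable, Text, Text, LongWritable]#Context): Unit =
    counts(value.toString) += 1L

  // cleanup(): output the accumulated state to the context
  override def cleanup(ctx: Mapper[LongWritable, Text, Text, LongWritable]#Context): Unit =
    counts.foreach { case (k, v) => ctx.write(new Text(k), new LongWritable(v)) }
}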
Answer 1:
Spark's map doesn't provide an equivalent of Hadoop's setup and cleanup. It assumes that each call is independent and side-effect free. The closest equivalent you can get is to put the required logic inside mapPartitions or mapPartitionsWithIndex, with a simplified template:
rdd.mapPartitions { iter =>
  ... // initialize state
  val result = ??? // compute the result for iter
  ... // perform cleanup
  ... // return the results as an Iterator[U]
}
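A more concrete, self-contained sketch of that template, assuming a word-count-style state (the sample data, the local[*] master, and the final reduceByKey are illustrative additions; that last step plays the role of the question's reducer):

import org.apache.spark.{SparkConf, SparkContext}

object MapPartitionsSetupCleanup {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("setup-cleanup").setMaster("local[*]"))
    val rdd = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"))

    val perPartitionCounts = rdd.mapPartitions { iter =>
      // "setup": per-partition mutable state, created once per task
      val counts = scala.collection.mutable.Map.empty[String, Long].withDefaultValue(0L)
      // "map": fold every record of the partition into the state, emitting nothing per record
      iter.foreach(word => counts(word) += 1L)
      // "cleanup": any teardown (e.g. closing a connection) would go here,
      // then the accumulated state is emitted once, like Hadoop's cleanup()
      counts.iterator
    }

    // the reducer's "aggregate all states" step becomes a shuffle-side aggregation
    perPartitionCounts.reduceByKey(_ + _).collect().foreach(println)
    sc.stop()
  }
}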
Answer 2:
A standard approach to setup in Scala would be to use a lazy val:
lazy val someSetupState = { .... }
data.map { x =>
  useState(someSetupState, x)
  ...
}
The above works as long as someSetupState can be instantiated on the tasks (i.e. it does not use some local disk of the submitting node). This does not address cleanup. For cleanup, Scalding has a method:
TypedPipe[T]#onComplete(fn: () => Unit): TypedPipe[T]
which is run on each task at the end. Similar to the mapping example, you can do a shutdown:
data.map { x =>
  useState(someSetupState, x)
}
.onComplete { () =>
  someSetupState.shutdown()
}
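Putting both pieces together, a hypothetical Scalding job could look like the sketch below; the TextLine input, TypedTsv output, and the SetupState object are placeholders, not part of the original answer:

import com.twitter.scalding._

// Per-task state held in an object, so the lazy val is (re)initialized on each task JVM.
object SetupState extends Serializable {
  lazy val prefix: String = "prefix-" // stand-in for an expensive resource built in "setup"
  def shutdown(): Unit = ()           // stand-in cleanup hook
}

class SetupCleanupJob(args: Args) extends Job(args) {
  TypedPipe.from(TextLine(args("input")))
    .map { line => SetupState.prefix + line }   // setup happens lazily on first use, per task
    .onComplete { () => SetupState.shutdown() } // cleanup runs on each task at the end
    .write(TypedTsv[String](args("output")))
}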
I don't know the equivalent for Spark.
Source: https://stackoverflow.com/questions/39947677/how-to-override-setup-and-cleanup-methods-in-spark-map-function