Are the setup and cleanup methods called in each mapper and reducer task? Or are they called only once, at the start of the overall mapper and reducer jobs?
One clarification may help: the setup and cleanup methods handle initialization and cleanup at the task level. Within a task, setup() is called once, then map() [or reduce()] is called for each input record [or key], and finally cleanup() is called once before the task exits.
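To make that ordering concrete, here is a minimal plain-Java sketch that mimics the lifecycle of a single task without depending on Hadoop. The class TaskLifecycleDemo and its methods are hypothetical stand-ins, not Hadoop's Mapper API; only the call order follows the description above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for one mapper task; NOT Hadoop's Mapper class.
class TaskLifecycleDemo {
    private final List<String> log = new ArrayList<>();

    void setup() { log.add("setup"); }
    void map(String record) { log.add("map(" + record + ")"); }
    void cleanup() { log.add("cleanup"); }

    // One setup(), then one map() per record, then one cleanup().
    List<String> run(List<String> records) {
        setup();
        for (String record : records) {
            map(record);
        }
        cleanup();
        return log;
    }

    public static void main(String[] args) {
        List<String> calls = new TaskLifecycleDemo().run(Arrays.asList("a", "b", "c"));
        System.out.println(calls); // [setup, map(a), map(b), map(c), cleanup]
    }
}
```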
According to the MapReduce documentation, setup and cleanup are called for each Mapper and Reducer task.
For the reducer, you can call job.setNumReduceTasks(1); on the job, so that the reducer's setup and cleanup run only once.
They are called once per Mapper task or Reducer task. Here is the Hadoop code (the Reducer's run method):
public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    try {
        while (context.nextKey()) {
            reduce(context.getCurrentKey(), context.getValues(), context);
        }
    } finally {
        cleanup(context);
    }
}
They are called for each task, so if you have 20 mappers running, the setup / cleanup will be called for each one.
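As a rough illustration of the counts involved, the following plain-Java simulation runs 20 pretend tasks and tallies the calls. PerTaskCountDemo, runTask and runJob are hypothetical names invented for this sketch, and the figure of 100 records per task is an arbitrary assumption; none of it is Hadoop's API.

```java
// Hypothetical simulation; class and method names are illustrative, not Hadoop's API.
class PerTaskCountDemo {
    static int setupCalls = 0;
    static int mapCalls = 0;
    static int cleanupCalls = 0;

    // One simulated task: setup once, map once per record, cleanup once.
    static void runTask(int recordsInSplit) {
        setupCalls++;
        for (int i = 0; i < recordsInSplit; i++) {
            mapCalls++;
        }
        cleanupCalls++;
    }

    // Runs the given number of tasks and returns {setup, map, cleanup} call counts.
    static int[] runJob(int tasks, int recordsPerTask) {
        setupCalls = 0;
        mapCalls = 0;
        cleanupCalls = 0;
        for (int t = 0; t < tasks; t++) {
            runTask(recordsPerTask);
        }
        return new int[] { setupCalls, mapCalls, cleanupCalls };
    }

    public static void main(String[] args) {
        int[] c = runJob(20, 100);
        System.out.println("setup=" + c[0] + " map=" + c[1] + " cleanup=" + c[2]);
        // prints: setup=20 map=2000 cleanup=20
    }
}
```

Unlike this single-process simulation, tasks on a real cluster typically run in separate JVMs, so anything initialized in setup() is local to that one task.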
One gotcha is that the standard run method for both Mapper and Reducer does not catch exceptions around the map / reduce methods, so if an exception is thrown in these methods, the cleanup method will not be called.
2020 Edit: As noted in the comments, this statement from 2012 (Hadoop 0.20) is no longer true; cleanup is now called from a finally block.
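The effect of the finally block can be checked with a small plain-Java sketch. CleanupOnFailureDemo is a hypothetical stand-in that only copies the try/finally shape of the run method; it is not Hadoop's Reducer, and the "bad" key is an invented trigger for the failure.

```java
import java.io.IOException;

// Hypothetical stand-in that copies the try/finally shape of run(); NOT Hadoop's Reducer.
class CleanupOnFailureDemo {
    boolean cleanedUp = false;

    void setup() { }

    // Fails on the invented key "bad" to simulate an exception inside reduce().
    void reduce(String key) throws IOException {
        if (key.equals("bad")) {
            throw new IOException("failed on key: " + key);
        }
    }

    void cleanup() { cleanedUp = true; }

    void run(String[] keys) throws IOException {
        setup();
        try {
            for (String key : keys) {
                reduce(key);
            }
        } finally {
            cleanup(); // runs even when reduce() throws
        }
    }

    public static void main(String[] args) {
        CleanupOnFailureDemo task = new CleanupOnFailureDemo();
        try {
            task.run(new String[] { "ok", "bad", "never-reached" });
        } catch (IOException e) {
            System.out.println("task failed: " + e.getMessage());
        }
        System.out.println("cleanup ran: " + task.cleanedUp); // cleanup ran: true
    }
}
```

Note that the exception still propagates out of run, so the task fails; the finally block only guarantees that cleanup gets its chance to run first.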