Question
When we run a data-intensive job on Hadoop, Hadoop executes it. What I want is that when the job completes, it gives me statistics about the executed job: time consumed, number of mappers, number of reducers, and other useful information.
This information is displayed in the browser (e.g. the JobTracker and DataNode pages) during job execution. But how can I get these statistics inside my own application, which runs the job on Hadoop, so that it can produce a report when the job completes? My application is in Java.
Is there any API that can help me? Suggestions will be appreciated.
Answer 1:
Look into the following methods of JobClient:
- getMapTaskReports(JobID)
- getReduceTaskReports(JobID)
Both calls return an array of TaskReport objects, from which you can pull the start/finish times and the individual counters for each task.
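The kind of per-phase summary you would build from those reports can be sketched without a cluster. The snippet below is a minimal, self-contained illustration (the `Span` record is a hypothetical stand-in for the start/finish timestamps a real `TaskReport` returns via `getStartTime()`/`getFinishTime()`); it shows the aggregation step only, not the Hadoop API itself:

```java
import java.util.List;

public class TaskStatsSketch {
    // Hypothetical stand-in for the (startTime, finishTime) pair,
    // in milliseconds, that a TaskReport provides.
    record Span(long startMillis, long finishMillis) {
        long durationMillis() { return finishMillis - startMillis; }
    }

    // Summarize a list of task spans: task count, total and longest duration.
    static String summarize(String phase, List<Span> spans) {
        long total = 0, longest = 0;
        for (Span s : spans) {
            total += s.durationMillis();
            longest = Math.max(longest, s.durationMillis());
        }
        return phase + ": " + spans.size() + " tasks, total " + total / 1000
                + "s, longest " + longest / 1000 + "s";
    }

    public static void main(String[] args) {
        // Example timestamps for three map tasks.
        List<Span> maps = List.of(
                new Span(0, 7_000), new Span(0, 20_000), new Span(0, 7_000));
        System.out.println(summarize("map", maps));
    }
}
```

In a real application you would build the `Span` list from the `TaskReport[]` arrays returned by the two `JobClient` methods above, one list for maps and one for reduces.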
Answer 2:
Chris is correct. The documentation of TaskReport states that org.apache.hadoop.mapred.TaskReport inherits those methods from org.apache.hadoop.mapreduce.TaskReport, so you can get those values.
Here is the code to get the start and finish times of a job's individual map and reduce tasks.
import java.net.InetSocketAddress;
import java.text.SimpleDateFormat;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobStatus;
import org.apache.hadoop.mapred.RunningJob;
import org.apache.hadoop.mapred.TaskReport;
import org.apache.hadoop.util.StringUtils;

public class mini {
    public static void main(String[] args) {
        String jobTrackerHost = "192.168.151.14";
        int jobTrackerPort = 54311;
        try {
            Configuration conf = new Configuration();
            JobClient jobClient = new JobClient(new InetSocketAddress(jobTrackerHost, jobTrackerPort), conf);
            JobStatus[] activeJobs = jobClient.jobsToComplete();
            SimpleDateFormat dateFormat = new SimpleDateFormat("d-MMM-yyyy HH:mm:ss");
            for (JobStatus js : activeJobs) {
                System.out.println(js.getJobID());
                RunningJob runningJob = jobClient.getJob(js.getJobID());
                // Poll until the job completes; sleep so we don't busy-wait.
                while (!runningJob.isComplete()) {
                    Thread.sleep(1000);
                }
                // Per-task reports for the map phase.
                TaskReport[] mapTaskReports = jobClient.getMapTaskReports(js.getJobID());
                for (TaskReport tr : mapTaskReports) {
                    System.out.println("Task ID: " + tr.getTaskID()
                            + " Start Time: " + StringUtils.getFormattedTimeWithDiff(dateFormat, tr.getStartTime(), 0)
                            + " Finish Time: " + StringUtils.getFormattedTimeWithDiff(dateFormat, tr.getFinishTime(), tr.getStartTime()));
                }
                // Per-task reports for the reduce phase.
                TaskReport[] reduceTaskReports = jobClient.getReduceTaskReports(js.getJobID());
                for (TaskReport tr : reduceTaskReports) {
                    System.out.println("Task ID: " + tr.getTaskID()
                            + " Start Time: " + StringUtils.getFormattedTimeWithDiff(dateFormat, tr.getStartTime(), 0)
                            + " Finish Time: " + StringUtils.getFormattedTimeWithDiff(dateFormat, tr.getFinishTime(), tr.getStartTime()));
                }
            }
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}
This is a simple example that prints the start and finish time of each task of a running job; you can extend it in whatever way you want.
And here is a run of this program for a "Word Count" MapReduce job:
[root@dev1-slave1 ~]# java -classpath /usr/lib/hadoop/hadoop-core.jar:/usr/lib/hadoop/lib/jackson-core-asl-1.8.8.jar:/usr/lib/hadoop/lib/jackson-mapper-asl-1.8.8.jar:/usr/lib/hadoop/lib/commons-logging-1.1.1.jar:/usr/lib/hadoop/lib/commons-configuration-1.6.jar:/usr/lib/hadoop/lib/commons-lang-2.4.jar:. mini
job_201501151144_0042
Task ID: task_201501151144_0042_m_000000 Start Time: 16-Jan-2015 17:07:35 Finish Time: 16-Jan-2015 17:07:43 (7sec)
Task ID: task_201501151144_0042_m_000001 Start Time: 16-Jan-2015 17:07:35 Finish Time: 16-Jan-2015 17:07:56 (20sec)
Task ID: task_201501151144_0042_m_000002 Start Time: 16-Jan-2015 17:07:35 Finish Time: 16-Jan-2015 17:07:43 (7sec)
Task ID: task_201501151144_0042_m_000003 Start Time: 16-Jan-2015 17:07:43 Finish Time: 16-Jan-2015 17:07:53 (10sec)
Task ID: task_201501151144_0042_m_000004 Start Time: 16-Jan-2015 17:07:43 Finish Time: 16-Jan-2015 17:07:53 (10sec)
Task ID: task_201501151144_0042_r_000000 Start Time: 16-Jan-2015 17:07:43 Finish Time: 16-Jan-2015 17:08:00 (17sec)
Task ID: task_201501151144_0042_r_000001 Start Time: 16-Jan-2015 17:07:43 Finish Time: 16-Jan-2015 17:08:05 (22sec)
Task ID: task_201501151144_0042_r_000002 Start Time: 16-Jan-2015 17:07:43 Finish Time: 16-Jan-2015 17:08:05 (21sec)
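The "(7sec)" suffix on each finish time is the elapsed time relative to the task's start time. The formatting can be sketched in plain Java; this is a simplified stand-in to show the idea, not Hadoop's actual StringUtils implementation:

```java
import java.text.SimpleDateFormat;
import java.util.Date;

public class TimeDiffSketch {
    // Format a timestamp and, when a nonzero reference time is given,
    // append the elapsed whole seconds in parentheses, similar to the
    // report lines above.
    static String formatWithDiff(SimpleDateFormat fmt, long timeMillis, long refMillis) {
        String s = fmt.format(new Date(timeMillis));
        if (refMillis > 0) {
            s += " (" + (timeMillis - refMillis) / 1000 + "sec)";
        }
        return s;
    }

    public static void main(String[] args) {
        SimpleDateFormat fmt = new SimpleDateFormat("d-MMM-yyyy HH:mm:ss");
        long start = System.currentTimeMillis();
        long finish = start + 7_000L;
        System.out.println(formatWithDiff(fmt, finish, start)); // elapsed part: (7sec)
    }
}
```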
It is also worth opening the relevant JSP files of Hadoop, in its mapreduce/src/webapps/job/ directory, to see how the JobTracker web UI displays this information. I derived the code above from jobtasks.jsp.
Hope it helps. :)
Source: https://stackoverflow.com/questions/16180654/how-to-get-completed-jobs-statistics-executed-by-hadoop