How can I submit a Cascading job to a remote YARN cluster from Java?

问题

I know that I can submit a Cascading job by packaging it into a JAR, as detailed in the Cascading user guide. That job will then run on my cluster if I manually submit it using hadoop jar CLI command.

However, in the original Hadoop 1 Cascading version, it was possible to submit a job to the cluster by setting certain properties on the Hadoop JobConf. Setting fs.defaultFS and mapred.job.tracker caused the local Hadoop library to automatically attempt to submit the job to the Hadoop1 JobTracker. However, setting these properties does not seem to work in the newer version. Submitting to a CDH5 5.2.1 Hadoop cluster using Cascading version 2.5.3 (which lists CDH5 as a supported platform) leads to an IPC exception when negotiating with the server, as detailed below.

I believe that this platform combination -- Cascading 2.5.6, Hadoop 2, CDH 5, YARN, and the MR1 API for submission -- is a supported combination based on the compatibility table (see under "Prior Releases" heading). And submitting the job using hadoop jar works fine on this same cluster. Port 8031 is open between the submitting host and the ResourceManager. An error with the same message is found in the ResourceManager logs on the server side.

I am using the cascading-hadoop2-mr1 library.

Exception in thread "main" cascading.flow.FlowException: unhandled exception
    at cascading.flow.BaseFlow.complete(BaseFlow.java:894)
    at WordCount.main(WordCount.java:91)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.RpcServerException): Unknown rpc kind in rpc headerRPC_WRITABLE
    at org.apache.hadoop.ipc.Client.call(Client.java:1411)
    at org.apache.hadoop.ipc.Client.call(Client.java:1364)
    at org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:231)
    at org.apache.hadoop.mapred.$Proxy11.getStagingAreaDir(Unknown Source)
    at org.apache.hadoop.mapred.JobClient.getStagingAreaDir(JobClient.java:1368)
    at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:102)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:982)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:976)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:976)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:950)
    at cascading.flow.hadoop.planner.HadoopFlowStepJob.internalNonBlockingStart(HadoopFlowStepJob.java:105)
    at cascading.flow.planner.FlowStepJob.blockOnJob(FlowStepJob.java:196)
    at cascading.flow.planner.FlowStepJob.start(FlowStepJob.java:149)
    at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:124)
    at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:43)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

Demo code is below, which is basically identical to the WordCount sample from the Cascading user guide.

public class WordCount {

    public static void main(String[] args) {
        String inputPath = "/user/vagrant/wordcount/input";
        String outputPath = "/user/vagrant/wordcount/output";

        Scheme sourceScheme = new TextLine( new Fields( "line" ) );
        Tap source = new Hfs( sourceScheme, inputPath );

        Scheme sinkScheme = new TextDelimited( new Fields( "word", "count" ) );
        Tap sink = new Hfs( sinkScheme, outputPath, SinkMode.REPLACE );

        Pipe assembly = new Pipe( "wordcount" );


        String regex = "(?<!\\pL)(?=\\pL)[^ ]*(?<=\\pL)(?!\\pL)";
        Function function = new RegexGenerator( new Fields( "word" ), regex );
        assembly = new Each( assembly, new Fields( "line" ), function );


        assembly = new GroupBy( assembly, new Fields( "word" ) );

        Aggregator count = new Count( new Fields( "count" ) );
        assembly = new Every( assembly, count );

        Properties properties = AppProps.appProps()
            .setName( "word-count-application" )
            .setJarClass( WordCount.class )
            .buildProperties();

        properties.put("fs.defaultFS", "hdfs://192.168.30.101");
        properties.put("mapred.job.tracker", "192.168.30.101:8032");

        FlowConnector flowConnector = new HadoopFlowConnector( properties );
        Flow flow = flowConnector.connect( "word-count", source, sink, assembly );

        flow.complete();
    }
}

I've also tried setting a bunch of other properties to try to get it working:

mapreduce.jobtracker.address
mapreduce.framework.name
yarn.resourcemanager.address
yarn.resourcemanager.host
yarn.resourcemanager.hostname
yarn.resourcemanager.resourcetracker.address

None of these worked, they just cause the job to run in local mode (unless mapred.job.tracker is also set).

回答1:

I've now resolved this problem. It comes from trying to use the older Hadoop classes that Cloudera distributes, particularly JobClient. This will happen if you use hadoop-core with the provided 2.5.0-mr1-cdh5.2.1 version, or the hadoop-client dependency with this same version number. Although this claims to be the MR1 version, and we are using the MR1 API to submit, this version actually ONLY supports submission to the Hadoop1 JobTracker, and it does not support YARN.

In order to allow submitting to YARN, you must use the hadoop-client dependency with the non-MR1 2.5.0-cdh5.2.1 version, which still supports submission of MR1 jobs to YARN.

来源：https://stackoverflow.com/questions/27606807/how-can-i-submit-a-cascading-job-to-a-remote-yarn-cluster-from-java

标签

java

Hadoop

MapReduce

yarn

cascading