How to launch and configure an EMR cluster using boto

问题

I'm trying to launch a cluster and run a job all using boto. I find lot's of examples of creating job_flows. But I can't for the life of me, find an example that shows:

How to define the cluster to be used (by clusted_id)
How to configure an launch a cluster (for example, If I want to use spot instances for some task nodes)

Am I missing something?

回答1:

Boto and the underlying EMR API is currently mixing the terms cluster and job flow, and job flow is being deprecated. I consider them synonyms.

You create a new cluster by calling the boto.emr.connection.run_jobflow() function. It will return the cluster ID which EMR generates for you.

First all the mandatory things:

#!/usr/bin/env python

import boto
import boto.emr
from boto.emr.instance_group import InstanceGroup

conn = boto.emr.connect_to_region('us-east-1')

Then we specify instance groups, including the spot price we want to pay for the TASK nodes:

instance_groups = []
instance_groups.append(InstanceGroup(
    num_instances=1,
    role="MASTER",
    type="m1.small",
    market="ON_DEMAND",
    name="Main node"))
instance_groups.append(InstanceGroup(
    num_instances=2,
    role="CORE",
    type="m1.small",
    market="ON_DEMAND",
    name="Worker nodes"))
instance_groups.append(InstanceGroup(
    num_instances=2,
    role="TASK",
    type="m1.small",
    market="SPOT",
    name="My cheap spot nodes",
    bidprice="0.002"))

Finally we start a new cluster:

cluster_id = conn.run_jobflow(
    "Name for my cluster",
    instance_groups=instance_groups,
    action_on_failure='TERMINATE_JOB_FLOW',
    keep_alive=True,
    enable_debugging=True,
    log_uri="s3://mybucket/logs/",
    hadoop_version=None,
    ami_version="2.4.9",
    steps=[],
    bootstrap_actions=[],
    ec2_keyname="my-ec2-key",
    visible_to_all_users=True,
    job_flow_role="EMR_EC2_DefaultRole",
    service_role="EMR_DefaultRole")

We can also print the cluster ID if we care about that:

print "Starting cluster", cluster_id

回答2:

I believe the minimum amount of Python that will launch an EMR cluster with boto3 is:

import boto3

client = boto3.client('emr', region_name='us-east-1')

response = client.run_job_flow(
    Name="Boto3 test cluster",
    ReleaseLabel='emr-5.12.0',
    Instances={
        'MasterInstanceType': 'm4.xlarge',
        'SlaveInstanceType': 'm4.xlarge',
        'InstanceCount': 3,
        'KeepJobFlowAliveWhenNoSteps': True,
        'TerminationProtected': False,
        'Ec2SubnetId': 'my-subnet-id',
        'Ec2KeyName': 'my-key',
    },
    VisibleToAllUsers=True,
    JobFlowRole='EMR_EC2_DefaultRole',
    ServiceRole='EMR_DefaultRole'
)

Notes: you'll have to create EMR_EC2_DefaultRole and EMR_DefaultRole. The Amazon documentation claims that JobFlowRole and ServiceRole are optional, but omitting them did not work for me. That could be because my subnet is a VPC subnet, but I'm not sure.

回答3:

I use the following code to create EMR with flink installed, and includes 3 instance groups. Reference document: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/emr.html#EMR.Client.run_job_flow

import boto3

masterInstanceType = 'm4.large'
coreInstanceType = 'c3.xlarge'
taskInstanceType = 'm4.large'
coreInstanceNum = 2
taskInstanceNum = 2
clusterName = 'my-emr-name'

emrClient = boto3.client('emr')

logUri = 's3://bucket/xxxxxx/'
releaseLabel = 'emr-5.17.0' #emr version
instances = {
    'Ec2KeyName': 'my_keyxxxxxx',
    'Ec2SubnetId': 'subnet-xxxxxx',
    'ServiceAccessSecurityGroup': 'sg-xxxxxx',
    'EmrManagedMasterSecurityGroup': 'sg-xxxxxx',
    'EmrManagedSlaveSecurityGroup': 'sg-xxxxxx',
    'KeepJobFlowAliveWhenNoSteps': True,
    'TerminationProtected': False,
    'InstanceGroups': [{
        'InstanceRole': 'MASTER',
        "InstanceCount": 1,
            "InstanceType": masterInstanceType,
            "Market": "SPOT",
            "Name": "Master"
        }, {
            'InstanceRole': 'CORE',
            "InstanceCount": coreInstanceNum,
            "InstanceType": coreInstanceType,
            "Market": "SPOT",
            "Name": "Core",
        }, {
            'InstanceRole': 'TASK',
            "InstanceCount": taskInstanceNum,
            "InstanceType": taskInstanceType,
            "Market": "SPOT",
            "Name": "Core",
        }
    ]
}
bootstrapActions = [{
    'Name': 'Log to Cloudwatch Logs',
    'ScriptBootstrapAction': {
        'Path': 's3://mybucket/bootstrap_cwl.sh'
    }
}, {
    'Name': 'Custom action',
    'ScriptBootstrapAction': {
        'Path': 's3://mybucket/install.sh'
    }
}]
applications = [{'Name': 'Flink'}]
serviceRole = 'EMR_DefaultRole'
jobFlowRole = 'EMR_EC2_DefaultRole'
tags = [{'Key': 'keyxxxxxx', 'Value': 'valuexxxxxx'},
        {'Key': 'key2xxxxxx', 'Value': 'value2xxxxxx'}
        ]
steps = [
    {
        'Name': 'Run Flink',
        'ActionOnFailure': 'TERMINATE_JOB_FLOW',
        'HadoopJarStep': {
            'Jar': 'command-runner.jar',
            'Args': ['flink', 'run',
                     '-m', 'yarn-cluster',
                     '-p', str(taskInstanceNum),
                     '-yjm', '1024',
                     '-ytm', '1024',
                     '/home/hadoop/test-1.0-SNAPSHOT.jar'
                     ]
        }
    },
]
response = emrClient.run_job_flow(
    Name=clusterName,
    LogUri=logUri,
    ReleaseLabel=releaseLabel,
    Instances=instances,
    Steps=steps,
    Configurations=configurations,
    BootstrapActions=bootstrapActions,
    Applications=applications,
    ServiceRole=serviceRole,
    JobFlowRole=jobFlowRole,
    Tags=tags
)

回答4:

My Step Arguments are: bash -c /usr/bin/flink run -m yarn-cluster -yn 2 /home/hadoop/mysflinkjob.jar

Trying execute same run_job_flow, but getting error:

Cannot run program "/usr/bin/flink run -m yarn-cluster -yn 2 /home/hadoop/mysflinkjob.jar" (in directory "."): error=2, No such file or directory

Executing same command from Master node working fine, but not from Python boto3

Seems like issue is due to quotation marks which EMR or boto3 add into Arguments.

UPDATE:

Split ALL your Arguments with white-space. I mean if you need to execute "flink run myflinkjob.jar" pass your Arguments as this list:

['flink','run','myflinkjob.jar']

来源：https://stackoverflow.com/questions/26314316/how-to-launch-and-configure-an-emr-cluster-using-boto

标签

python

amazon-web-services

boto

amazon-emr