I\'m trying to migrate a couple of MR jobs that I have written in python from AWS EMR 2.4 to AWS EMR 5.0. Till now I was using boto 2.4, but it doesn\'t support EMR 5.0, so
It's unfortunate that boto3 and EMR API are rather poorly documented. Minimally, the word counting example would look as follows:
import boto3
emr = boto3.client('emr')
resp = emr.run_job_flow(
Name='myjob',
ReleaseLabel='emr-5.0.0',
Instances={
'InstanceGroups': [
{'Name': 'master',
'InstanceRole': 'MASTER',
'InstanceType': 'c1.medium',
'InstanceCount': 1,
'Configurations': [
{'Classification': 'yarn-site',
'Properties': {'yarn.nodemanager.vmem-check-enabled': 'false'}}]},
{'Name': 'core',
'InstanceRole': 'CORE',
'InstanceType': 'c1.medium',
'InstanceCount': 1,
'Configurations': [
{'Classification': 'yarn-site',
'Properties': {'yarn.nodemanager.vmem-check-enabled': 'false'}}]},
]},
Steps=[
{'Name': 'My word count example',
'HadoopJarStep': {
'Jar': 'command-runner.jar',
'Args': [
'hadoop-streaming',
'-files', 's3://mybucket/wordSplitter.py#wordSplitter.py',
'-mapper', 'python2.7 wordSplitter.py',
'-input', 's3://mybucket/input/',
'-output', 's3://mybucket/output/',
'-reducer', 'aggregate']}
}
],
JobFlowRole='EMR_EC2_DefaultRole',
ServiceRole='EMR_DefaultRole',
)
I don't remember needing to do this with boto, but I have had issues running the simple streaming job properly without disabling vmem-check-enabled
.
Also, if your script is located somewhere on S3, download it using -files
(appending #filename
to the argument make the downloaded file available as filename
in the cluster).