Question
I need to move the files from S3 to EMR's local dir /home/hadoop programmatically using Lambda.
Currently, S3DistCp copies the files over to HDFS. I then log in to EMR and run an HDFS CopyToLocal command on the command line to get the files into /home/hadoop.
Is there a programmatic way, using boto3 in Lambda, to copy from S3 to EMR's local dir?
Answer 1:
I wrote a test Lambda function to submit a job step to EMR that copies files from S3 to EMR's local dir. This worked.
import boto3

emrclient = boto3.client('emr', region_name='us-west-2')

def lambda_handler(event, context):
    # Find all active EMR clusters
    EMRS = emrclient.list_clusters(
        ClusterStates=['STARTING', 'RUNNING', 'WAITING']
    )
    clusters = EMRS["Clusters"]
    print(clusters)
    for cluster in clusters:
        ID = cluster["Id"]
        # Submit a step that copies the S3 files to the cluster's local dir.
        # command-runner.jar executes the given command on the master node.
        response = emrclient.add_job_flow_steps(
            JobFlowId=ID,
            Steps=[
                {
                    'Name': 'AWS S3 Copy',
                    'ActionOnFailure': 'CONTINUE',
                    'HadoopJarStep': {
                        'Jar': 'command-runner.jar',
                        'Args': ["aws", "s3", "cp", "s3://XXX/",
                                 "/home/hadoop/copy/", "--recursive"],
                    }
                }
            ],
        )
If there are better ways to do the copy, please do let me know.
Answer 2:
That would need a way for the AWS Lambda function to remotely trigger the CopyToLocal command on the cluster. The Lambda function could call add-steps to request that the cluster run a script (or command) that performs this action.
Source: https://stackoverflow.com/questions/56623774/copy-files-from-s3-to-emr-local-using-lambda