elastic-map-reduce | 易学教程

AWS EMR S3DistCp: The auxService:mapreduce_shuffle does not exist

Question: I am connected to an AWS EMR v5.4.0 instance over SSH and I want to call s3distcp. This link demonstrates how to set up an EMR step to call it, but when I run it I get the following error: Container launch failed for container_1492469375740_0001_01_000002 : org.apache.hadoop.yarn.exceptions.InvalidAuxServiceException: The auxService:mapreduce_shuffle does not exist at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance

Spark + EMR using Amazon's “maximizeResourceAllocation” setting does not use all cores/vcores

Question: I'm running an EMR cluster (version emr-4.2.0) for Spark using the Amazon-specific maximizeResourceAllocation flag as documented here. According to those docs, "this option calculates the maximum compute and memory resources available for an executor on a node in the core node group and sets the corresponding spark-defaults settings with this information". I'm running the cluster using m3.2xlarge instances for the worker nodes. I'm using a single m3.xlarge for the YARN master - the smallest

Drop all partitions from a hive table?

Question: How can I drop all partitions currently loaded in a Hive table? I can drop a single partition with alter table <table> drop partition(a=, b=...); I can load all partitions with the recover partitions statement. But I cannot seem to drop all partitions. I'm using the latest Hive version supported by EMR, 0.8.1. Answer 1: As of version 0.9.0 you can use comparators in the drop partition statement, which can be used to drop all partitions at once. An example, taken from the drop_partitions_filter.q

AWS EMR Error : All slaves in the job flow were terminated

Question: I am using the Elastic MapReduce infrastructure on Amazon AWS. A job flow got terminated automatically. The last state change reason according to the Amazon console is: "All slaves in the job flow were terminated". Create job flow command: elastic-mapreduce --create --name MyCluster --alive --instance-group master --instance-type m1.xlarge --instance-count 1 --bid-price 2.0 --instance-group core --instance-type m1.xlarge --instance-count 10 --bid-price 2.0 --hive-interactive --enable-debugging Details about

Amazon Web Service EMR FileSystem

Question: I am trying to run a job on an AWS EMR cluster. The problem I'm getting is the following: aws java.io.IOException: No FileSystem for scheme: hdfs I don't know where exactly my problem resides (in my Java jar job or in the configuration of the job). In my S3 bucket I'm making a folder (input) and putting a bunch of files with my data in it. Then in the arguments I'm giving the path to the input folder, and that same path is then used as the FileInputPath.getInputPath(args[0]). My question is -

Tool/Ways to schedule Amazon's Elastic MapReduce jobs

Question: I use EMR to create new instances, process the jobs, and then shut down the instances. My requirement is to schedule jobs periodically. One easy implementation would be to use Quartz to trigger EMR jobs. But in the longer run I am interested in using an out-of-the-box MapReduce scheduling solution. My question is: is there any out-of-the-box scheduling feature provided by EMR or the AWS SDK that I can use for my requirement? I can see there is scheduling in Auto Scaling, but I want to

Downloading files from FTP to local using Java makes the file unreadable - encoding issues

Question: I have developed code that reads very large files from FTP and writes them to the local machine using Java. The code that does it is as follows. This is a part of the next(Text key, Text value) method inside the RecordReader of the CustomInputFormat: if(!processed) { System.out.println("in processed"); in = fs.open(file); processed=true; } while(bytesRead <= fileSize) { byte buf[] = new byte[1024]; try { in.read(buf); in.skip(1024); bytesRead+=1024; long diff = fileSize-bytesRead; if(diff<1024) {
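
For comparison with the loop in the excerpt above, here is a minimal sketch of a conventional byte-copy loop in Java. The excerpt ignores the return value of in.read(buf) and then calls in.skip(1024), which discards bytes after every read. That may be related to the unreadable output, though the excerpt is truncated, so this is only a guess. The class name StreamCopy and the method copy are hypothetical, and in-memory streams stand in for the FTP input and the local file.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical helper, not taken from the original question.
public class StreamCopy {

    // Copies every byte from in to out and returns the number of bytes copied.
    static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buf = new byte[1024];
        long total = 0;
        int n;
        // read() may fill only part of the buffer; it returns -1 at end of stream.
        while ((n = in.read(buf)) != -1) {
            // Write only the bytes actually read; no skip() call is needed,
            // because read() already advances the stream position.
            out.write(buf, 0, n);
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // In-memory stand-ins for the remote file and the local file (assumption).
        byte[] data = new byte[2500];
        for (int i = 0; i < data.length; i++) {
            data[i] = (byte) i;
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        long copied = copy(new ByteArrayInputStream(data), out);
        System.out.println(copied); // prints 2500
    }
}
```

Working with raw bytes end to end, rather than converting through String or Text in between, also avoids any character-encoding conversion of the file contents.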

Exception in thread “main” org.elasticsearch.client.transport.NoNodeAvailableException: No node available

Question: I am trying to index using the Java code below in Elasticsearch. I put my machine's IP in the code, but it is unable to connect to the node. It gives the following error: Exception in thread "main" org.elasticsearch.client.transport.NoNodeAvailableException: No node available at org.elasticsearch.client.transport.TransportClientNodesService.execute(TransportClientNodesService.java:219) at org.elasticsearch.client.transport.support.InternalTransportClient.execute(InternalTransportClient.java:106) at org

How to know job flow id, other cluster parameters in script running via script-runner.jar

Question: I'm starting an Elastic MapReduce cluster with the following command line: $ elastic-mapreduce \ --create \ --num-instances "${INSTANCES}" \ --instance-type m1.medium \ --ami-version 3.0.4 \ --name "${CLUSTER_NAME}" \ --log-uri "s3://my-bucket/elasticmapreduce/logs" \ --step-name "${STEP_NAME}" \ --step-action TERMINATE_JOB_FLOW \ --jar s3://elasticmapreduce/libs/script-runner/script-runner.jar \ --arg s3://my-bucket/log-parser/code/hadoop-script.sh \ --arg "${CLUSTER_NAME}" \ --arg "${STEP

Amazon Elastic MapReduce Bootstrap Actions not working

Question: I have tried the following combinations of bootstrap actions to increase the heap size of my job, but none of them seem to work: --mapred-key-value mapred.child.java.opts=-Xmx1024m --mapred-key-value mapred.child.ulimit=unlimited --mapred-key-value mapred.map.child.java.opts=-Xmx1024m --mapred-key-value mapred.map.child.ulimit=unlimited -m mapred.map.child.java.opts=-Xmx1024m -m mapred.map.child.ulimit=unlimited -m mapred.child.java.opts=-Xmx1024m -m mapred.child.ulimit=unlimited What is the