elastic-map-reduce

AWS EMR S3DistCp: The auxService:mapreduce_shuffle does not exist

久未见 submitted on 2021-02-07 20:40:23
Question: I am connected to an AWS EMR v5.4.0 instance over SSH and I want to call s3distcp. This link demonstrates how to set up an EMR step to call it, but when I run it I get the following error: Container launch failed for container_1492469375740_0001_01_000002 : org.apache.hadoop.yarn.exceptions.InvalidAuxServiceException: The auxService:mapreduce_shuffle does not exist at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance

Spark + EMR using Amazon's “maximizeResourceAllocation” setting does not use all cores/vcores

强颜欢笑 submitted on 2020-08-20 18:01:06
Question: I'm running an EMR cluster (version emr-4.2.0) for Spark using the Amazon-specific maximizeResourceAllocation flag as documented here. According to those docs, "this option calculates the maximum compute and memory resources available for an executor on a node in the core node group and sets the corresponding spark-defaults settings with this information". I'm running the cluster using m3.2xlarge instances for the worker nodes. I'm using a single m3.xlarge for the YARN master - the smallest
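For reference, the flag discussed in this question is normally switched on through the `spark` configuration classification supplied when the cluster is created. The sketch below only builds that configuration as JSON; the helper name `spark_max_allocation_config` is hypothetical, and the snippet neither creates a cluster nor shows how EMR derives the executor settings.

```python
import json


def spark_max_allocation_config(enabled=True):
    """Build the EMR configuration list that toggles maximizeResourceAllocation.

    The function name is hypothetical; the classification and property
    names are the ones used for Spark on EMR release 4.x and later.
    """
    return [
        {
            "Classification": "spark",
            "Properties": {
                # EMR expects the value as the string "true" or "false".
                "maximizeResourceAllocation": "true" if enabled else "false",
            },
        }
    ]


if __name__ == "__main__":
    # Print JSON that could be saved to a file and supplied at cluster creation.
    print(json.dumps(spark_max_allocation_config(), indent=2))
```

The resulting JSON is the kind of document usually passed as the cluster's configurations at launch time.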

Drop all partitions from a hive table?

蓝咒 submitted on 2020-07-17 09:48:25
Question: How can I drop all partitions currently loaded in a Hive table? I can drop a single partition with alter table <table> drop partition(a=, b=...); I can load all partitions with the recover partitions statement. But I cannot seem to drop all partitions. I'm using the latest Hive version supported by EMR, 0.8.1.
Answer 1: As of version 0.9.0 you can use comparators in the drop partition statement, which may be used to drop all partitions at once. An example, taken from the drop_partitions_filter.q
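As a rough illustration of the comparator form the answer describes (this is not the elided example from drop_partitions_filter.q), the sketch below builds an ALTER TABLE ... DROP PARTITION statement whose partition filter uses a comparison operator. The helper name `drop_partitions_sql`, the table name `logs` and the partition column `dt` are hypothetical, and whether a single filter matches every partition depends on the actual partition values.

```python
import re

# Comparison operators accepted by this sketch.
ALLOWED_OPERATORS = {"=", "<", "<=", ">", ">=", "<>"}
IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def drop_partitions_sql(table, column, operator, value):
    """Return a Hive statement dropping partitions that match a comparison.

    Hypothetical helper: it only formats the statement text and does not
    connect to Hive.
    """
    if not IDENTIFIER.match(table) or not IDENTIFIER.match(column):
        raise ValueError("table and column must be plain identifiers")
    if operator not in ALLOWED_OPERATORS:
        raise ValueError("unsupported operator: " + operator)
    if "'" in value or "\\" in value:
        raise ValueError("value must not contain quotes or backslashes")
    return "ALTER TABLE {} DROP PARTITION ({} {} '{}');".format(
        table, column, operator, value
    )


if __name__ == "__main__":
    # With string-valued partitions such as dt='2020-01-01', a very low
    # bound matches every partition (an assumption about the data).
    print(drop_partitions_sql("logs", "dt", ">", "0"))
```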

AWS EMR Error : All slaves in the job flow were terminated

折月煮酒 submitted on 2020-07-10 06:37:33
Question: I am using the Elastic MapReduce infrastructure on Amazon AWS. A job flow got terminated automatically. The last state change reason according to the Amazon Console is: "All slaves in the job flow were terminated". Create job flow command: elastic-mapreduce --create --name MyCluster --alive --instance-group master --instance-type m1.xlarge --instance-count 1 --bid-price 2.0 --instance-group core --instance-type m1.xlarge --instance-count 10 --bid-price 2.0 --hive-interactive --enable-debugging Details about

Amazon Web Service EMR FileSystem

一个人想着一个人 submitted on 2020-01-25 01:06:29
Question: I am trying to run a job on an AWS EMR cluster. The problem I'm getting is the following: aws java.io.IOException: No FileSystem for scheme: hdfs I don't know where exactly my problem resides (in my Java jar job or in the configuration of the job). In my S3 bucket I'm making a folder (input) and putting a bunch of files with my data in it. Then in the arguments I'm giving the path for the input folder, and that same path is used as the FileInputPath.getInputPath(args[0]). My question is -

Tool/Ways to schedule Amazon's Elastic MapReduce jobs

…衆ロ難τιáo~ submitted on 2020-01-24 10:26:12
Question: I use EMR to create new instances, process the jobs, and then shut down the instances. My requirement is to schedule jobs periodically. One easy implementation would be to use Quartz to trigger EMR jobs. But in the longer run I am interested in using an out-of-the-box MapReduce scheduling solution. My question is: is there any out-of-the-box scheduling feature provided by EMR or the AWS SDK that I can use for my requirement? I can see there is scheduling in Auto Scaling, but I want to

Downloading files from FTP to local using Java makes the file unreadable - encoding issues

二次信任 submitted on 2020-01-07 02:35:08
Question: I have developed code that reads very large files from FTP and writes them to the local machine using Java. The code that does it is as follows. This is a part of the next(Text key, Text value) method inside the RecordReader of the CustomInputFormat: if(!processed) { System.out.println("in processed"); in = fs.open(file); processed=true; } while(bytesRead <= fileSize) { byte buf[] = new byte[1024]; try { in.read(buf); in.skip(1024); bytesRead+=1024; long diff = fileSize-bytesRead; if(diff<1024) {
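A general point about read loops like the one in this excerpt: a stream's read call may return fewer bytes than requested, so a copy loop is usually written to use what the read actually returned and to stop at end of stream, with no separate skip call. The sketch below shows that pattern in Python using in-memory streams; it is not the poster's code or a confirmed fix for the question, and the names `copy_stream` and `chunk_size` are hypothetical.

```python
import io


def copy_stream(src, dst, chunk_size=1024):
    """Copy src to dst in chunks and return the number of bytes written."""
    total = 0
    while True:
        chunk = src.read(chunk_size)  # may return fewer than chunk_size bytes
        if not chunk:  # an empty result signals end of stream
            break
        dst.write(chunk)  # write exactly the bytes that were read
        total += len(chunk)
    return total


if __name__ == "__main__":
    # 2564 bytes: deliberately not a multiple of the chunk size.
    data = bytes(range(256)) * 10 + b"tail"
    source, target = io.BytesIO(data), io.BytesIO()
    copied = copy_stream(source, target)
    print(copied, target.getvalue() == data)
```

Because the loop advances only by the length of each chunk, the final partial chunk needs no special handling.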

Exception in thread “main” org.elasticsearch.client.transport.NoNodeAvailableException: No node available

拈花ヽ惹草 submitted on 2020-01-06 14:41:33
Question: I am trying to index using the Java code below in Elasticsearch. I gave my machine's IP in the code. It is unable to connect to the node. It gives the error below: Exception in thread "main" org.elasticsearch.client.transport.NoNodeAvailableException: No node available at org.elasticsearch.client.transport.TransportClientNodesService.execute(TransportClientNodesService.java:219) at org.elasticsearch.client.transport.support.InternalTransportClient.execute(InternalTransportClient.java:106) at org

How to know job flow id, other cluster parameters in script running via script-runner.jar

家住魔仙堡 submitted on 2020-01-02 18:37:36
Question: I'm starting an Elastic MapReduce cluster with the following command line: $ elastic-mapreduce \ --create \ --num-instances "${INSTANCES}" \ --instance-type m1.medium \ --ami-version 3.0.4 \ --name "${CLUSTER_NAME}" \ --log-uri "s3://my-bucket/elasticmapreduce/logs" \ --step-name "${STEP_NAME}" \ --step-action TERMINATE_JOB_FLOW \ --jar s3://elasticmapreduce/libs/script-runner/script-runner.jar \ --arg s3://my-bucket/log-parser/code/hadoop-script.sh \ --arg "${CLUSTER_NAME}" \ --arg "${STEP
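One approach often used for the question in the title is to read the cluster metadata that EMR writes to the node's local disk. The sketch below assumes that file is `/mnt/var/lib/info/job-flow.json` and that it carries a `jobFlowId` key; both are assumptions that may not hold for every AMI or release version, and the helper name `read_job_flow_id` is hypothetical.

```python
import json

# Assumed location of the cluster metadata file on an EMR node.
DEFAULT_JOB_FLOW_INFO = "/mnt/var/lib/info/job-flow.json"


def read_job_flow_id(path=DEFAULT_JOB_FLOW_INFO):
    """Return the job flow id recorded in the metadata file, or None.

    Hypothetical helper: the file layout is an assumption, so a missing
    key yields None instead of raising an error.
    """
    with open(path, encoding="utf-8") as handle:
        info = json.load(handle)
    return info.get("jobFlowId")


if __name__ == "__main__":
    try:
        print(read_job_flow_id())
    except FileNotFoundError:
        # Expected when this is run anywhere other than an EMR node.
        print("metadata file not found")
```

A script launched through script-runner.jar could call such a helper instead of receiving the id as an argument.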

Amazon Elastic MapReduce Bootstrap Actions not working

谁说我不能喝 submitted on 2020-01-01 06:51:06
Question: I have tried the following combinations of bootstrap actions to increase the heap size of my job, but none of them seem to work: --mapred-key-value mapred.child.java.opts=-Xmx1024m --mapred-key-value mapred.child.ulimit=unlimited --mapred-key-value mapred.map.child.java.opts=-Xmx1024m --mapred-key-value mapred.map.child.ulimit=unlimited -m mapred.map.child.java.opts=-Xmx1024m -m mapred.map.child.ulimit=unlimited -m mapred.child.java.opts=-Xmx1024m -m mapred.child.ulimit=unlimited What is the