amazon-emr

Hive query shows few reducers killed but query is still running. Will the output be proper?

∥☆過路亽.° Submitted on 2020-01-16 08:36:33
Question: I have a complex query with multiple left outer joins that has been running for the last hour in Amazon EMR, but a few reducers are shown as Failed and Killed. My question is: why do some reducers get killed, and will the final output still be correct?
Answer 1: Usually each container gets 3 attempts before it finally fails (configurable, as @rbyndoor mentioned). If one attempt fails, the task is restarted until the number of attempts reaches the limit; if it still fails, the whole vertex is failed, all other tasks being…
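The attempt limit the answer refers to is configurable. Below is a minimal sketch of the configuration classifications that could raise the per-task retry limit when the cluster is created; the property names are standard Hadoop/Tez settings and the values shown are assumptions, so verify them against the execution engine (MapReduce vs. Tez) and EMR release actually in use.

```python
import json

# Hedged sketch: raise the per-task attempt limits mentioned in the answer.
# The classifications and property names below should be checked against
# your EMR release and Hive execution engine before use.
configurations = [
    {
        "Classification": "mapred-site",
        "Properties": {
            # MapReduce engine: attempts allowed per map/reduce task
            "mapreduce.map.maxattempts": "6",
            "mapreduce.reduce.maxattempts": "6",
        },
    },
    {
        "Classification": "tez-site",
        "Properties": {
            # Tez engine (Hive on Tez): attempts allowed per task
            "tez.am.task.max.failed.attempts": "6",
        },
    },
]

# This JSON can be supplied as the cluster's Configurations
# (console, CLI --configurations, or boto3 run_job_flow).
print(json.dumps(configurations, indent=2))
```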

How to run 2 EMR Spark Step Concurrently?

吃可爱长大的小学妹 Submitted on 2020-01-11 02:31:30
Question: I am trying to have two steps run concurrently in EMR, but I always get the first step running and the second pending. Part of my YARN configuration is as follows: { "Classification": "capacity-scheduler", "Properties": { "yarn.scheduler.capacity.resource-calculator": "org.apache.hadoop.yarn.util.resource.DominantResourceCalculator", "yarn.scheduler.capacity.maximum-am-resource-percent": "0.5" } } When I run on my local Mac I am able to run the two applications on YARN with a similar configuration…
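A hedged sketch of one way to approach this: on newer EMR releases, step concurrency is controlled by the StepConcurrencyLevel setting on the cluster itself (on older releases steps run strictly one at a time), and the capacity-scheduler settings from the question must still leave room for a second application master. The release label, region, instance types, and roles below are placeholders, not the poster's setup.

```python
import boto3

# Hedged sketch, not the poster's exact configuration.
# Two things matter for running two steps at once:
#  1) StepConcurrencyLevel lets the EMR step scheduler run several steps
#     concurrently (available on newer EMR releases).
#  2) YARN must have capacity for a second application master, or the
#     second Spark job stays in ACCEPTED/pending.
emr = boto3.client("emr", region_name="us-east-1")  # region is an assumption

response = emr.run_job_flow(
    Name="two-concurrent-spark-steps",              # hypothetical name
    ReleaseLabel="emr-5.28.0",                      # assumed release with step concurrency
    Applications=[{"Name": "Spark"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
    StepConcurrencyLevel=2,                         # allow 2 steps to run at once
    Configurations=[
        {
            "Classification": "capacity-scheduler",
            "Properties": {
                "yarn.scheduler.capacity.resource-calculator":
                    "org.apache.hadoop.yarn.util.resource.DominantResourceCalculator",
                "yarn.scheduler.capacity.maximum-am-resource-percent": "0.5",
            },
        }
    ],
)
print(response["JobFlowId"])
```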

Structured streaming won't write DF to file sink citing /_spark_metadata/9.compact doesn't exist

自作多情 Submitted on 2020-01-10 23:29:21
Question: I'm building a Kafka ingest module on EMR 5.11.1 with Spark 2.2.1. My intention is to use Structured Streaming to consume from a Kafka topic, do some processing, and store to EMRFS/S3 in Parquet format. The console sink works as expected; the file sink does not. In spark-shell: val event = spark.readStream.format("kafka") .option("kafka.bootstrap.servers", <server list>) .option("subscribe", <topic>) .load() val eventdf = event.select($"value" cast "string" as "json") .select(from_json($"json",…
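For reference, here is a hedged PySpark sketch of the same Kafka-to-Parquet pipeline (the poster used Scala in spark-shell). Errors about files under _spark_metadata often involve the file sink's metadata and checkpoint state getting out of sync, so the sketch makes the checkpoint location explicit. The broker list, topic, schema, and S3 paths are all placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType

# Hedged PySpark sketch; all names and paths below are assumptions.
spark = (SparkSession.builder
         .appName("kafka-to-parquet-sketch")
         .getOrCreate())

schema = StructType([StructField("id", StringType()),       # hypothetical schema
                     StructField("payload", StringType())])

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")  # placeholder
          .option("subscribe", "my-topic")                    # placeholder
          .load()
          .select(from_json(col("value").cast("string"), schema).alias("event"))
          .select("event.*"))

# The file sink keeps state in checkpointLocation and writes a _spark_metadata
# directory next to the output; both must stay consistent, so keep them at
# stable locations and avoid deleting one without the other.
query = (events.writeStream
         .format("parquet")
         .option("path", "s3://my-bucket/events/")            # placeholder
         .option("checkpointLocation", "s3://my-bucket/checkpoints/events/")
         .outputMode("append")
         .start())

query.awaitTermination()
```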

Downloading files from FTP to local using Java makes the file unreadable - encoding issues

二次信任 Submitted on 2020-01-07 02:35:08
Question: I have developed code that reads very large files from FTP and writes them to the local machine using Java. The code that does this is as follows; it is part of the next(Text key, Text value) method inside the RecordReader of the CustomInputFormat: if(!processed) { System.out.println("in processed"); in = fs.open(file); processed=true; } while(bytesRead <= fileSize) { byte buf[] = new byte[1024]; try { in.read(buf); in.skip(1024); bytesRead+=1024; long diff = fileSize-bytesRead; if(diff<1024) {…
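A likely culprit in loops like the one quoted is ignoring the byte count returned by read() and then calling skip(1024), which silently discards data and corrupts the copy. Below is a hedged Python sketch of a correct chunked copy loop (the poster's code is Java, but the fix is the same in any language): write exactly the bytes that read() returned and stop only at end of file. The file names are placeholders; an FTP or HDFS stream opened with fs.open(file) would play the role of the source here.

```python
# Hedged sketch of a lossless chunked copy; paths and names are placeholders.

def copy_stream(src, dst, chunk_size=1024):
    """Copy src (a binary file-like object) to dst without losing bytes."""
    total = 0
    while True:
        buf = src.read(chunk_size)
        if not buf:              # EOF: read() returned no bytes
            break
        dst.write(buf)           # write only what was actually read, no skip()
        total += len(buf)
    return total

if __name__ == "__main__":
    with open("remote_copy.bin", "rb") as src, open("local_copy.bin", "wb") as dst:
        print(copy_stream(src, dst), "bytes copied")
```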

Is there a way to setup bootstrap actions to run on EMR after core services are installed (Spark etc)?

随声附和 Submitted on 2020-01-06 05:50:08
Question: Is there a way to set up bootstrap actions to run on EMR after core services (Spark etc.) are installed? I am using emr-5.27.0.
Answer 1: You can submit a script as a step rather than as a bootstrap action. For example, I made an SSL certificate update script and it is applied to the EMR cluster as a step. This is part of my Lambda function, written in Python, but you can also add the step manually on the console or from other languages. Steps=[{ 'Name': 'PrestoCertificate', 'ActionOnFailure': 'CONTINUE', 'HadoopJarStep':…
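A hedged, more complete sketch of the approach the answer describes: submit the script as a step with boto3 so it runs only after the applications are installed. The cluster ID, region, and S3 script path below are placeholders, and the script-runner bucket is region-specific.

```python
import boto3

# Hedged sketch completing the answer's idea: run a shell script from S3 as an
# EMR step, so it executes after Spark and the other applications are installed.
emr = boto3.client("emr", region_name="us-east-1")   # region is an assumption

response = emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",                      # placeholder cluster id
    Steps=[
        {
            "Name": "PrestoCertificate",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                # script-runner.jar downloads and executes a script from S3;
                # the bucket name is region-specific, adjust to your region.
                "Jar": "s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar",
                "Args": ["s3://my-bucket/scripts/update-ssl-cert.sh"],  # placeholder
            },
        }
    ],
)
print(response["StepIds"])
```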

Using GroupBy while copying from HDFS to S3 to merge files within a folder

谁说我不能喝 Submitted on 2020-01-05 08:48:09
Question: I have the following folders in HDFS: hdfs://x.x.x.x:8020/Air/BOOK/AE/DOM/20171001/2017100101 hdfs://x.x.x.x:8020/Air/BOOK/AE/INT/20171001/2017100101 hdfs://x.x.x.x:8020/Air/BOOK/BH/INT/20171001/2017100101 hdfs://x.x.x.x:8020/Air/BOOK/IN/DOM/20171001/2017100101 hdfs://x.x.x.x:8020/Air/BOOK/IN/INT/20171001/2017100101 hdfs://x.x.x.x:8020/Air/BOOK/KW/DOM/20171001/2017100101 hdfs://x.x.x.x:8020/Air/BOOK/KW/INT/20171001/2017100101 hdfs://x.x.x.x:8020/Air/BOOK/ME/INT/20171001/2017100101 hdfs://x.x…
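A hedged sketch of how such a copy is commonly done on EMR: submit an s3-dist-cp step whose --groupBy regular expression concatenates the files within each folder into one output object per group. The cluster ID, bucket, target size, and the regex (its capture groups decide which files are merged together) are all assumptions to adapt to the real layout.

```python
import boto3

# Hedged sketch: merge small files per HDFS folder while copying to S3 with
# s3-dist-cp. Cluster id, bucket, and the --groupBy regex are placeholders.
emr = boto3.client("emr", region_name="us-east-1")

emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",                     # placeholder cluster id
    Steps=[
        {
            "Name": "MergeAndCopyToS3",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "s3-dist-cp",
                    "--src", "hdfs:///Air/BOOK",
                    "--dest", "s3://my-bucket/Air/BOOK",          # placeholder
                    # one merged file per country/market/date/hour folder
                    "--groupBy", ".*/BOOK/(\\w+)/(\\w+)/(\\d+)/(\\d+)/.*",
                    "--targetSize", "512",                        # MB, optional
                ],
            },
        }
    ],
)
```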

EMR creation task and core nodes not able to specify as “Max on demand” for spot pricing

旧巷老猫 Submitted on 2020-01-05 05:03:11
Question: core_instance_group { instance_type = "c4.large" instance_count = 1 ebs_config { size = "40" type = "gp2" volumes_per_instance = 1 } bid_price = "0.30" I would need bid_price to be the maximum on-demand price, but I'm not sure how to pass this parameter in Terraform.
Answer 1: I figured out a way, though it needed a couple of scripts to fetch the price details. Something like this: AWS price command: InstanceType=$1 aws pricing get-products --filters Type=TERM_MATCH,Field=instanceType,Value=${InstanceType} Type=TERM…
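A hedged Python counterpart of the answer's CLI lookup: fetch the on-demand price for an instance type via the AWS Price List API so it can be fed to Terraform (for example through an external data source or a -var flag) as the bid price. The filter values (Linux, shared tenancy, the US East region name) and the response parsing are assumptions; verify them against a real get-products response for your region.

```python
import json
import boto3

# Hedged sketch: look up the on-demand USD price for an EC2 instance type.
pricing = boto3.client("pricing", region_name="us-east-1")  # Pricing API endpoint region

def on_demand_price(instance_type: str) -> str:
    resp = pricing.get_products(
        ServiceCode="AmazonEC2",
        Filters=[
            {"Type": "TERM_MATCH", "Field": "instanceType", "Value": instance_type},
            {"Type": "TERM_MATCH", "Field": "operatingSystem", "Value": "Linux"},
            {"Type": "TERM_MATCH", "Field": "tenancy", "Value": "Shared"},
            {"Type": "TERM_MATCH", "Field": "preInstalledSw", "Value": "NA"},
            {"Type": "TERM_MATCH", "Field": "location", "Value": "US East (N. Virginia)"},
        ],
        MaxResults=1,
    )
    product = json.loads(resp["PriceList"][0])
    # Drill down to the first on-demand price dimension (structure per the
    # public Price List API; check it against an actual response).
    term = next(iter(product["terms"]["OnDemand"].values()))
    dimension = next(iter(term["priceDimensions"].values()))
    return dimension["pricePerUnit"]["USD"]

if __name__ == "__main__":
    print(on_demand_price("c4.large"))
```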

Copy files from S3 to EMR local using Lambda

被刻印的时光 ゝ Submitted on 2020-01-05 04:57:28
Question: I need to move files from S3 to EMR's local directory /home/hadoop programmatically using Lambda. S3DistCp only copies to HDFS, so I then log into EMR and run an hdfs copyToLocal command on the command line to get the files into /home/hadoop. Is there a programmatic way, using boto3 in Lambda, to copy from S3 to EMR's local directory?
Answer 1: I wrote a test Lambda function that submits a job step to EMR to copy files from S3 to EMR's local directory, and this worked. emrclient = boto3.client('emr', region_name='us-west-2')…
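A hedged sketch of the kind of Lambda function the answer describes: it submits an EMR step that runs an aws s3 cp command on the master node, which places the files in /home/hadoop. The cluster ID, bucket, and prefix are placeholders, not taken from the original post.

```python
import boto3

# Hedged sketch: Lambda handler that submits an EMR step copying S3 objects
# to the master node's local /home/hadoop directory. Cluster id and S3 path
# are placeholders.
emrclient = boto3.client("emr", region_name="us-west-2")

def lambda_handler(event, context):
    response = emrclient.add_job_flow_steps(
        JobFlowId="j-XXXXXXXXXXXXX",                 # placeholder cluster id
        Steps=[
            {
                "Name": "CopyS3ToLocal",
                "ActionOnFailure": "CONTINUE",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": [
                        "aws", "s3", "cp",
                        "s3://my-bucket/input/",     # placeholder source prefix
                        "/home/hadoop/",             # EMR master's local dir
                        "--recursive",
                    ],
                },
            }
        ],
    )
    return response["StepIds"]
```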
