amazon-emr

Hive query shows few reducers killed but query is still running. Will the output be proper?

∥☆過路亽.° Submitted on 2020-01-16 08:36:33
Question: I have a complex query with multiple left outer joins that has been running for the last hour in Amazon EMR, but a few reducers are shown as Failed and Killed. My question is: why do some reducers get killed, and will the final output still be correct?
Answer 1: Usually each container gets 3 attempts before it finally fails (configurable, as @rbyndoor mentioned). If one attempt fails, the task is restarted until the number of attempts reaches the limit; if it still fails, the whole vertex is failed, all other tasks being…
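The attempt limit the answer refers to is configurable. Below is a minimal sketch of the configuration classifications that could raise the per-task retry limit when the cluster is created; the property names are standard Hadoop/Tez settings and the values shown are assumptions, so verify them against the execution engine (MapReduce vs. Tez) and EMR release actually in use.

```python
import json

# Hedged sketch: raise the per-task attempt limits mentioned in the answer.
# The classifications and property names below should be checked against
# your EMR release and Hive execution engine before use.
configurations = [
    {
        "Classification": "mapred-site",
        "Properties": {
            # MapReduce engine: attempts allowed per map/reduce task
            "mapreduce.map.maxattempts": "6",
            "mapreduce.reduce.maxattempts": "6",
        },
    },
    {
        "Classification": "tez-site",
        "Properties": {
            # Tez engine (Hive on Tez): attempts allowed per task
            "tez.am.task.max.failed.attempts": "6",
        },
    },
]

# This JSON can be supplied as the cluster's Configurations
# (console, CLI --configurations, or boto3 run_job_flow).
print(json.dumps(configurations, indent=2))
```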

How to run 2 EMR Spark Step Concurrently?

吃可爱长大的小学妹 Submitted on 2020-01-11 02:31:30
Question: I am trying to have two steps run concurrently in EMR, but I always get the first step running and the second pending. Part of my YARN configuration is as follows: { "Classification": "capacity-scheduler", "Properties": { "yarn.scheduler.capacity.resource-calculator": "org.apache.hadoop.yarn.util.resource.DominantResourceCalculator", "yarn.scheduler.capacity.maximum-am-resource-percent": "0.5" } } When I run on my local Mac I am able to run the two applications on YARN with a similar configuration…
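A hedged sketch of one way to approach this: on newer EMR releases, step concurrency is controlled by the StepConcurrencyLevel setting on the cluster itself (on older releases steps run strictly one at a time), and the capacity-scheduler settings from the question must still leave room for a second application master. The release label, region, instance types, and roles below are placeholders, not the poster's setup.

```python
import boto3

# Hedged sketch, not the poster's exact configuration.
# Two things matter for running two steps at once:
#  1) StepConcurrencyLevel lets the EMR step scheduler run several steps
#     concurrently (available on newer EMR releases).
#  2) YARN must have capacity for a second application master, or the
#     second Spark job stays in ACCEPTED/pending.
emr = boto3.client("emr", region_name="us-east-1")  # region is an assumption

response = emr.run_job_flow(
    Name="two-concurrent-spark-steps",              # hypothetical name
    ReleaseLabel="emr-5.28.0",                      # assumed release with step concurrency
    Applications=[{"Name": "Spark"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
    StepConcurrencyLevel=2,                         # allow 2 steps to run at once
    Configurations=[
        {
            "Classification": "capacity-scheduler",
            "Properties": {
                "yarn.scheduler.capacity.resource-calculator":
                    "org.apache.hadoop.yarn.util.resource.DominantResourceCalculator",
                "yarn.scheduler.capacity.maximum-am-resource-percent": "0.5",
            },
        }
    ],
)
print(response["JobFlowId"])
```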

Structured streaming won't write DF to file sink citing /_spark_metadata/9.compact doesn't exist

自作多情 Submitted on 2020-01-10 23:29:21
Question: I'm building a Kafka ingest module on EMR 5.11.1 with Spark 2.2.1. My intention is to use Structured Streaming to consume from a Kafka topic, do some processing, and store to EMRFS/S3 in Parquet format. The console sink works as expected; the file sink does not. In spark-shell: val event = spark.readStream.format("kafka") .option("kafka.bootstrap.servers", <server list>) .option("subscribe", <topic>) .load() val eventdf = event.select($"value" cast "string" as "json") .select(from_json($"json",…
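For reference, here is a hedged PySpark sketch of the same Kafka-to-Parquet pipeline (the poster used Scala in spark-shell). Errors about files under _spark_metadata often involve the file sink's metadata and checkpoint state getting out of sync, so the sketch makes the checkpoint location explicit. The broker list, topic, schema, and S3 paths are all placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType

# Hedged PySpark sketch; all names and paths below are assumptions.
spark = (SparkSession.builder
         .appName("kafka-to-parquet-sketch")
         .getOrCreate())

schema = StructType([StructField("id", StringType()),       # hypothetical schema
                     StructField("payload", StringType())])

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")  # placeholder
          .option("subscribe", "my-topic")                    # placeholder
          .load()
          .select(from_json(col("value").cast("string"), schema).alias("event"))
          .select("event.*"))

# The file sink keeps state in checkpointLocation and writes a _spark_metadata
# directory next to the output; both must stay consistent, so keep them at
# stable locations and avoid deleting one without the other.
query = (events.writeStream
         .format("parquet")
         .option("path", "s3://my-bucket/events/")            # placeholder
         .option("checkpointLocation", "s3://my-bucket/checkpoints/events/")
         .outputMode("append")
         .start())

query.awaitTermination()
```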

Downloading files from FTP to local using Java makes the file unreadable - encoding issues

二次信任 Submitted on 2020-01-07 02:35:08
Question: I have developed code that reads very large files from FTP and writes them to the local machine using Java. The code that does this is as follows; it is part of the next(Text key, Text value) method inside the RecordReader of the CustomInputFormat: if(!processed) { System.out.println("in processed"); in = fs.open(file); processed=true; } while(bytesRead <= fileSize) { byte buf[] = new byte[1024]; try { in.read(buf); in.skip(1024); bytesRead+=1024; long diff = fileSize-bytesRead; if(diff<1024) {…
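A likely culprit in loops like the one quoted is ignoring the byte count returned by read() and then calling skip(1024), which silently discards data and corrupts the copy. Below is a hedged Python sketch of a correct chunked copy loop (the poster's code is Java, but the fix is the same in any language): write exactly the bytes that read() returned and stop only at end of file. The file names are placeholders; an FTP or HDFS stream opened with fs.open(file) would play the role of the source here.

```python
# Hedged sketch of a lossless chunked copy; paths and names are placeholders.

def copy_stream(src, dst, chunk_size=1024):
    """Copy src (a binary file-like object) to dst without losing bytes."""
    total = 0
    while True:
        buf = src.read(chunk_size)
        if not buf:              # EOF: read() returned no bytes
            break
        dst.write(buf)           # write only what was actually read, no skip()
        total += len(buf)
    return total

if __name__ == "__main__":
    with open("remote_copy.bin", "rb") as src, open("local_copy.bin", "wb") as dst:
        print(copy_stream(src, dst), "bytes copied")
```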

Is there a way to setup bootstrap actions to run on EMR after core services are installed (Spark etc)?

随声附和 Submitted on 2020-01-06 05:50:08
Question: Is there a way to set up bootstrap actions to run on EMR after core services (Spark etc.) are installed? I am using emr-5.27.0.
Answer 1: You can submit a script as a step rather than as a bootstrap action. For example, I made an SSL certificate update script and it is applied to the EMR cluster as a step. This is part of my Lambda function, written in Python, but you can also add the step manually on the console or from other languages. Steps=[{ 'Name': 'PrestoCertificate', 'ActionOnFailure': 'CONTINUE', 'HadoopJarStep':…
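A hedged, more complete sketch of the approach the answer describes: submit the script as a step with boto3 so it runs only after the applications are installed. The cluster ID, region, and S3 script path below are placeholders, and the script-runner bucket is region-specific.

```python
import boto3

# Hedged sketch completing the answer's idea: run a shell script from S3 as an
# EMR step, so it executes after Spark and the other applications are installed.
emr = boto3.client("emr", region_name="us-east-1")   # region is an assumption

response = emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",                      # placeholder cluster id
    Steps=[
        {
            "Name": "PrestoCertificate",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                # script-runner.jar downloads and executes a script from S3;
                # the bucket name is region-specific, adjust to your region.
                "Jar": "s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar",
                "Args": ["s3://my-bucket/scripts/update-ssl-cert.sh"],  # placeholder
            },
        }
    ],
)
print(response["StepIds"])
```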

Using GroupBy while copying from HDFS to S3 to merge files within a folder

谁说我不能喝 Submitted on 2020-01-05 08:48:09
Question: I have the following folders in HDFS: hdfs://x.x.x.x:8020/Air/BOOK/AE/DOM/20171001/2017100101 hdfs://x.x.x.x:8020/Air/BOOK/AE/INT/20171001/2017100101 hdfs://x.x.x.x:8020/Air/BOOK/BH/INT/20171001/2017100101 hdfs://x.x.x.x:8020/Air/BOOK/IN/DOM/20171001/2017100101 hdfs://x.x.x.x:8020/Air/BOOK/IN/INT/20171001/2017100101 hdfs://x.x.x.x:8020/Air/BOOK/KW/DOM/20171001/2017100101 hdfs://x.x.x.x:8020/Air/BOOK/KW/INT/20171001/2017100101 hdfs://x.x.x.x:8020/Air/BOOK/ME/INT/20171001/2017100101 hdfs://x.x…
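A hedged sketch of how such a copy is commonly done on EMR: submit an s3-dist-cp step whose --groupBy regular expression concatenates the files within each folder into one output object per group. The cluster ID, bucket, target size, and the regex (its capture groups decide which files are merged together) are all assumptions to adapt to the real layout.

```python
import boto3

# Hedged sketch: merge small files per HDFS folder while copying to S3 with
# s3-dist-cp. Cluster id, bucket, and the --groupBy regex are placeholders.
emr = boto3.client("emr", region_name="us-east-1")

emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",                     # placeholder cluster id
    Steps=[
        {
            "Name": "MergeAndCopyToS3",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "s3-dist-cp",
                    "--src", "hdfs:///Air/BOOK",
                    "--dest", "s3://my-bucket/Air/BOOK",          # placeholder
                    # one merged file per country/market/date/hour folder
                    "--groupBy", ".*/BOOK/(\\w+)/(\\w+)/(\\d+)/(\\d+)/.*",
                    "--targetSize", "512",                        # MB, optional
                ],
            },
        }
    ],
)
```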

EMR creation task and core nodes not able to specify as “Max on demand” for spot pricing

旧巷老猫 Submitted on 2020-01-05 05:03:11
Question: core_instance_group { instance_type = "c4.large" instance_count = 1 ebs_config { size = "40" type = "gp2" volumes_per_instance = 1 } bid_price = "0.30" I would need bid_price to be the maximum on-demand price, but I'm not sure how to pass this parameter in Terraform.
Answer 1: I figured out a way, though it needed a couple of scripts to fetch the price details. Something like this: AWS price command: InstanceType=$1 aws pricing get-products --filters Type=TERM_MATCH,Field=instanceType,Value=${InstanceType} Type=TERM…
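A hedged Python counterpart of the answer's CLI lookup: fetch the on-demand price for an instance type via the AWS Price List API so it can be fed to Terraform (for example through an external data source or a -var flag) as the bid price. The filter values (Linux, shared tenancy, the US East region name) and the response parsing are assumptions; verify them against a real get-products response for your region.

```python
import json
import boto3

# Hedged sketch: look up the on-demand USD price for an EC2 instance type.
pricing = boto3.client("pricing", region_name="us-east-1")  # Pricing API endpoint region

def on_demand_price(instance_type: str) -> str:
    resp = pricing.get_products(
        ServiceCode="AmazonEC2",
        Filters=[
            {"Type": "TERM_MATCH", "Field": "instanceType", "Value": instance_type},
            {"Type": "TERM_MATCH", "Field": "operatingSystem", "Value": "Linux"},
            {"Type": "TERM_MATCH", "Field": "tenancy", "Value": "Shared"},
            {"Type": "TERM_MATCH", "Field": "preInstalledSw", "Value": "NA"},
            {"Type": "TERM_MATCH", "Field": "location", "Value": "US East (N. Virginia)"},
        ],
        MaxResults=1,
    )
    product = json.loads(resp["PriceList"][0])
    # Drill down to the first on-demand price dimension (structure per the
    # public Price List API; check it against an actual response).
    term = next(iter(product["terms"]["OnDemand"].values()))
    dimension = next(iter(term["priceDimensions"].values()))
    return dimension["pricePerUnit"]["USD"]

if __name__ == "__main__":
    print(on_demand_price("c4.large"))
```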

Copy files from S3 to EMR local using Lambda

被刻印的时光 ゝ Submitted on 2020-01-05 04:57:28
Question: I need to move files from S3 to EMR's local directory /home/hadoop programmatically using Lambda. S3DistCp only copies to HDFS, so I then log into EMR and run an hdfs copyToLocal command on the command line to get the files into /home/hadoop. Is there a programmatic way, using boto3 in Lambda, to copy from S3 to EMR's local directory?
Answer 1: I wrote a test Lambda function that submits a job step to EMR to copy files from S3 to EMR's local directory, and this worked. emrclient = boto3.client('emr', region_name='us-west-2')…
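A hedged sketch of the kind of Lambda function the answer describes: it submits an EMR step that runs an aws s3 cp command on the master node, which places the files in /home/hadoop. The cluster ID, bucket, and prefix are placeholders, not taken from the original post.

```python
import boto3

# Hedged sketch: Lambda handler that submits an EMR step copying S3 objects
# to the master node's local /home/hadoop directory. Cluster id and S3 path
# are placeholders.
emrclient = boto3.client("emr", region_name="us-west-2")

def lambda_handler(event, context):
    response = emrclient.add_job_flow_steps(
        JobFlowId="j-XXXXXXXXXXXXX",                 # placeholder cluster id
        Steps=[
            {
                "Name": "CopyS3ToLocal",
                "ActionOnFailure": "CONTINUE",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": [
                        "aws", "s3", "cp",
                        "s3://my-bucket/input/",     # placeholder source prefix
                        "/home/hadoop/",             # EMR master's local dir
                        "--recursive",
                    ],
                },
            }
        ],
    )
    return response["StepIds"]
```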
