amazon-emr

AWS Athena concurrency limits: Number of submitted queries VS number of running queries

Submitted by 丶灬走出姿态 on 2019-12-19 03:56:27
Question: According to the AWS Athena limits, you can submit up to 20 queries of the same type at a time, but this is a soft limit that can be increased on request. I use boto3 to interact with Athena, and my script submits 16 CTAS queries, each of which takes about 2 minutes to finish. I am the only person using the Athena service in this AWS account. However, when I look at the state of the queries through the console, I see that only a few of them (5 on average) are actually being executed, despite all of them being …
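
A minimal boto3 sketch of the submit-then-poll pattern described above (the region, the S3 output location, and the ctas_statements list are illustrative assumptions, not from the question):

    import time
    import boto3

    athena = boto3.client("athena", region_name="us-east-1")

    # Fire off all CTAS statements without waiting on each one.
    query_ids = []
    for sql in ctas_statements:  # assumed: a list of CTAS strings
        resp = athena.start_query_execution(
            QueryString=sql,
            ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
        )
        query_ids.append(resp["QueryExecutionId"])

    # Poll the states; this is where the gap between submitted and
    # actually RUNNING queries becomes visible.
    while query_ids:
        for qid in list(query_ids):
            status = athena.get_query_execution(QueryExecutionId=qid)
            state = status["QueryExecution"]["Status"]["State"]
            if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
                query_ids.remove(qid)
        time.sleep(10)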

Pyspark - Load file: Path does not exist

Submitted by 99封情书 on 2019-12-19 03:39:14
Question: I am a newbie to Spark. I'm trying to read a local CSV file within an EMR cluster. The file is located in /home/hadoop/. The script that I'm using is this one:

    from pyspark.sql import SparkSession

    spark = SparkSession \
        .builder \
        .appName("Protob Conversion to Parquet") \
        .config("spark.some.config.option", "some-value") \
        .getOrCreate()

    df = spark.read.csv('/home/hadoop/observations_temp.csv', header=True)

When I run the script, it raises the following error message:

    pyspark.sql.utils.AnalysisException: u'Path does not exist: …
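
A sketch of the usual fix, assuming the CSV really does sit on the node's local filesystem: on EMR, a bare path resolves against HDFS by default, so a local file needs an explicit file:// scheme (and executors must be able to see that path, which is why copying the file into HDFS first is the common alternative):

    # Read from the local filesystem instead of HDFS (scheme added):
    df = spark.read.csv("file:///home/hadoop/observations_temp.csv", header=True)

    # Alternative: put the file into HDFS first, then read it without a scheme:
    #   hadoop fs -put /home/hadoop/observations_temp.csv /user/hadoop/
    # df = spark.read.csv("/user/hadoop/observations_temp.csv", header=True)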

Optimizing GC on EMR cluster

Submitted by 回眸只為那壹抹淺笑 on 2019-12-18 19:33:09
Question: I am running a Spark job written in Scala on EMR, and the stdout of each executor is filled with GC allocation failures:

    2016-12-07T23:42:20.614+0000: [GC (Allocation Failure) 2016-12-07T23:42:20.614+0000: [ParNew: 909549K->432K(1022400K), 0.0089234 secs] 2279433K->1370373K(3294336K), 0.0090530 secs] [Times: user=0.11 sys=0.00, real=0.00 secs]
    2016-12-07T23:42:21.572+0000: [GC (Allocation Failure) 2016-12-07T23:42:21.572+0000: [ParNew: 909296K->435K(1022400K), 0.0089298 secs] 2279237K- …
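
Worth noting that "Allocation Failure" in these ParNew lines is the normal trigger for a young-generation collection, not an error in itself; tuning usually starts with the executor JVM flags. A minimal sketch (the flag choice and app name are illustrative, not from the question):

    from pyspark.sql import SparkSession

    # Switch executors to G1GC and keep verbose GC logging, a common first
    # step when frequent ParNew collections dominate executor stdout.
    spark = (
        SparkSession.builder
        .appName("gc-tuning-sketch")
        .config(
            "spark.executor.extraJavaOptions",
            "-XX:+UseG1GC -XX:+PrintGCDetails -XX:+PrintGCDateStamps",
        )
        .getOrCreate()
    )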

How do I make matplotlib work in AWS EMR Jupyter notebook?

Submitted by *爱你&永不变心* on 2019-12-18 18:50:42
Question: This is very close to this question, but I have added a few details specific to my own: Matplotlib Plotting using AWS-EMR jupyter notebook. I would like to find a way to use matplotlib inside my Jupyter notebook. Here is the code snippet in error; it's fairly simple:

    import matplotlib
    matplotlib.use("agg")
    import matplotlib.pyplot as plt
    plt.plot([1, 2, 3, 4])
    plt.show()

I chose this snippet because this line alone fails, as it tries to use TKinter (which is not installed on an AWS …
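
A sketch of one common workaround, assuming the notebook uses a Sparkmagic PySpark kernel: the %%local cell magic runs the cell in the local IPython process on the notebook host (where matplotlib can render inline) instead of on the remote Spark driver:

    %%local
    %matplotlib inline
    import matplotlib.pyplot as plt

    plt.plot([1, 2, 3, 4])
    plt.show()

With a plain Python kernel, %matplotlib inline alone is usually enough, provided matplotlib is installed on the instance hosting the notebook.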

How to launch and configure an EMR cluster using boto

Submitted by 谁都会走 on 2019-12-18 11:27:39
Question: I'm trying to launch a cluster and run a job, all using boto. I find lots of examples of creating job_flows, but I can't for the life of me find an example that shows:

    1. How to define the cluster to be used (by cluster_id)
    2. How to configure and launch a cluster (for example, if I want to use spot instances for some task nodes)

Am I missing something?

Answer 1: Boto and the underlying EMR API are currently mixing the terms cluster and job flow, and job flow is being deprecated. I consider them …
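
A minimal boto3 sketch of launching a cluster with spot task nodes (region, names, instance types, counts, and bid price are illustrative assumptions, not from the question):

    import boto3

    emr = boto3.client("emr", region_name="us-east-1")

    resp = emr.run_job_flow(
        Name="example-cluster",
        ReleaseLabel="emr-5.0.0",
        Instances={
            "InstanceGroups": [
                {"Name": "master", "InstanceRole": "MASTER",
                 "InstanceType": "m3.xlarge", "InstanceCount": 1,
                 "Market": "ON_DEMAND"},
                {"Name": "core", "InstanceRole": "CORE",
                 "InstanceType": "m3.xlarge", "InstanceCount": 2,
                 "Market": "ON_DEMAND"},
                # Task nodes on the spot market, as asked about above.
                {"Name": "task-spot", "InstanceRole": "TASK",
                 "InstanceType": "m3.xlarge", "InstanceCount": 4,
                 "Market": "SPOT", "BidPrice": "0.10"},
            ],
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )
    cluster_id = resp["JobFlowId"]  # the "cluster id" used by later calls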

How do you make a HIVE table out of JSON data?

Submitted by 穿精又带淫゛_ on 2019-12-18 10:05:49
Question: I want to create a Hive table out of some JSON data (nested) and run queries on it. Is this even possible? I've gotten as far as uploading the JSON file to S3 and launching an EMR instance, but I don't know what to type in the Hive console to get the JSON file to be a Hive table. Does anyone have an example command to get me started? I can't find anything useful with Google.

Answer 1: You'll need to use a JSON serde in order for Hive to map your JSON to the columns in your table. A really good …
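
A sketch of the kind of DDL such a serde-based answer points at, issued here through PySpark with Hive support rather than the Hive CLI; the table name, columns, serde class, and S3 path are illustrative assumptions (the serde jar must be on the classpath):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # Map nested JSON fields onto Hive columns via a JSON serde;
    # STRUCT covers one level of nesting.
    spark.sql("""
        CREATE EXTERNAL TABLE IF NOT EXISTS events (
            user_id STRING,
            ts BIGINT,
            payload STRUCT<kind: STRING, value: DOUBLE>
        )
        ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
        LOCATION 's3://my-bucket/events/'
    """)

    spark.sql("SELECT payload.kind, COUNT(*) FROM events GROUP BY payload.kind").show()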

How to upgrade Data Pipeline definition from EMR 3.x to 4.x/5.x?

Submitted by 风格不统一 on 2019-12-18 06:20:11
Question: I would like to upgrade my AWS Data Pipeline definition to EMR 4.x or 5.x so I can take advantage of Hive's latest features (version 2.0+), such as CURRENT_DATE and CURRENT_TIMESTAMP. The change from EMR 3.x to 4.x/5.x requires the use of releaseLabel in EmrCluster, versus amiVersion. When I use "releaseLabel": "emr-4.1.0", I get the following error:

    FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.tez.TezTask

Below is my data pipeline definition, for EMR …
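
For reference, a sketch of what the releaseLabel swap looks like in an EmrCluster pipeline object (field values are illustrative; the amiVersion-to-releaseLabel change is the only point being shown):

    {
      "id": "EmrClusterForHive",
      "type": "EmrCluster",
      "releaseLabel": "emr-5.13.0",
      "masterInstanceType": "m3.xlarge",
      "coreInstanceType": "m3.xlarge",
      "coreInstanceCount": "2"
    }

On newer EMR releases Hive runs on Tez rather than MapReduce, which is consistent with the TezTask error above.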

Amazon Elastic MapReduce - mass insert from S3 to DynamoDB is incredibly slow

Submitted by 笑着哭i on 2019-12-17 22:47:42
Question: I need to perform an initial upload of roughly 130 million items (5+ GB total) into a single DynamoDB table. After I ran into problems uploading them using the API from my application, I decided to try EMR instead. Long story short, the import of that very average (for EMR) amount of data takes ages even on the most powerful cluster, consuming hundreds of hours with very little progress (about 20 minutes to process a 2 MB test chunk, and it didn't manage to finish with the test 700 MB file in …
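
In this kind of load, the write ceiling is usually the DynamoDB table's provisioned throughput rather than EMR itself, since the EMR connector throttles itself to a fraction of the table's provisioned capacity. A boto3 sketch of that lever (table name and capacity numbers are illustrative assumptions):

    import boto3

    ddb = boto3.client("dynamodb", region_name="us-east-1")

    # Raise provisioned writes before the bulk load, then dial them
    # back down afterwards to avoid paying for idle capacity.
    ddb.update_table(
        TableName="my-table",
        ProvisionedThroughput={
            "ReadCapacityUnits": 100,
            "WriteCapacityUnits": 10000,
        },
    )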

“Container killed by YARN for exceeding memory limits. 10.4 GB of 10.4 GB physical memory used” on an EMR cluster with 75GB of memory

Submitted by 十年热恋 on 2019-12-17 21:25:13
Question: I'm running a 5-node Spark cluster on AWS EMR, each node sized m3.xlarge (1 master, 4 slaves). I successfully ran through a 146 MB bzip2-compressed CSV file and ended up with a perfectly aggregated result. Now I'm trying to process a ~5 GB bzip2 CSV file on this cluster, but I'm receiving this error:

    16/11/23 17:29:53 WARN TaskSetManager: Lost task 49.2 in stage 6.0 (TID xxx, xxx.xxx.xxx.compute.internal): ExecutorLostFailure (executor 16 exited caused by one of the running tasks) Reason: Container …
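
A sketch of the usual first adjustment for this error, using the Spark 1.x/2.x-era key name (values are illustrative; the overhead covers off-heap memory that YARN counts against the container on top of the executor heap):

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("memory-overhead-sketch")
        # Reserve extra off-heap headroom per executor (in MiB); raising
        # this is the standard response to "killed by YARN for exceeding
        # memory limits". Reducing cores or memory per executor also helps.
        .config("spark.yarn.executor.memoryOverhead", "2048")
        .config("spark.executor.memory", "8g")
        .getOrCreate()
    )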