emr

s3-dist-cp and hadoop distcp job infinitely looping in EMR

喜夏-厌秋 submitted on 2019-12-25 07:44:58
Question: I'm trying to copy 193 GB of data from S3 to HDFS. I'm running the following commands for s3-dist-cp and hadoop distcp: s3-dist-cp --src s3a://PathToFile/file1 --dest hdfs:///user/hadoop/S3CopiedFiles/ hadoop distcp s3a://PathToFile/file1 hdfs:///user/hadoop/S3CopiedFiles/ I'm running these on the master node and also keeping a check on the amount being transferred. It took about an hour, and after copying it over, everything gets erased, and disk space is shown as 99.8% in the 4 core instances in my
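For a transfer like the one above, a quick way to confirm what actually landed in HDFS and how much DataNode capacity it consumed is to check usage from the master node. This is a minimal sketch using standard HDFS commands; the destination path is the one from the question:

    hdfs dfs -du -s -h hdfs:///user/hadoop/S3CopiedFiles/        # size of the copied data (before replication)
    hdfs dfsadmin -report | grep -E 'DFS Used%|DFS Remaining'    # per-DataNode and cluster-wide usage

Keep in mind that HDFS replication multiplies the raw space needed: 193 GB of source data at a replication factor of 3 requires roughly 580 GB of DataNode capacity.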

Load props file in EMR Spark Application

点点圈 submitted on 2019-12-24 02:13:44
Question: I am trying to load custom properties in my Spark application using: command-runner.jar,spark-submit,--deploy-mode,cluster,--properties-file,s3://spark-config-test/myprops.conf,--num-executors,5,--executor-cores,2,--class,com.amazon.Main,#{input.directoryPath}/SWALiveOrderModelSpark-1.0-super.jar However, I am getting the following exception: Exception in thread "main" java.lang.IllegalArgumentException: Invalid properties file 's3://spark-config-test/myprops.conf''. at org.apache.spark
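The excerpt is cut off, but the exception itself suggests that spark-submit expects --properties-file to point at a file on the local filesystem of the node launching the submission, not at an S3 URI. One workaround (a sketch, not necessarily the accepted answer) is to copy the file down first, for example in a preceding step or bootstrap action, and pass the local path; the bucket, file, and jar names are the ones from the question:

    aws s3 cp s3://spark-config-test/myprops.conf /home/hadoop/myprops.conf
    spark-submit --deploy-mode cluster --properties-file /home/hadoop/myprops.conf \
        --num-executors 5 --executor-cores 2 --class com.amazon.Main \
        #{input.directoryPath}/SWALiveOrderModelSpark-1.0-super.jar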

Spark: OutOfMemoryError: Requested array size exceeds VM limit

孤者浪人 submitted on 2019-12-23 23:16:05
Question: I am running a Spark job on an EMR cluster (a master with 10 slaves) of type r3.8xlarge: spark.driver.cores 30 spark.driver.memory 200g spark.executor.cores 16 spark.executor.instances 40 spark.executor.memory 60g spark.storage.memoryFraction 0.95 spark.sql.shuffle.partitions 2400 spark.default.parallelism 2400 spark.executor.extraJavaOptions -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 -XX:
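The excerpt is truncated, but "Requested array size exceeds VM limit" generally means a single JVM array grew past the ~2 GB limit (Integer.MAX_VALUE elements), which in Spark usually points at one oversized partition, record, or serialized buffer rather than at total heap size. A hedged sketch of the usual mitigation, spreading the data over more partitions before the heavy stage (the DataFrame name df and the partition count are hypothetical, not from the question):

    // Increase the partition count so that no single partition's
    // serialized buffer approaches the 2 GB JVM array limit.
    val repartitioned = df.repartition(4800)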

Hive partition pruning on computed column

巧了我就是萌 submitted on 2019-12-23 17:18:55
Question: I have a few tables in Hive and my query is trying to retrieve the data for the past x days. Hive prunes the partitions when I use a literal date, but does a full table scan when I use a formula instead. select * from f_event where date_key > 20160101; scanned partitions.. s3://...key=20160102 [f] s3://...key=20160103 [f] s3://...key=20160104 [f] If I use a formula, say, to get the past 4 weeks of data: Select count(*) From f_event f Where date_key > from_unixtime(unix_timestamp()-2*7
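The query above is cut off, but the behaviour it describes is a known limitation: partition pruning happens at planning time against constant filters, and unix_timestamp() is treated as non-deterministic, so the expression cannot be folded into a constant and Hive falls back to a full scan. A common workaround (sketched here as an assumption, since the answer isn't shown) is to compute the cutoff outside Hive and pass it in as a literal; the variable name and the shell date arithmetic are illustrative:

    cutoff=$(date -d '28 days ago' +%Y%m%d)
    hive --hivevar cutoff=$cutoff \
         -e "select count(*) from f_event where date_key > \${hivevar:cutoff}"

Because the filter is now a plain constant, the planner can prune to the matching s3://...key= partitions instead of scanning the whole table.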

Spark broadcast variable returns NullPointerException when run in Amazon EMR cluster

喜你入骨 submitted on 2019-12-23 08:06:13
Question: The variables I share via broadcast are null in the cluster. My application is quite complex, but I have written this small example that works flawlessly when I run it locally, yet fails in the cluster: package com.gonzalopezzi.bigdata.bicing import org.apache.spark.broadcast.Broadcast import org.apache.spark.rdd.RDD import org.apache.spark.{SparkContext, SparkConf} object PruebaBroadcast2 extends App { val conf = new SparkConf().setAppName("PruebaBroadcast2") val sc = new SparkContext
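The code excerpt is truncated, but one frequent cause of exactly this symptom (offered here as an assumption, not as the question's confirmed answer) is declaring the program as object ... extends App: the App trait's delayed initialization can leave object fields, including broadcast values, null when the closure is deserialized on executors. A minimal sketch of the same skeleton rewritten with an explicit main method; the broadcast payload is hypothetical:

    import org.apache.spark.{SparkConf, SparkContext}

    object PruebaBroadcast2 {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("PruebaBroadcast2")
        val sc = new SparkContext(conf)
        // The broadcast value is created inside main, so it is fully
        // initialized before any executor deserializes the closure.
        val lookup = sc.broadcast(Map(1 -> "a", 2 -> "b"))
        sc.parallelize(Seq(1, 2))
          .map(k => lookup.value.getOrElse(k, "?"))
          .collect()
          .foreach(println)
        sc.stop()
      }
    }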

Use bootstrap to replace default jar on EMR

烈酒焚心 submitted on 2019-12-23 03:49:26
Question: I am on an EMR cluster with AMI 3.0.4. Once the cluster is up, I ssh to the master and did the following manually: cd /home/hadoop/share/hadoop/common/lib/ rm guava-11.0.2.jar wget http://central.maven.org/maven2/com/google/guava/guava/14.0.1/guava-14.0.1.jar chmod 777 guava-14.0.1.jar Is it possible to do the above in a bootstrap action? Thanks! Answer 1: With EMR 4.0 the hadoop installation path changed. So the manual update of guava-14.0.1.jar must be changed to: cd /usr/lib/hadoop/lib sudo wget http:/
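For the AMI 3.x layout described in the question, the same steps can be wrapped into a bootstrap-action script that EMR runs on every node before the daemons start. A minimal sketch, assuming the Hadoop directories already exist when bootstrap actions run on this AMI (the script name and its S3 location are hypothetical; the jar URL is the one from the question):

    #!/bin/bash
    # replace-guava.sh -- upload to S3 and register as an EMR bootstrap action
    set -e
    cd /home/hadoop/share/hadoop/common/lib/
    rm -f guava-11.0.2.jar
    wget http://central.maven.org/maven2/com/google/guava/guava/14.0.1/guava-14.0.1.jar
    chmod 777 guava-14.0.1.jar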

Amazon EMR: Configuring storage on data nodes

别说谁变了你拦得住时间么 submitted on 2019-12-22 11:05:20
Question: I'm using Amazon EMR and I'm able to run most jobs fine. I'm running into a problem when I start loading and generating more data within the EMR cluster: the cluster runs out of storage space. Each data node is a c1.medium instance. According to the links here and here, each data node should come with 350GB of instance storage. Through the ElasticMapReduce Slave security group I've been able to verify in my AWS Console that the c1.medium data nodes are running and are instance stores. When I
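A quick way to see how much of that instance storage is actually mounted and available to HDFS is to check from a core node over SSH. A sketch using standard commands (exact device names and mount points vary by instance type and EMR version):

    df -h                      # which ephemeral volumes are mounted and how full they are
    hadoop dfsadmin -report    # configured HDFS capacity and per-DataNode usage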

How to restart a Hadoop cluster on EMR

不羁的心 submitted on 2019-12-22 10:28:58
Question: I have a Hadoop installation on Amazon Elastic MapReduce; whenever I try to restart the cluster I get the following error: /stop-all.sh no jobtracker to stop The authenticity of host 'localhost (::1)' can't be established. RSA key fingerprint is Are you sure you want to continue connecting (yes/no)? yes localhost: Warning: Permanently added 'localhost' (RSA) to the list of known hosts. localhost: Permission denied (publickey). no namenode to stop localhost: Permission denied (publickey).
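The stop-all.sh/start-all.sh scripts assume passwordless SSH to localhost and daemons started the vanilla-Hadoop way, neither of which holds on EMR, hence the "no jobtracker to stop" and "Permission denied (publickey)" messages. On EMR the daemons are managed as services instead; a hedged sketch of restarting them that way (the service names below are assumptions and vary by AMI/release, so first list what is actually installed):

    ls /etc/init.d/ | grep -i hadoop             # see which daemon scripts this AMI provides
    sudo /etc/init.d/hadoop-namenode restart     # example: restart the NameNode on the master
    sudo /etc/init.d/hadoop-jobtracker restart   # example: restart the JobTracker (Hadoop 1.x AMIs)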

How to set spark.driver.memory for Spark/Zeppelin on EMR

£可爱£侵袭症+ submitted on 2019-12-22 07:00:03
Question: When using EMR (with Spark and Zeppelin), changing spark.driver.memory in Zeppelin's Spark interpreter settings won't work. What is the best and quickest way to set the Spark driver memory when using the EMR web interface (not the AWS CLI) to create clusters? Could a bootstrap action be a solution? If yes, can you please provide an example of how the bootstrap action file should look? Answer 1: You can always try to add the following configuration on job flow/cluster creation: [ { "Classification":
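The answer's JSON is cut off above. A hedged sketch of what an EMR software configuration for this setting typically looks like (the 8g value is illustrative, and placing the property in the spark-defaults classification is an assumption about where the answer was heading):

    [
      {
        "Classification": "spark-defaults",
        "Properties": {
          "spark.driver.memory": "8g"
        }
      }
    ]

This JSON can be pasted into the software settings (configurations) field when creating the cluster from the EMR console, so no bootstrap action is required for it.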