amazon-emr

AWS EMR: how to use a shell script as a bootstrap action?

Submitted by 跟風遠走 on 2020-01-04 06:27:14
Question: I need to be able to use Java 8 on EMR. I found this post, https://crazydoc1.wordpress.com/2015/08/23/java-8-on-amazon-emr-ami-4-0-0/, which links to a bootstrap shell script, https://gist.github.com/pstorch/c217d8324c4133a003c4, that installs Java 8. But looking at the documentation on how to use bootstrap actions, it's not at all apparent how to use a shell script, since the form asks for a JAR location (https://docs.aws.amazon.com/ElasticMapReduce/latest…
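A bootstrap action does not have to be a JAR: both the console's custom-action option and the AWS CLI accept an S3 path to any executable script. A minimal CLI sketch, assuming a hypothetical bucket and script name (other cluster options abbreviated):

```shell
# Sketch: create an EMR cluster whose bootstrap action is a shell script
# stored in S3. Bucket, script, and cluster names are placeholders.
aws emr create-cluster \
  --name "java8-cluster" \
  --release-label emr-4.0.0 \
  --instance-type m4.large \
  --instance-count 3 \
  --bootstrap-actions 'Path=s3://my-bucket/install-java8.sh,Name="Install Java 8"'
```

EMR downloads the script to each node and runs it before applications start, so no JAR is involved at all.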

AWS EMR 4.0 - How can I add a custom JAR step to run shell commands

Submitted by 跟風遠走 on 2020-01-04 05:58:36
Question: I am trying to run shell commands as steps on EMR 4.0.0, using this link for reference: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-hadoop-script.html. But I don't know where to get 'command-runner.jar' for the 'JAR location' field (http://i.stack.imgur.com/CRicz.png). I uploaded 'command-runner.jar' to S3 and tried to load it from that location, gave the S3 location of my 'example.sh' file in 'Arguments', and after adding the step it failed with this exception…
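`command-runner.jar` is a built-in on EMR 4.x+ clusters, so it is not uploaded to S3: the "JAR location" field takes the literal string `command-runner.jar`, and the command itself goes in the arguments. A hedged CLI sketch, with placeholder cluster id and bucket name:

```shell
# Sketch: add a step that fetches and runs a shell script via the built-in
# command-runner.jar. The JAR field is just the name, not an S3 path;
# cluster id and bucket are placeholders.
aws emr add-steps --cluster-id j-XXXXXXXXXXXXX \
  --steps 'Type=CUSTOM_JAR,Name=RunShellScript,ActionOnFailure=CONTINUE,Jar=command-runner.jar,Args=["bash","-c","aws s3 cp s3://my-bucket/example.sh . && bash example.sh"]'
```

Uploading your own copy of `command-runner.jar` and pointing the step at it is exactly what fails, because the step then tries to run your copy as an ordinary custom JAR.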

How to access statistics endpoint for a Spark Streaming application?

Submitted by 寵の児 on 2020-01-03 03:09:08
Question: As of Spark 2.2.0, there are new endpoints in the REST API for getting information about streaming jobs. I run Spark 2.2.0 on EMR clusters in cluster mode. When I hit the endpoint for my streaming jobs, all it gives me is the error message: no streaming listener attached to <stream name>. I've dug through the Spark codebase a bit, but this feature is not well documented. So I'm curious: is this a bug? Is there some configuration I need to do to get this endpoint working? This…
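The streaming endpoints live under the running application's REST API on the driver's UI port, and they are populated only for DStream-based (spark-streaming) applications, where the listener is attached by the StreamingContext; hitting them for other applications produces the "no streaming listener attached" message. A sketch of the request, with host, port, and application id as placeholders (in YARN cluster mode the UI is reached through the ResourceManager proxy rather than directly):

```shell
# Sketch: query the streaming statistics of a running DStream application.
# Host, port, and app id below are placeholders.
curl "http://driver-host:4040/api/v1/applications/app-20200103000000-0001/streaming/statistics"
```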

AWS EMR Spark “No Module named pyspark”

Submitted by 亡梦爱人 on 2020-01-03 02:49:05
Question: I created a Spark cluster, SSH'd into the master, and launched the shell with: MASTER=yarn-client ./spark/bin/pyspark. When I run the following: x = sc.textFile("s3://location/files.*"); xt = x.map(lambda x: handlejson(x)); table = sqlctx.inferSchema(xt), I get this error: Error from python worker: /usr/bin/python: No module named pyspark. PYTHONPATH was: /mnt1/var/lib/hadoop/tmp/nm-local-dir/usercache/hadoop/filecache/11/spark-assembly-1.1.0-hadoop2.4.0.jar java.io.EOFException java.io…
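The error means the Python workers on the YARN nodes cannot import the pyspark package. One common workaround on older EMR/Spark setups is to put Spark's Python sources (and the bundled py4j zip) on PYTHONPATH; executors may additionally need it passed through spark.executorEnv.PYTHONPATH. A sketch, where the SPARK_HOME default below is an assumed install location, not taken from the question:

```shell
# Sketch: make the pyspark package importable by pointing PYTHONPATH at the
# Python sources shipped with Spark. SPARK_HOME default is an assumption.
export SPARK_HOME="${SPARK_HOME:-/home/hadoop/spark}"
# Pick up whatever py4j source zip this Spark build ships with.
PY4J_ZIP=$(ls "$SPARK_HOME"/python/lib/py4j-*-src.zip 2>/dev/null | head -n 1)
export PYTHONPATH="$SPARK_HOME/python:$PY4J_ZIP:$PYTHONPATH"
echo "$PYTHONPATH"
```

Launching through spark-submit instead of invoking the shell directly also sets these paths up automatically.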

External checkpoints to S3 on EMR

Submitted by 放肆的年华 on 2020-01-02 22:07:25
Question: I am trying to deploy a production cluster for my Flink program. I am using a standard hadoop-core EMR cluster with Flink 1.3.2 installed, running on YARN. I am trying to configure RocksDB to write my checkpoints to an S3 bucket, following these docs: https://ci.apache.org/projects/flink/flink-docs-release-1.3/setup/aws.html#set-s3-filesystem. The problem seems to be getting the dependencies working correctly. I receive this error when trying to run the program: java.lang…
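For reference, the configuration side of this is small; the dependency side is what the linked docs cover (on EMR, the preinstalled Hadoop S3 filesystem normally handles `s3://` URIs, which avoids hand-assembling Hadoop jars). A sketch of the relevant flink-conf.yaml entries for Flink 1.3, with a placeholder bucket name:

```yaml
# flink-conf.yaml (Flink 1.3): RocksDB state backend, checkpoints in S3.
# Bucket and path are placeholders.
state.backend: rocksdb
state.backend.fs.checkpointdir: s3://my-bucket/flink/checkpoints
```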

Flink TaskManagers do not start until job is submitted in YARN cluster

Submitted by 偶尔善良 on 2020-01-02 18:24:30
Question: I am using Amazon EMR to run a Flink cluster on YARN. My setup consists of m4.large instances: 1 master and 2 core nodes. I started the Flink cluster on YARN with the command: flink-yarn-session -n 2 -d -tm 4096 -s 4. The Flink JobManager and the YARN ApplicationMaster start, but no TaskManagers are running: the Flink web interface shows 0 for task managers, task slots, and available slots. However, when I submit a job to the Flink cluster, TaskManagers do get allocated, and the job runs and…
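One way to see what was actually allocated is to ask YARN directly rather than the Flink UI; if the session's application has only a single container, that container is the JobManager/ApplicationMaster and no TaskManager containers exist yet. A sketch, with a placeholder application id:

```shell
# Sketch: inspect the Flink session's YARN application and its attempts.
# The application id is a placeholder.
yarn application -list -appStates RUNNING
yarn applicationattempt -list application_1577962500000_0001
```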

How to set livy.server.session.timeout in an EMR cluster bootstrap?

Submitted by 大憨熊 on 2020-01-02 08:09:07
Question: I am creating an EMR cluster and using a Jupyter notebook to run some Spark tasks. My tasks die after approximately 1 hour of execution with the error: An error was encountered: Invalid status code '400' from https://xxx.xx.x.xxx:18888/sessions/0/statements/20 with error payload: "requirement failed: Session isn't active." My understanding is that this is related to the Livy setting livy.server.session.timeout, but I don't know how to set it at cluster bootstrap (I need to do it…
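On EMR, Livy settings can be applied at cluster-creation time through the `livy-conf` configuration classification, so no bootstrap script is needed. A hedged sketch (cluster name, release, and instance settings are placeholders; the timeout value is an example):

```shell
# Sketch: raise the Livy session timeout via the livy-conf classification
# at cluster creation. Names and sizes below are placeholders.
aws emr create-cluster \
  --name "livy-timeout-cluster" \
  --release-label emr-5.20.0 \
  --applications Name=Spark Name=Livy \
  --instance-type m4.large --instance-count 3 \
  --configurations '[{"Classification":"livy-conf","Properties":{"livy.server.session.timeout":"5h"}}]'
```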

AWS EMR performance HDFS vs S3

Submitted by 删除回忆录丶 on 2020-01-01 11:34:42
Question: In big data, the code is pushed toward the data for execution. This makes sense, since the data is huge and the code for execution is relatively small. On AWS EMR, the data can be either in HDFS or in S3. In the S3 case, the data has to be pulled over the network to the core/task nodes for execution, which can be overhead compared to data in HDFS. Recently, I noticed that while an MR job was executing there was huge latency getting the log files into S3. Sometimes it…
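A common pattern for trading off these costs is to let the job read and write HDFS for locality, then move results to S3 in one batch with the EMR-provided s3-dist-cp tool. A sketch, with placeholder paths:

```shell
# Sketch: copy job output from cluster-local HDFS to S3 in a single
# distributed batch. Both paths are placeholders.
s3-dist-cp --src hdfs:///user/hadoop/output --dest s3://my-bucket/output
```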

Amazon Elastic MapReduce Bootstrap Actions not working

Submitted by 谁说我不能喝 on 2020-01-01 06:51:06
Question: I have tried the following combinations of bootstrap actions to increase the heap size of my job, but none of them seems to work: --mapred-key-value mapred.child.java.opts=-Xmx1024m --mapred-key-value mapred.child.ulimit=unlimited --mapred-key-value mapred.map.child.java.opts=-Xmx1024m --mapred-key-value mapred.map.child.ulimit=unlimited -m mapred.map.child.java.opts=-Xmx1024m -m mapred.map.child.ulimit=unlimited -m mapred.child.java.opts=-Xmx1024m -m mapred.child.ulimit=unlimited. What is the…
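On the older AMI releases, per-job Hadoop settings were passed through the predefined configure-hadoop bootstrap action, whose `-m` flag writes into mapred-site. A sketch with the AWS CLI, where everything except the bootstrap-action path and the `-m` argument is a placeholder:

```shell
# Sketch: raise the child JVM heap via the predefined configure-hadoop
# bootstrap action (older AMI releases). Name, AMI version, and instance
# settings are placeholders.
aws emr create-cluster \
  --name "bigger-heap" \
  --ami-version 3.11.0 \
  --instance-type m1.large --instance-count 3 \
  --bootstrap-actions 'Path=s3://elasticmapreduce/bootstrap-actions/configure-hadoop,Args=[-m,mapred.child.java.opts=-Xmx1024m]'
```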