amazon-emr

AWS EMR: how to use a shell script as a bootstrap action?

Submitted by 跟風遠走 on 2020-01-04 06:27:14
Question: I need to be able to use Java 8 on EMR. I found this post, https://crazydoc1.wordpress.com/2015/08/23/java-8-on-amazon-emr-ami-4-0-0/, which links to a bootstrap shell script, https://gist.github.com/pstorch/c217d8324c4133a003c4, that installs Java 8. But looking at the documentation on how to use bootstrap actions, it's not at all apparent how to use a shell script, since the form asks for a JAR location (https://docs.aws.amazon.com/ElasticMapReduce/latest…
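A bootstrap action does not have to be a JAR: both the console's custom-action option and the AWS CLI accept an S3 path to any executable script. A minimal CLI sketch, assuming a hypothetical bucket and script name (other cluster options abbreviated):

```shell
# Sketch: create an EMR cluster whose bootstrap action is a shell script
# stored in S3. Bucket, script, and cluster names are placeholders.
aws emr create-cluster \
  --name "java8-cluster" \
  --release-label emr-4.0.0 \
  --instance-type m4.large \
  --instance-count 3 \
  --bootstrap-actions 'Path=s3://my-bucket/install-java8.sh,Name="Install Java 8"'
```

EMR downloads the script to each node and runs it before applications start, so no JAR is involved at all.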

AWS EMR 4.0 - How can I add a custom JAR step to run shell commands

Submitted by 跟風遠走 on 2020-01-04 05:58:36
Question: I am trying to run shell commands as steps on EMR 4.0.0, using this link for reference: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-hadoop-script.html. But I don't know where to get 'command-runner.jar' for the 'JAR location' field (http://i.stack.imgur.com/CRicz.png). I uploaded 'command-runner.jar' to S3 and tried to load it from that location, gave the S3 location of my 'example.sh' file in 'Arguments', and after adding the step it failed with this exception…
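`command-runner.jar` is a built-in on EMR 4.x+ clusters, so it is not uploaded to S3: the "JAR location" field takes the literal string `command-runner.jar`, and the command itself goes in the arguments. A hedged CLI sketch, with placeholder cluster id and bucket name:

```shell
# Sketch: add a step that fetches and runs a shell script via the built-in
# command-runner.jar. The JAR field is just the name, not an S3 path;
# cluster id and bucket are placeholders.
aws emr add-steps --cluster-id j-XXXXXXXXXXXXX \
  --steps 'Type=CUSTOM_JAR,Name=RunShellScript,ActionOnFailure=CONTINUE,Jar=command-runner.jar,Args=["bash","-c","aws s3 cp s3://my-bucket/example.sh . && bash example.sh"]'
```

Uploading your own copy of `command-runner.jar` and pointing the step at it is exactly what fails, because the step then tries to run your copy as an ordinary custom JAR.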

How to access statistics endpoint for a Spark Streaming application?

Submitted by 寵の児 on 2020-01-03 03:09:08
Question: As of Spark 2.2.0, there are new endpoints in the REST API for getting information about streaming jobs. I run Spark 2.2.0 on EMR clusters in cluster mode. When I hit the endpoint for my streaming jobs, all it gives me is the error message: no streaming listener attached to <stream name>. I've dug through the Spark codebase a bit, but this feature is not well documented. So I'm curious: is this a bug? Is there some configuration I need to do to get this endpoint working? This…
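The streaming endpoints live under the running application's REST API on the driver's UI port, and they are populated only for DStream-based (spark-streaming) applications, where the listener is attached by the StreamingContext; hitting them for other applications produces the "no streaming listener attached" message. A sketch of the request, with host, port, and application id as placeholders (in YARN cluster mode the UI is reached through the ResourceManager proxy rather than directly):

```shell
# Sketch: query the streaming statistics of a running DStream application.
# Host, port, and app id below are placeholders.
curl "http://driver-host:4040/api/v1/applications/app-20200103000000-0001/streaming/statistics"
```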

AWS EMR Spark “No Module named pyspark”

Submitted by 亡梦爱人 on 2020-01-03 02:49:05
Question: I created a Spark cluster, SSH'd into the master, and launched the shell with: MASTER=yarn-client ./spark/bin/pyspark. When I run the following: x = sc.textFile("s3://location/files.*"); xt = x.map(lambda x: handlejson(x)); table = sqlctx.inferSchema(xt), I get this error: Error from python worker: /usr/bin/python: No module named pyspark. PYTHONPATH was: /mnt1/var/lib/hadoop/tmp/nm-local-dir/usercache/hadoop/filecache/11/spark-assembly-1.1.0-hadoop2.4.0.jar java.io.EOFException java.io…
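The error means the Python workers on the YARN nodes cannot import the pyspark package. One common workaround on older EMR/Spark setups is to put Spark's Python sources (and the bundled py4j zip) on PYTHONPATH; executors may additionally need it passed through spark.executorEnv.PYTHONPATH. A sketch, where the SPARK_HOME default below is an assumed install location, not taken from the question:

```shell
# Sketch: make the pyspark package importable by pointing PYTHONPATH at the
# Python sources shipped with Spark. SPARK_HOME default is an assumption.
export SPARK_HOME="${SPARK_HOME:-/home/hadoop/spark}"
# Pick up whatever py4j source zip this Spark build ships with.
PY4J_ZIP=$(ls "$SPARK_HOME"/python/lib/py4j-*-src.zip 2>/dev/null | head -n 1)
export PYTHONPATH="$SPARK_HOME/python:$PY4J_ZIP:$PYTHONPATH"
echo "$PYTHONPATH"
```

Launching through spark-submit instead of invoking the shell directly also sets these paths up automatically.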

External checkpoints to S3 on EMR

Submitted by 放肆的年华 on 2020-01-02 22:07:25
Question: I am trying to deploy a production cluster for my Flink program. I am using a standard hadoop-core EMR cluster with Flink 1.3.2 installed, running on YARN. I am trying to configure RocksDB to write my checkpoints to an S3 bucket, following these docs: https://ci.apache.org/projects/flink/flink-docs-release-1.3/setup/aws.html#set-s3-filesystem. The problem seems to be getting the dependencies working correctly. I receive this error when trying to run the program: java.lang…
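For reference, the configuration side of this is small; the dependency side is what the linked docs cover (on EMR, the preinstalled Hadoop S3 filesystem normally handles `s3://` URIs, which avoids hand-assembling Hadoop jars). A sketch of the relevant flink-conf.yaml entries for Flink 1.3, with a placeholder bucket name:

```yaml
# flink-conf.yaml (Flink 1.3): RocksDB state backend, checkpoints in S3.
# Bucket and path are placeholders.
state.backend: rocksdb
state.backend.fs.checkpointdir: s3://my-bucket/flink/checkpoints
```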

Flink TaskManagers do not start until job is submitted in YARN cluster

Submitted by 偶尔善良 on 2020-01-02 18:24:30
Question: I am using Amazon EMR to run a Flink cluster on YARN. My setup consists of m4.large instances: 1 master and 2 core nodes. I started the Flink cluster on YARN with the command: flink-yarn-session -n 2 -d -tm 4096 -s 4. The Flink JobManager and the YARN ApplicationMaster start, but no TaskManagers are running: the Flink web interface shows 0 for task managers, task slots, and available slots. However, when I submit a job to the Flink cluster, TaskManagers do get allocated, and the job runs and…
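One way to see what was actually allocated is to ask YARN directly rather than the Flink UI; if the session's application has only a single container, that container is the JobManager/ApplicationMaster and no TaskManager containers exist yet. A sketch, with a placeholder application id:

```shell
# Sketch: inspect the Flink session's YARN application and its attempts.
# The application id is a placeholder.
yarn application -list -appStates RUNNING
yarn applicationattempt -list application_1577962500000_0001
```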

How to set livy.server.session.timeout in an EMR cluster bootstrap?

Submitted by 大憨熊 on 2020-01-02 08:09:07
Question: I am creating an EMR cluster and using a Jupyter notebook to run some Spark tasks. My tasks die after approximately 1 hour of execution with the error: An error was encountered: Invalid status code '400' from https://xxx.xx.x.xxx:18888/sessions/0/statements/20 with error payload: "requirement failed: Session isn't active." My understanding is that this is related to the Livy setting livy.server.session.timeout, but I don't know how to set it at cluster bootstrap (I need to do it…
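On EMR, Livy settings can be applied at cluster-creation time through the `livy-conf` configuration classification, so no bootstrap script is needed. A hedged sketch (cluster name, release, and instance settings are placeholders; the timeout value is an example):

```shell
# Sketch: raise the Livy session timeout via the livy-conf classification
# at cluster creation. Names and sizes below are placeholders.
aws emr create-cluster \
  --name "livy-timeout-cluster" \
  --release-label emr-5.20.0 \
  --applications Name=Spark Name=Livy \
  --instance-type m4.large --instance-count 3 \
  --configurations '[{"Classification":"livy-conf","Properties":{"livy.server.session.timeout":"5h"}}]'
```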

AWS EMR performance HDFS vs S3

Submitted by 删除回忆录丶 on 2020-01-01 11:34:42
Question: In big data, the code is pushed toward the data for execution. This makes sense, since the data is huge and the code for execution is relatively small. On AWS EMR, the data can be either in HDFS or in S3. In the S3 case, the data has to be pulled over the network to the core/task nodes for execution, which can be overhead compared to data in HDFS. Recently, I noticed that while an MR job was executing there was huge latency getting the log files into S3. Sometimes it…
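A common pattern for trading off these costs is to let the job read and write HDFS for locality, then move results to S3 in one batch with the EMR-provided s3-dist-cp tool. A sketch, with placeholder paths:

```shell
# Sketch: copy job output from cluster-local HDFS to S3 in a single
# distributed batch. Both paths are placeholders.
s3-dist-cp --src hdfs:///user/hadoop/output --dest s3://my-bucket/output
```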

Amazon Elastic MapReduce Bootstrap Actions not working

Submitted by 谁说我不能喝 on 2020-01-01 06:51:06
Question: I have tried the following combinations of bootstrap actions to increase the heap size of my job, but none of them seems to work: --mapred-key-value mapred.child.java.opts=-Xmx1024m --mapred-key-value mapred.child.ulimit=unlimited --mapred-key-value mapred.map.child.java.opts=-Xmx1024m --mapred-key-value mapred.map.child.ulimit=unlimited -m mapred.map.child.java.opts=-Xmx1024m -m mapred.map.child.ulimit=unlimited -m mapred.child.java.opts=-Xmx1024m -m mapred.child.ulimit=unlimited. What is the…
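On the older AMI releases, per-job Hadoop settings were passed through the predefined configure-hadoop bootstrap action, whose `-m` flag writes into mapred-site. A sketch with the AWS CLI, where everything except the bootstrap-action path and the `-m` argument is a placeholder:

```shell
# Sketch: raise the child JVM heap via the predefined configure-hadoop
# bootstrap action (older AMI releases). Name, AMI version, and instance
# settings are placeholders.
aws emr create-cluster \
  --name "bigger-heap" \
  --ami-version 3.11.0 \
  --instance-type m1.large --instance-count 3 \
  --bootstrap-actions 'Path=s3://elasticmapreduce/bootstrap-actions/configure-hadoop,Args=[-m,mapred.child.java.opts=-Xmx1024m]'
```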