amazon-emr

Trying to run Spark on EMR using the AWS SDK for Java, but it skips the remote JAR stored on S3

99封情书 submitted on 2019-12-13 03:56:50

Question: I'm trying to run Spark on EMR using the AWS SDK for Java, but I'm having trouble getting spark-submit to use a JAR that I have stored on S3. Here is the relevant code:

    public String launchCluster() throws Exception {
        StepFactory stepFactory = new StepFactory();
        // Creates a cluster flow step for debugging
        StepConfig enableDebugging = new StepConfig()
                .withName("Enable debugging")
                .withActionOnFailure("TERMINATE_JOB_FLOW")
                .withHadoopJarStep(stepFactory.newEnableDebuggingStep());
        // Here is the
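The snippet cuts off before the step that submits the Spark job, but a common reason a remote JAR gets skipped is pointing the step's own Jar field at the S3 artifact instead of letting spark-submit fetch it. Below is a minimal sketch of the usual pattern, submitting through command-runner.jar; it uses Python with boto3 rather than the asker's Java SDK, and the cluster ID, bucket, and class name are placeholders:

    # Sketch: add a spark-submit step whose application JAR lives on S3.
    # Cluster ID, bucket, and main class are placeholders, not values
    # taken from the original question.
    import boto3

    emr = boto3.client("emr", region_name="us-east-1")

    step = {
        "Name": "Run Spark app from S3",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            # command-runner.jar invokes spark-submit on the cluster, so
            # Spark itself downloads the s3:// JAR instead of skipping it.
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "--deploy-mode", "cluster",
                "--class", "com.example.Main",
                "s3://my-bucket/app.jar",
            ],
        },
    }

    emr.add_job_flow_steps(JobFlowId="j-XXXXXXXXXXXXX", Steps=[step])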

s3 parquet write - too many partitions, slow writing

白昼怎懂夜的黑 submitted on 2019-12-13 02:47:41

Question: I have a Scala Spark job that writes to S3 as Parquet. It is 6 billion records so far, and it will keep growing daily. Per the use case, our API will query the Parquet data by id, so to make query results faster I am writing the Parquet partitioned on id. However, we have 1,330,360 unique ids, so this creates 1,330,360 Parquet files while writing, and the write step is very slow: it has been writing for the past 9 hours and is still running.

    output.write.mode("append").partitionBy("id")
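One way to bound the number of partition directories (a sketch in PySpark rather than the asker's Scala; the bucket count and paths are illustrative) is to partition on a hash bucket derived from the id, so a query prunes to one bucket directory and then filters on the exact id inside it:

    # Sketch: partition on a bounded hash bucket instead of the raw id,
    # capping the directory count at NUM_BUCKETS. Paths are placeholders.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("bucketed-parquet").getOrCreate()
    output = spark.read.parquet("s3://my-bucket/input/")  # placeholder source

    NUM_BUCKETS = 1024

    bucketed = output.withColumn("id_bucket", F.abs(F.hash("id")) % NUM_BUCKETS)
    (bucketed.write.mode("append")
        .partitionBy("id_bucket")
        .parquet("s3://my-bucket/output/"))

    # A reader computes the same bucket for a lookup id, then filters:
    #   df.where((F.col("id_bucket") == bucket) & (F.col("id") == some_id))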

Spark Step in AWS EMR fails with exitCode 13

拟墨画扇 submitted on 2019-12-13 02:47:31

Question: I'm experimenting with EMR a bit. I try to run a very simple Spark program:

    from pyspark.sql.types import IntegerType
    mylist = [1, 2, 3, 4]
    df = spark.createDataFrame(mylist, IntegerType()).show()
    df.write.parquet('/path/to/save', mode='overwrite')

I launch the app by adding a step in the AWS EMR web console: I select the app from S3, select deploy mode "cluster", and leave the rest blank. The app doesn't even launch; I get the following error code: Application application
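Two things in the script above are worth noting, although the truncated error makes it impossible to confirm they caused exit code 13: a script submitted as a step does not get a preconstructed spark session the way a shell does, and .show() returns None, so assigning it to df would break the write. A corrected sketch:

    # Corrected sketch of the step script: build the SparkSession
    # explicitly and keep the DataFrame separate from show()'s None.
    from pyspark.sql import SparkSession
    from pyspark.sql.types import IntegerType

    spark = SparkSession.builder.appName("simple-step").getOrCreate()

    mylist = [1, 2, 3, 4]
    df = spark.createDataFrame(mylist, IntegerType())
    df.show()  # show() prints and returns None; don't assign it
    df.write.parquet("/path/to/save", mode="overwrite")

    spark.stop()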

How to access custom UDFs through Spark Thrift Server?

依然范特西╮ submitted on 2019-12-12 16:10:02

Question: I am running Spark Thrift Server on EMR. I start up the Spark Thrift Server with:

    sudo -u spark /usr/lib/spark/sbin/start-thriftserver.sh --queue interactive.thrift --jars /opt/lib/custom-udfs.jar

Notice that I have a custom UDF jar, and I want to add it to the Thrift Server classpath, so I added --jars /opt/lib/custom-udfs.jar to the command above. Once I am on my EMR cluster, I issue the following to connect to the Spark Thrift Server:

    beeline -u jdbc:hive2://localhost:10000/default

Then I was able
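The question is cut off, but shipping a jar with --jars does not by itself expose its UDFs to SQL; each session still has to register a function name against the implementing class. A minimal sketch of that registration, using Python with PyHive in place of beeline (the statements are the same either way; the function and class names are hypothetical):

    # Sketch: register and call a custom UDF over the Thrift Server's
    # HiveServer2 endpoint. The function name and implementing class
    # are placeholders; PyHive stands in for beeline here.
    from pyhive import hive

    conn = hive.connect(host="localhost", port=10000, database="default")
    cur = conn.cursor()

    # Make the jar visible to this session, then bind a SQL function
    # name to the UDF class inside it.
    cur.execute("ADD JAR /opt/lib/custom-udfs.jar")
    cur.execute("CREATE TEMPORARY FUNCTION my_udf AS 'com.example.udf.MyUdf'")

    cur.execute("SELECT my_udf('hello')")
    print(cur.fetchall())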

Spark UI on AWS EMR

泄露秘密 submitted on 2019-12-12 07:47:00

Question: I am running an AWS EMR cluster with Spark (1.3.1) installed via the EMR console dropdown. Spark is running and processing data, but I am trying to find out which port has been assigned to the Web UI. I've tried port forwarding both 4040 and 8080 with no connection. I'm forwarding like so:

    ssh -i ~/KEY.pem -L 8080:localhost:8080 hadoop@EMR_DNS

1) How do I find out what port the Spark Web UI has been assigned? 2) How do I verify the Spark Web UI is running?

Answer 1: Spark on EMR is configured for YARN, thus the
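The answer is truncated, but with Spark running on YARN the application UI is reached through the YARN ResourceManager (port 8088 on the master node), which proxies each running application's UI. A sketch, assuming boto3 credentials and a known cluster ID, that looks up the master's address and prints the matching tunnel command:

    # Sketch: find the master's public DNS and print an SSH tunnel
    # command for the YARN ResourceManager, which proxies the Spark UI.
    # The cluster ID, key path, and region are placeholders.
    import boto3

    emr = boto3.client("emr", region_name="us-east-1")
    cluster = emr.describe_cluster(ClusterId="j-XXXXXXXXXXXXX")["Cluster"]
    master_dns = cluster["MasterPublicDnsName"]

    # Forward local 8088, then browse http://localhost:8088 and follow
    # the ApplicationMaster link of the running Spark application.
    print(f"ssh -i ~/KEY.pem -L 8088:localhost:8088 hadoop@{master_dns}")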

Backup DynamoDB Table with dynamic columns to S3

江枫思渺然 submitted on 2019-12-12 05:29:56

Question: I have read several other posts about this, in particular this question with an answer by greg about how to do it in Hive. I would like to know how to account for DynamoDB tables with variable numbers of columns, though. That is, the original DynamoDB table has rows that were added dynamically with different columns. I have tried to view the exportDynamoDBToS3 script that Amazon uses in their Data Pipeline service, but it has code like the following, which does not seem to map the columns: --
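One approach that sidesteps Hive column mapping entirely (a different technique from the Data Pipeline script the question references) is to scan the table and write each item to S3 as a JSON line, which tolerates any mix of attributes per row. A minimal boto3 sketch with placeholder table and bucket names; a very large table would need a parallel scan:

    # Sketch: back up a schemaless DynamoDB table to S3 as JSON lines.
    # Table, bucket, and key names are placeholders.
    import json
    import boto3

    dynamodb = boto3.resource("dynamodb")
    s3 = boto3.client("s3")

    table = dynamodb.Table("my-table")
    lines = []

    scan_kwargs = {}
    while True:
        page = table.scan(**scan_kwargs)
        for item in page["Items"]:
            # Each item keeps exactly the columns it was written with;
            # default=str covers DynamoDB's Decimal values.
            lines.append(json.dumps(item, default=str))
        if "LastEvaluatedKey" not in page:
            break
        scan_kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]

    s3.put_object(
        Bucket="my-backup-bucket",
        Key="dynamodb/my-table.jsonl",
        Body="\n".join(lines).encode("utf-8"),
    )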

What's the aws cli command to create the default EMR-managed security groups?

只谈情不闲聊 submitted on 2019-12-12 04:45:30

Question: When using the EMR web console, you can create a cluster, and AWS automatically creates the EMR-managed security groups named "ElasticMapReduce-master" and "ElasticMapReduce-slave". How do you create those via the aws cli? I found aws emr create-default-roles, but there is no aws emr create-default-security-groups.

Answer 1: As of right now, it looks like you can't. See http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-man-sec-groups.html, section "To specify Amazon EMR–managed security groups
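There is no dedicated command; the managed groups are created as a side effect of the first cluster launch. If the groups need to exist up front, one workaround (an assumption on my part, not from the truncated answer) is to create groups yourself and pass them as the managed groups at launch. A boto3 sketch with placeholder VPC, subnet, and instance values:

    # Sketch: pre-create security groups and hand them to EMR as the
    # managed master/slave groups; EMR then maintains the rules inside
    # them. All IDs and names below are placeholders.
    import boto3

    ec2 = boto3.client("ec2")
    emr = boto3.client("emr")

    def make_group(name):
        # Create an empty group in the cluster's VPC.
        return ec2.create_security_group(
            GroupName=name,
            Description="EMR managed group",
            VpcId="vpc-XXXXXXXX",
        )["GroupId"]

    master_sg = make_group("ElasticMapReduce-master")
    slave_sg = make_group("ElasticMapReduce-slave")

    emr.run_job_flow(
        Name="my-cluster",
        ReleaseLabel="emr-5.29.0",
        Instances={
            "MasterInstanceType": "m5.xlarge",
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": 3,
            "Ec2SubnetId": "subnet-XXXXXXXX",
            "EmrManagedMasterSecurityGroup": master_sg,
            "EmrManagedSlaveSecurityGroup": slave_sg,
        },
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )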

Deduce the HDFS path at runtime on EMR

自闭症网瘾萝莉.ら submitted on 2019-12-12 04:36:20

Question: I have spawned an EMR cluster with EMR steps to copy a file from S3 to HDFS and vice versa using s3-dist-cp. This cluster is an on-demand cluster, so we are not keeping track of the IP. The first EMR step is:

    hadoop fs -mkdir /input

This step completed successfully. The second EMR step uses the following command:

    s3-dist-cp --s3Endpoint=s3.amazonaws.com --src=s3://<bucket-name>/<folder-name>/sample.txt --dest=hdfs:///input

This step FAILED, and I get the following exception: Error
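The exception is cut off, but one frequent cause with this shape of command is that s3-dist-cp treats --src as a directory; pointing it at a single object often fails, and the usual workaround is to pass the folder as --src plus an --srcPattern that matches the file. A boto3 sketch of submitting the step that way (cluster ID and bucket/folder names are placeholders, and the pattern-based fix is a guess, since the real error is not shown):

    # Sketch: submit the s3-dist-cp step via command-runner.jar, using a
    # directory --src plus --srcPattern to select the single file.
    # Cluster ID, bucket, and folder are placeholders.
    import boto3

    emr = boto3.client("emr")

    step = {
        "Name": "Copy sample.txt from S3 to HDFS",
        "ActionOnFailure": "CANCEL_AND_WAIT",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "s3-dist-cp",
                "--s3Endpoint=s3.amazonaws.com",
                "--src=s3://my-bucket/my-folder/",
                "--srcPattern=.*sample\\.txt",
                "--dest=hdfs:///input",
            ],
        },
    }

    emr.add_job_flow_steps(JobFlowId="j-XXXXXXXXXXXXX", Steps=[step])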

Getting “Existing lock /var/run/yum.pid: another copy is running as pid …” during bootstrapping in EMR

二次信任 submitted on 2019-12-12 03:07:21

Question: I need to install python3 in my EMR cluster (AMI 3.1.1) as part of a bootstrapping step, so I added the following command:

    sudo yum install -y python3

But every time I get an error saying the following:

    Existing lock /var/run/yum.pid: another copy is running as pid 1829.
    Another app is currently holding the yum lock; waiting for it to exit...
    The other application is: yum

How can I avoid this error? Or is there a way to install Python 3 without going through this route?

Answer 1: The issue is that
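The answer is truncated, but the symptom matches the instance's own boot-time yum run still holding the lock when the bootstrap action starts. A common workaround is to wait for the lock to clear before installing; here is a sketch of such a bootstrap script in Python (an illustration, not the answer's code):

    #!/usr/bin/env python
    # Sketch of a bootstrap action: wait for the yum process started at
    # boot to release its lock, then install python3.
    import os
    import subprocess
    import time

    LOCK_FILE = "/var/run/yum.pid"

    # Poll until no other yum instance holds the lock.
    while os.path.exists(LOCK_FILE):
        time.sleep(5)

    subprocess.check_call(["sudo", "yum", "install", "-y", "python3"])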

Pig UDF running on AWS EMR with java.lang.NoClassDefFoundError: org/apache/pig/LoadFunc

人走茶凉 submitted on 2019-12-11 16:47:47

Question: I am developing an application that tries to read log files stored in S3 buckets and parse them using Elastic MapReduce. Currently the log file has the following format:

    -------------------------------
    COLOR=Black
    Date=1349719200
    PID=23898
    Program=Java
    EOE
    -------------------------------
    COLOR=White
    Date=1349719234
    PID=23828
    Program=Python
    EOE

So I try to load the file into my Pig script, but the built-in Pig loader doesn't seem to be able to load my data, so I have to create my own UDF. Since I am pretty
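As an aside on the error in the title, java.lang.NoClassDefFoundError: org/apache/pig/LoadFunc typically means the Pig jar the UDF was compiled against is missing from the runtime classpath. Independently of that, the record-parsing logic is easy to prototype outside Pig; here is a standalone Python sketch of parsing this format (not the asker's LoadFunc):

    # Sketch: parse the KEY=VALUE ... EOE record format into dicts.
    # A standalone prototype, not the Pig LoadFunc itself.
    def parse_records(lines):
        record = {}
        for line in lines:
            line = line.strip()
            if line == "EOE":
                # End of entry: emit the accumulated record.
                yield record
                record = {}
            elif "=" in line:
                key, _, value = line.partition("=")
                record[key] = value
            # Separator dashes and blank lines are ignored.

    sample = [
        "-------------------------------",
        "COLOR=Black", "Date=1349719200", "PID=23898", "Program=Java", "EOE",
        "-------------------------------",
        "COLOR=White", "Date=1349719234", "PID=23828", "Program=Python", "EOE",
    ]
    for rec in parse_records(sample):
        print(rec)  # e.g. {'COLOR': 'Black', 'Date': '1349719200', ...}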