amazon-emr

Trying to run Spark on EMR using the AWS SDK for Java, but it skips the remote JAR stored on S3

99封情书 submitted on 2019-12-13 03:56:50

Question: I'm trying to run Spark on EMR using the AWS SDK for Java, but I'm having trouble getting spark-submit to use a JAR that I have stored on S3. Here is the relevant code:

    public String launchCluster() throws Exception {
        StepFactory stepFactory = new StepFactory();
        // Creates a cluster flow step for debugging
        StepConfig enableDebugging = new StepConfig()
                .withName("Enable debugging")
                .withActionOnFailure("TERMINATE_JOB_FLOW")
                .withHadoopJarStep(stepFactory.newEnableDebuggingStep());
        // Here is the
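The snippet cuts off before the step that submits the Spark job, but a common reason a remote JAR gets skipped is pointing the step's own Jar field at the S3 artifact instead of letting spark-submit fetch it. Below is a minimal sketch of the usual pattern, submitting through command-runner.jar; it uses Python with boto3 rather than the asker's Java SDK, and the cluster ID, bucket, and class name are placeholders:

    # Sketch: add a spark-submit step whose application JAR lives on S3.
    # Cluster ID, bucket, and main class are placeholders, not values
    # taken from the original question.
    import boto3

    emr = boto3.client("emr", region_name="us-east-1")

    step = {
        "Name": "Run Spark app from S3",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            # command-runner.jar invokes spark-submit on the cluster, so
            # Spark itself downloads the s3:// JAR instead of skipping it.
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "--deploy-mode", "cluster",
                "--class", "com.example.Main",
                "s3://my-bucket/app.jar",
            ],
        },
    }

    emr.add_job_flow_steps(JobFlowId="j-XXXXXXXXXXXXX", Steps=[step])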

s3 parquet write - too many partitions, slow writing

白昼怎懂夜的黑 submitted on 2019-12-13 02:47:41

Question: I have a Scala Spark job that writes to S3 as Parquet. It is 6 billion records so far, and it will keep growing daily. Per the use case, our API will query the Parquet data by id, so to make query results faster I am writing the Parquet partitioned on id. However, we have 1,330,360 unique ids, so this creates 1,330,360 Parquet files while writing, and the write step is very slow: it has been writing for the past 9 hours and is still running.

    output.write.mode("append").partitionBy("id")
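One way to bound the number of partition directories (a sketch in PySpark rather than the asker's Scala; the bucket count and paths are illustrative) is to partition on a hash bucket derived from the id, so a query prunes to one bucket directory and then filters on the exact id inside it:

    # Sketch: partition on a bounded hash bucket instead of the raw id,
    # capping the directory count at NUM_BUCKETS. Paths are placeholders.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("bucketed-parquet").getOrCreate()
    output = spark.read.parquet("s3://my-bucket/input/")  # placeholder source

    NUM_BUCKETS = 1024

    bucketed = output.withColumn("id_bucket", F.abs(F.hash("id")) % NUM_BUCKETS)
    (bucketed.write.mode("append")
        .partitionBy("id_bucket")
        .parquet("s3://my-bucket/output/"))

    # A reader computes the same bucket for a lookup id, then filters:
    #   df.where((F.col("id_bucket") == bucket) & (F.col("id") == some_id))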

Spark Step in AWS EMR fails with exitCode 13

拟墨画扇 submitted on 2019-12-13 02:47:31

Question: I'm experimenting with EMR a bit. I try to run a very simple Spark program:

    from pyspark.sql.types import IntegerType
    mylist = [1, 2, 3, 4]
    df = spark.createDataFrame(mylist, IntegerType()).show()
    df.write.parquet('/path/to/save', mode='overwrite')

I launch the app by adding a step in the AWS EMR web console: I select the app from S3, select deploy mode "cluster", and leave the rest blank. The app doesn't even launch; I get the following error code: Application application
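Two things in the script above are worth noting, although the truncated error makes it impossible to confirm they caused exit code 13: a script submitted as a step does not get a preconstructed spark session the way a shell does, and .show() returns None, so assigning it to df would break the write. A corrected sketch:

    # Corrected sketch of the step script: build the SparkSession
    # explicitly and keep the DataFrame separate from show()'s None.
    from pyspark.sql import SparkSession
    from pyspark.sql.types import IntegerType

    spark = SparkSession.builder.appName("simple-step").getOrCreate()

    mylist = [1, 2, 3, 4]
    df = spark.createDataFrame(mylist, IntegerType())
    df.show()  # show() prints and returns None; don't assign it
    df.write.parquet("/path/to/save", mode="overwrite")

    spark.stop()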

How to access custom UDFs through Spark Thrift Server?

依然范特西╮ submitted on 2019-12-12 16:10:02

Question: I am running Spark Thrift Server on EMR. I start up the Spark Thrift Server with:

    sudo -u spark /usr/lib/spark/sbin/start-thriftserver.sh --queue interactive.thrift --jars /opt/lib/custom-udfs.jar

Notice that I have a custom UDF jar, and I want to add it to the Thrift Server classpath, so I added --jars /opt/lib/custom-udfs.jar to the command above. Once I am on my EMR cluster, I issue the following to connect to the Spark Thrift Server:

    beeline -u jdbc:hive2://localhost:10000/default

Then I was able
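The question is cut off, but shipping a jar with --jars does not by itself expose its UDFs to SQL; each session still has to register a function name against the implementing class. A minimal sketch of that registration, using Python with PyHive in place of beeline (the statements are the same either way; the function and class names are hypothetical):

    # Sketch: register and call a custom UDF over the Thrift Server's
    # HiveServer2 endpoint. The function name and implementing class
    # are placeholders; PyHive stands in for beeline here.
    from pyhive import hive

    conn = hive.connect(host="localhost", port=10000, database="default")
    cur = conn.cursor()

    # Make the jar visible to this session, then bind a SQL function
    # name to the UDF class inside it.
    cur.execute("ADD JAR /opt/lib/custom-udfs.jar")
    cur.execute("CREATE TEMPORARY FUNCTION my_udf AS 'com.example.udf.MyUdf'")

    cur.execute("SELECT my_udf('hello')")
    print(cur.fetchall())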

Spark UI on AWS EMR

泄露秘密 submitted on 2019-12-12 07:47:00

Question: I am running an AWS EMR cluster with Spark (1.3.1) installed via the EMR console dropdown. Spark is running and processing data, but I am trying to find out which port has been assigned to the Web UI. I've tried port forwarding both 4040 and 8080 with no connection. I'm forwarding like so:

    ssh -i ~/KEY.pem -L 8080:localhost:8080 hadoop@EMR_DNS

1) How do I find out what port the Spark Web UI has been assigned? 2) How do I verify the Spark Web UI is running?

Answer 1: Spark on EMR is configured for YARN, thus the
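The answer is truncated, but with Spark running on YARN the application UI is reached through the YARN ResourceManager (port 8088 on the master node), which proxies each running application's UI. A sketch, assuming boto3 credentials and a known cluster ID, that looks up the master's address and prints the matching tunnel command:

    # Sketch: find the master's public DNS and print an SSH tunnel
    # command for the YARN ResourceManager, which proxies the Spark UI.
    # The cluster ID, key path, and region are placeholders.
    import boto3

    emr = boto3.client("emr", region_name="us-east-1")
    cluster = emr.describe_cluster(ClusterId="j-XXXXXXXXXXXXX")["Cluster"]
    master_dns = cluster["MasterPublicDnsName"]

    # Forward local 8088, then browse http://localhost:8088 and follow
    # the ApplicationMaster link of the running Spark application.
    print(f"ssh -i ~/KEY.pem -L 8088:localhost:8088 hadoop@{master_dns}")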

Backup DynamoDB Table with dynamic columns to S3

江枫思渺然 submitted on 2019-12-12 05:29:56

Question: I have read several other posts about this, in particular this question with an answer by greg about how to do it in Hive. I would like to know how to account for DynamoDB tables with variable numbers of columns, though. That is, the original DynamoDB table has rows that were added dynamically with different columns. I have tried to view the exportDynamoDBToS3 script that Amazon uses in their Data Pipeline service, but it has code like the following, which does not seem to map the columns: --
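One approach that sidesteps Hive column mapping entirely (a different technique from the Data Pipeline script the question references) is to scan the table and write each item to S3 as a JSON line, which tolerates any mix of attributes per row. A minimal boto3 sketch with placeholder table and bucket names; a very large table would need a parallel scan:

    # Sketch: back up a schemaless DynamoDB table to S3 as JSON lines.
    # Table, bucket, and key names are placeholders.
    import json
    import boto3

    dynamodb = boto3.resource("dynamodb")
    s3 = boto3.client("s3")

    table = dynamodb.Table("my-table")
    lines = []

    scan_kwargs = {}
    while True:
        page = table.scan(**scan_kwargs)
        for item in page["Items"]:
            # Each item keeps exactly the columns it was written with;
            # default=str covers DynamoDB's Decimal values.
            lines.append(json.dumps(item, default=str))
        if "LastEvaluatedKey" not in page:
            break
        scan_kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]

    s3.put_object(
        Bucket="my-backup-bucket",
        Key="dynamodb/my-table.jsonl",
        Body="\n".join(lines).encode("utf-8"),
    )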

What's the aws cli command to create the default EMR-managed security groups?

只谈情不闲聊 submitted on 2019-12-12 04:45:30

Question: When using the EMR web console, you can create a cluster, and AWS automatically creates the EMR-managed security groups named "ElasticMapReduce-master" and "ElasticMapReduce-slave". How do you create those via the aws cli? I found aws emr create-default-roles, but there is no aws emr create-default-security-groups.

Answer 1: As of right now, it looks like you can't. See http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-man-sec-groups.html, section "To specify Amazon EMR–managed security groups
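There is no dedicated command; the managed groups are created as a side effect of the first cluster launch. If the groups need to exist up front, one workaround (an assumption on my part, not from the truncated answer) is to create groups yourself and pass them as the managed groups at launch. A boto3 sketch with placeholder VPC, subnet, and instance values:

    # Sketch: pre-create security groups and hand them to EMR as the
    # managed master/slave groups; EMR then maintains the rules inside
    # them. All IDs and names below are placeholders.
    import boto3

    ec2 = boto3.client("ec2")
    emr = boto3.client("emr")

    def make_group(name):
        # Create an empty group in the cluster's VPC.
        return ec2.create_security_group(
            GroupName=name,
            Description="EMR managed group",
            VpcId="vpc-XXXXXXXX",
        )["GroupId"]

    master_sg = make_group("ElasticMapReduce-master")
    slave_sg = make_group("ElasticMapReduce-slave")

    emr.run_job_flow(
        Name="my-cluster",
        ReleaseLabel="emr-5.29.0",
        Instances={
            "MasterInstanceType": "m5.xlarge",
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": 3,
            "Ec2SubnetId": "subnet-XXXXXXXX",
            "EmrManagedMasterSecurityGroup": master_sg,
            "EmrManagedSlaveSecurityGroup": slave_sg,
        },
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )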

Deduce the HDFS path at runtime on EMR

自闭症网瘾萝莉.ら submitted on 2019-12-12 04:36:20

Question: I have spawned an EMR cluster with EMR steps to copy a file from S3 to HDFS and vice versa using s3-dist-cp. This cluster is an on-demand cluster, so we are not keeping track of the IP. The first EMR step is:

    hadoop fs -mkdir /input

This step completed successfully. The second EMR step uses the following command:

    s3-dist-cp --s3Endpoint=s3.amazonaws.com --src=s3://<bucket-name>/<folder-name>/sample.txt --dest=hdfs:///input

This step FAILED, and I get the following exception: Error
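The exception is cut off, but one frequent cause with this shape of command is that s3-dist-cp treats --src as a directory; pointing it at a single object often fails, and the usual workaround is to pass the folder as --src plus an --srcPattern that matches the file. A boto3 sketch of submitting the step that way (cluster ID and bucket/folder names are placeholders, and the pattern-based fix is a guess, since the real error is not shown):

    # Sketch: submit the s3-dist-cp step via command-runner.jar, using a
    # directory --src plus --srcPattern to select the single file.
    # Cluster ID, bucket, and folder are placeholders.
    import boto3

    emr = boto3.client("emr")

    step = {
        "Name": "Copy sample.txt from S3 to HDFS",
        "ActionOnFailure": "CANCEL_AND_WAIT",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "s3-dist-cp",
                "--s3Endpoint=s3.amazonaws.com",
                "--src=s3://my-bucket/my-folder/",
                "--srcPattern=.*sample\\.txt",
                "--dest=hdfs:///input",
            ],
        },
    }

    emr.add_job_flow_steps(JobFlowId="j-XXXXXXXXXXXXX", Steps=[step])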

Getting “Existing lock /var/run/yum.pid: another copy is running as pid …” during bootstrapping in EMR

二次信任 submitted on 2019-12-12 03:07:21

Question: I need to install python3 in my EMR cluster (AMI 3.1.1) as part of a bootstrapping step, so I added the following command:

    sudo yum install -y python3

But every time I get an error saying the following:

    Existing lock /var/run/yum.pid: another copy is running as pid 1829.
    Another app is currently holding the yum lock; waiting for it to exit...
    The other application is: yum

How can I avoid this error? Or is there a way to install Python 3 without going through this route?

Answer 1: The issue is that
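The answer is truncated, but the symptom matches the instance's own boot-time yum run still holding the lock when the bootstrap action starts. A common workaround is to wait for the lock to clear before installing; here is a sketch of such a bootstrap script in Python (an illustration, not the answer's code):

    #!/usr/bin/env python
    # Sketch of a bootstrap action: wait for the yum process started at
    # boot to release its lock, then install python3.
    import os
    import subprocess
    import time

    LOCK_FILE = "/var/run/yum.pid"

    # Poll until no other yum instance holds the lock.
    while os.path.exists(LOCK_FILE):
        time.sleep(5)

    subprocess.check_call(["sudo", "yum", "install", "-y", "python3"])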

Pig UDF running on AWS EMR with java.lang.NoClassDefFoundError: org/apache/pig/LoadFunc

人走茶凉 submitted on 2019-12-11 16:47:47

Question: I am developing an application that tries to read log files stored in S3 buckets and parse them using Elastic MapReduce. Currently the log file has the following format:

    -------------------------------
    COLOR=Black
    Date=1349719200
    PID=23898
    Program=Java
    EOE
    -------------------------------
    COLOR=White
    Date=1349719234
    PID=23828
    Program=Python
    EOE

So I try to load the file into my Pig script, but the built-in Pig loader doesn't seem to be able to load my data, so I have to create my own UDF. Since I am pretty
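As an aside on the error in the title, java.lang.NoClassDefFoundError: org/apache/pig/LoadFunc typically means the Pig jar the UDF was compiled against is missing from the runtime classpath. Independently of that, the record-parsing logic is easy to prototype outside Pig; here is a standalone Python sketch of parsing this format (not the asker's LoadFunc):

    # Sketch: parse the KEY=VALUE ... EOE record format into dicts.
    # A standalone prototype, not the Pig LoadFunc itself.
    def parse_records(lines):
        record = {}
        for line in lines:
            line = line.strip()
            if line == "EOE":
                # End of entry: emit the accumulated record.
                yield record
                record = {}
            elif "=" in line:
                key, _, value = line.partition("=")
                record[key] = value
            # Separator dashes and blank lines are ignored.

    sample = [
        "-------------------------------",
        "COLOR=Black", "Date=1349719200", "PID=23898", "Program=Java", "EOE",
        "-------------------------------",
        "COLOR=White", "Date=1349719234", "PID=23828", "Program=Python", "EOE",
    ]
    for rec in parse_records(sample):
        print(rec)  # e.g. {'COLOR': 'Black', 'Date': '1349719200', ...}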