emr

How can I get Spark on emr-5.2.1 to write to DynamoDB?

浪子不回头ぞ submitted on 2021-01-29 03:16:29
Question: According to this article, when I create an AWS EMR cluster that will use Spark to pipe data to DynamoDB, I need to preface with the line: spark-shell --jars /usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar. This line appears in numerous references, including from the Amazon devs themselves. However, when I run create-cluster with an added --jars flag, I get this error: Exception in thread "main" java.io.FileNotFoundException: File file:/usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar does not…
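The excerpt ends mid-error, but the FileNotFoundException points at a path that exists on the cluster's nodes, not on the machine where create-cluster runs, so the usual workaround is to pass --jars inside a step that executes on the cluster. A minimal boto3 sketch, where the cluster id, bucket, and job script are hypothetical:

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")
emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",  # hypothetical cluster id
    Steps=[{
        "Name": "spark-to-dynamodb",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",  # runs the command on the master node
            "Args": [
                "spark-submit",
                # the jar lives at this path on the cluster, not locally
                "--jars", "/usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar",
                "s3://my-bucket/jobs/write_to_dynamodb.py",  # hypothetical job
            ],
        },
    }],
)
```

Because the step runs on the master node, the --jars path is resolved there, where the EMR DynamoDB connector jar actually exists.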

Jupyter + EMR + Spark - Connect to EMR cluster from Jupyter notebook on local machine

微笑、不失礼 submitted on 2021-01-28 10:13:13
Question: I am new to PySpark and EMR. I am trying to access Spark running on an EMR cluster through a Jupyter notebook, but I am running into errors. I am generating the SparkSession using the following code: spark = SparkSession.builder \ .master("local[*]") \ .appName("Carbon - SingleWell parallelization on Spark") \ .getOrCreate() I tried the following to access the remote cluster, but it errored out: spark = SparkSession.builder \ .master("spark://<remote-emr-ec2-hostname>:7077") \ .appName("Carbon - SingleWell parallelization…
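One likely cause of the error: EMR runs Spark on YARN, so nothing listens on the standalone port 7077, and pointing .master("spark://...:7077") at an EMR master will fail. A REST gateway such as Apache Livy is one common way to drive the cluster from a local notebook. A minimal sketch, assuming Livy is running on the EMR master on its default port 8998 and is reachable from your machine (the hostname is a placeholder):

```python
import time
import requests

LIVY = "http://<remote-emr-ec2-hostname>:8998"  # placeholder hostname

# Open a PySpark session on the cluster and wait for it to become idle.
session = requests.post(f"{LIVY}/sessions", json={"kind": "pyspark"}).json()
session_url = f"{LIVY}/sessions/{session['id']}"
while requests.get(session_url).json()["state"] != "idle":
    time.sleep(2)

# Submit a statement; Spark executes it on the cluster, not locally.
result = requests.post(
    f"{session_url}/statements",
    json={"code": "sc.parallelize(range(100)).sum()"},
).json()
print(result)
```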

Spark + EMR using Amazon's “maximizeResourceAllocation” setting does not use all cores/vcores

强颜欢笑 submitted on 2020-08-20 18:01:06
Question: I'm running an EMR cluster (version emr-4.2.0) for Spark, using the Amazon-specific maximizeResourceAllocation flag as documented here. According to those docs, "this option calculates the maximum compute and memory resources available for an executor on a node in the core node group and sets the corresponding spark-defaults settings with this information". I'm running the cluster using m3.2xlarge instances for the worker nodes and a single m3.xlarge for the YARN master - the smallest…
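The excerpt is cut short, but since maximizeResourceAllocation is applied through an EMR configuration classification at cluster creation, explicitly overriding the spark-defaults classification in the same request is one way to claim every vcore if the computed values fall short. A minimal boto3 sketch, assuming the cluster is created via run_job_flow; the names and counts are illustrative, not a recommendation:

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")
configurations = [
    {"Classification": "spark",
     "Properties": {"maximizeResourceAllocation": "true"}},
    {"Classification": "spark-defaults",
     "Properties": {"spark.executor.cores": "8",      # m3.2xlarge has 8 vCPUs
                    "spark.executor.instances": "3"}},
]
emr.run_job_flow(
    Name="spark-cluster",                # hypothetical cluster name
    ReleaseLabel="emr-4.2.0",
    Applications=[{"Name": "Spark"}],
    Configurations=configurations,
    Instances={
        "MasterInstanceType": "m3.xlarge",
        "SlaveInstanceType": "m3.2xlarge",
        "InstanceCount": 4,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
```

Note that this is a cluster-creation setting, so changing it means launching a new cluster rather than editing a running one.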

How to restart Spark service in EMR after changing conf settings?

旧城冷巷雨未停 submitted on 2020-07-04 09:00:13
Question: I am using EMR-5.9.0, and after changing some configuration files I want to restart the service to see the effect. How can I achieve this? I tried to find the name of the service using initctl list, as I saw in other answers, but no luck... Answer 1: Since Spark runs as an application on Hadoop YARN, you can try sudo stop hadoop-yarn-resourcemanager followed by sudo start hadoop-yarn-resourcemanager. If you meant the Spark History Server, then you can use sudo stop spark-history-server and sudo start spark-history-server.
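For repeated configuration edits it can help to script the stop/start pair. A minimal Python sketch, assuming it runs on the master node of an emr-5.x cluster (Amazon Linux with upstart) under a user with sudo rights:

```python
import subprocess

def restart_service(name: str) -> None:
    """Stop and start an upstart-managed EMR service by name."""
    subprocess.run(["sudo", "stop", name], check=False)   # ignore "not running"
    subprocess.run(["sudo", "start", name], check=True)

restart_service("hadoop-yarn-resourcemanager")   # YARN RM (Spark runs on YARN)
restart_service("spark-history-server")          # Spark History Server
```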

Presto coordinator returning 404 error when connecting through Teradata ODBC driver

半腔热情 submitted on 2020-06-29 08:58:16
Question: I am attempting to connect to a Presto coordinator that resides on an EMR cluster, using the Teradata ODBC driver. I have tested the driver by putting the pertinent details into the DSN via the ODBC connections dialog, and I have also written a simple C# application that creates a connection (see the code below). The problem is that a 404 error is returned when the connection is either tested in the DSN dialog or opened in the C# code. I believe the security group settings in AWS are fine…
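The excerpt is truncated, but one hypothesis worth ruling out first: on EMR, the Presto coordinator listens on port 8889 rather than Presto's usual 8080, and a DSN pointed at the wrong port (or at another web service on the master) can surface as a 404. A minimal Python sketch, with a placeholder hostname, that probes the coordinator's REST info endpoint directly before blaming the ODBC configuration:

```python
import requests

host = "<emr-master-public-dns>"  # placeholder
for port in (8889, 8080):
    try:
        # Presto exposes basic server info at /v1/info; a 200 here means the
        # coordinator is reachable on that port.
        resp = requests.get(f"http://{host}:{port}/v1/info", timeout=5)
        print(port, resp.status_code, resp.text[:200])
    except requests.ConnectionError as exc:
        print(port, "no connection:", exc)
```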

Spark RDD method “saveAsTextFile” throwing exception even after deleting the output directory: org.apache.hadoop.mapred.FileAlreadyExistsException

血红的双手。 submitted on 2020-04-13 17:20:18
Question: I am calling this method on an RDD[String] with the destination in the arguments (Scala). Even after deleting the directory before starting, the process gives this error. I am running this process on an EMR cluster with the output location in AWS S3. Below is the command used: spark-submit --deploy-mode cluster --class com.hotwire.hda.spark.prd.pricingengine.PRDPricingEngine --conf spark.yarn.submit.waitAppCompletion=true --num-executors 21 --executor-cores 4 --executor-memory 20g --driver-memory 8g -…
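The command is truncated, but the exception itself comes from Hadoop's output validation: saveAsTextFile refuses to write if anything still exists under the target path, and on S3 stray _$folder$ marker keys can survive a console delete. A minimal PySpark sketch, with hypothetical bucket and prefix names, that clears the prefix via boto3 before writing:

```python
import boto3
from pyspark import SparkContext

sc = SparkContext(appName="SaveAsTextFileDemo")

bucket, prefix = "my-bucket", "output/run1"  # hypothetical names
s3 = boto3.resource("s3")
# Deletes every key under the prefix, including any "output/run1_$folder$"
# marker keys that a console delete may have left behind.
s3.Bucket(bucket).objects.filter(Prefix=prefix).delete()

rdd = sc.parallelize(["a", "b", "c"])
rdd.saveAsTextFile(f"s3://{bucket}/{prefix}")
```

If overwriting is acceptable, setting spark.hadoop.validateOutputSpecs=false skips the existence check entirely, at the cost of silently clobbering old output.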
