amazon-emr

Cannot connect to remote MongoDB from EMR cluster with spark-shell

Submitted by 半城伤御伤魂 on 2020-06-29 12:29:06
Question: I'm trying to connect to a remote Mongo database from an EMR cluster. The following code is executed with the command spark-shell --packages com.stratio.datasource:spark-mongodb_2.10:0.11.2:

    import com.stratio.datasource.mongodb._
    import com.stratio.datasource.mongodb.config._
    import com.stratio.datasource.mongodb.config.MongodbConfig._

    val builder = MongodbConfigBuilder(Map(Host -> List("[IP.OF.REMOTE.HOST]:3001"), Database -> "meteor", Collection -> "my_target_collection", ("user", "user…
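
Since this is a connectivity failure, a useful first check is whether the EMR nodes can reach the remote host on port 3001 at all. The following is only a hedged diagnostic sketch in Python, not part of the original post; the host is a placeholder from the question, and the pymongo ping assumes pymongo is installed on the node:

    import socket

    host, port = "IP.OF.REMOTE.HOST", 3001   # placeholders taken from the question

    # Raw TCP check: fails fast if a security group or firewall blocks the port
    try:
        socket.create_connection((host, port), timeout=5).close()
        print("TCP connection OK")
    except OSError as exc:
        print("Cannot reach the MongoDB host:", exc)

    # Optional: confirm MongoDB itself answers (requires the pymongo package)
    try:
        from pymongo import MongoClient
        client = MongoClient(host, port, serverSelectionTimeoutMS=5000)
        client.admin.command("ping")
        print("MongoDB ping OK")
    except Exception as exc:
        print("MongoDB ping failed:", exc)

If the TCP check already fails from the core nodes, the problem is networking (security groups, VPC routing, or the Mongo bind address) rather than the Stratio connector itself.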

Egg/JAR equivalent for Sparklyr projects

Submitted by 隐身守侯 on 2020-06-26 14:17:26
Question: We have a sparklyr project which is set up like this:

    # load functions
    source('./a.R')
    source('./b.R')
    source('./c.R')
    ....

    # main script computations
    sc <- spark_connect(...)
    read_csv(sc, s3://path)
    ....

Running it on EMR:

    spark-submit --deploy-mode client s3://path/to/my/script.R

Running this script using spark-submit fails, since spark-submit seems to take only a single R script, but we are sourcing functions from multiple files. Is there a way we can package this as an egg/jar file with all of…
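
For comparison, the egg/zip pattern the question is asking for is what PySpark users do with --py-files: helper modules are bundled into a single archive and shipped alongside one entry-point script. This is only a hedged sketch of that Python-side workflow (the file names are hypothetical), not a sparklyr answer, but it illustrates the packaging model being asked about:

    import zipfile

    # Bundle the helper modules (the analogue of a.R, b.R, c.R) into one archive
    helpers = ["a.py", "b.py", "c.py"]          # hypothetical helper modules
    with zipfile.ZipFile("deps.zip", "w") as zf:
        for path in helpers:
            zf.write(path)

    # The archive then travels with the single entry-point script, e.g.:
    #   spark-submit --deploy-mode client --py-files deps.zip s3://path/to/main.py
    # and main.py imports the helpers normally (import a, import b, ...).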

How do you automate pyspark jobs on emr using boto3 (or otherwise)?

Submitted by 吃可爱长大的小学妹 on 2020-06-24 04:51:08
Question: I am creating a job to parse massive amounts of server data and then upload it into a Redshift database. My job flow is as follows:

1. Grab the log data from S3.
2. Use either Spark DataFrames or Spark SQL to parse the data and write it back out to S3.
3. Upload the data from S3 to Redshift.

I'm getting hung up on how to automate this, though, so that my process spins up an EMR cluster, bootstraps the correct programs for installation, and runs my Python script that will contain the code for parsing and…
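
One common way to automate the whole flow is to have boto3 create the cluster, run the PySpark script as a step, and let EMR tear itself down when the steps finish. The sketch below is a minimal, hedged example of that pattern; the region, instance types, release label, and S3 paths are all placeholders, not values from the question:

    import boto3

    emr = boto3.client("emr", region_name="us-east-1")   # placeholder region

    response = emr.run_job_flow(
        Name="parse-server-logs",
        ReleaseLabel="emr-5.30.0",                        # placeholder release
        Applications=[{"Name": "Spark"}],
        Instances={
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
            ],
            # With no keep-alive, the cluster terminates once all steps finish
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        BootstrapActions=[{
            "Name": "install-deps",
            "ScriptBootstrapAction": {"Path": "s3://my-bucket/bootstrap.sh"},   # hypothetical
        }],
        Steps=[{
            "Name": "parse-and-write",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "--deploy-mode", "cluster",
                         "s3://my-bucket/parse_logs.py"],                       # hypothetical script
            },
        }],
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )
    print("Started cluster", response["JobFlowId"])

The Redshift load can be appended as a further step (or handled afterwards with a COPY command), and the same Steps list can also be attached to an already-running cluster with add_job_flow_steps.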

AWS DAX cluster has zero cache hits and cache miss

Submitted by 淺唱寂寞╮ on 2020-06-17 09:15:15
Question: I'm using an AWS DAX cluster of 3 nodes of the dax.r4.xlarge node type. When I run my Spark application from the EMR cluster, it always fetches values from the DynamoDB table. Even if I run the same application on the same set of keys, it queries the DynamoDB table. In the DAX cluster metrics I see 0 cache hits and misses.

Answer 1: I found the mistake. Initially I was hitting DynamoDB directly and was using consistent reads by defining the get-item input parameter as:

    ConsistentRead: aws.Bool(true)

When I…
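
The snippet in the answer is from the Go SDK; the point it makes is that DAX serves only eventually consistent reads from its item cache, so a strongly consistent GetItem is passed through to DynamoDB and never counts as a hit. A hedged Python illustration of the same flag (table and key are hypothetical, and the plain boto3 client is used here only for brevity; the DAX client exposes the same GetItem parameter):

    import boto3

    dynamodb = boto3.client("dynamodb")
    key = {"pk": {"S": "some-id"}}          # hypothetical table key schema

    # Strongly consistent read: bypasses the DAX item cache entirely
    dynamodb.get_item(TableName="my_table", Key=key, ConsistentRead=True)

    # Eventually consistent read (the default): eligible for DAX caching
    dynamodb.get_item(TableName="my_table", Key=key, ConsistentRead=False)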

Duplicate partition columns on write s3

Submitted by *爱你&永不变心* on 2020-06-12 15:41:55
Question: I'm processing data and writing it to S3 using the following code:

    spark = SparkSession.builder.config('spark.sql.sources.partitionOverwriteMode', 'dynamic').getOrCreate()

    df = spark.read.parquet('s3://<some bucket>/<some path>').filter(F.col('processing_hr') == <val>)
    transformed_df = do_lots_of_transforms(df)

    # here's the important bit on how I'm writing it out
    transformed_df.write.mode('overwrite').partitionBy('processing_hr').parquet('s3://bucket_name/location')

Basically, I'm trying to…
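
Since the title suggests the partition column ends up duplicated in the written paths, one quick diagnostic (a hedged sketch, not from the original post; bucket and prefix are the question's own placeholders) is to list the output keys and count how many processing_hr= segments each one contains:

    import boto3

    s3 = boto3.client("s3")
    bucket, prefix = "bucket_name", "location/"    # placeholders from the question

    suspicious = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            # A healthy layout has exactly one processing_hr=<val> segment per key
            if obj["Key"].count("processing_hr=") > 1:
                suspicious.append(obj["Key"])

    print(f"{len(suspicious)} keys with a duplicated partition segment")
    for key in suspicious[:10]:
        print(" ", key)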

How to terminate AWS EMR Cluster automatically after some time

Submitted by 柔情痞子 on 2020-06-11 01:03:32
Question: I currently have a task at hand to terminate a long-running EMR cluster after a set period of time (based on some metric). Google Dataproc has this capability in something called "Cluster Scheduled Deletion", described here: https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/scheduled-deletion Is this something that is possible on EMR natively? Maybe using CloudWatch metrics? Or can I write a long-running jar which will sit on the EMR master node and just poll YARN for some idle…
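
One way to get this behavior is exactly the CloudWatch route the question mentions: EMR publishes an IsIdle metric per cluster, so a small script, run on a schedule from a Lambda or from the master node, can terminate the cluster once it has been idle long enough. The following is a hedged sketch; the cluster id and idle window are placeholders:

    import boto3
    from datetime import datetime, timedelta

    CLUSTER_ID = "j-XXXXXXXXXXXXX"     # placeholder cluster id
    IDLE_MINUTES = 60                  # terminate after an hour of idleness

    cloudwatch = boto3.client("cloudwatch")
    emr = boto3.client("emr")

    # EMR publishes IsIdle (1 = no running or pending work) every five minutes
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/ElasticMapReduce",
        MetricName="IsIdle",
        Dimensions=[{"Name": "JobFlowId", "Value": CLUSTER_ID}],
        StartTime=datetime.utcnow() - timedelta(minutes=IDLE_MINUTES),
        EndTime=datetime.utcnow(),
        Period=300,
        Statistics=["Minimum"],
    )

    datapoints = stats["Datapoints"]
    # If every datapoint in the window reports idle, shut the cluster down
    if datapoints and all(dp["Minimum"] == 1.0 for dp in datapoints):
        emr.terminate_job_flows(JobFlowIds=[CLUSTER_ID])
        print("Cluster was idle for the whole window; termination requested")
    else:
        print("Cluster still busy (or no datapoints yet); leaving it running")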

ETL 1.5 GB Dataframe within pyspark on AWS EMR

Submitted by 孤人 on 2020-06-01 07:39:21
Question: I'm using an EMR cluster with 1 master (m5.2xlarge) and 4 core nodes (c5.2xlarge) and running a PySpark job on it which joins 5 fact tables (150 columns and 100k rows each) and 5 small dimension tables (10 columns each, with fewer than 100 records). When I join all these tables, the resulting dataframe will have 600 columns and 420k records (approximately 1.5 GB of data). Please suggest something here; I'm from a SQL and DWH background, hence I have used a single SQL query to join all 5 facts…
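
With dimension tables this small (under 100 rows each), the standard Spark tactic is to broadcast them so that only the fact-to-fact joins shuffle data. The sketch below is a hedged outline of that approach, not the asker's actual query; the paths and the join key are hypothetical:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import broadcast

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical input locations for the 5 fact and 5 dimension tables
    facts = [spark.read.parquet(f"s3://my-bucket/facts/fact{i}/") for i in range(1, 6)]
    dims = [spark.read.parquet(f"s3://my-bucket/dims/dim{i}/") for i in range(1, 6)]

    # Fact-to-fact joins: these are the joins that genuinely need a shuffle
    joined = facts[0]
    for fact in facts[1:]:
        joined = joined.join(fact, "join_key")            # hypothetical join key

    # The tiny dimension tables are broadcast to every executor, avoiding shuffles
    for dim in dims:
        joined = joined.join(broadcast(dim), "join_key")  # hypothetical join key

    joined.write.mode("overwrite").parquet("s3://my-bucket/output/")   # placeholder path

Spark will usually broadcast tables this small on its own (spark.sql.autoBroadcastJoinThreshold), but the explicit hint keeps the plan predictable.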

Sqoop import postgres to S3 failing

Submitted by 只谈情不闲聊 on 2020-06-01 07:22:05
Question: I'm currently importing Postgres data to HDFS. I'm planning to move the storage from HDFS to S3. When I try to provide an S3 location, the Sqoop job fails. I'm running it on an EMR (emr-5.27.0) cluster, and I have read/write access to that S3 bucket from all nodes in the cluster.

    sqoop import \
      --connect "jdbc:postgresql://<machine_ip>:<port>/<database>?sslfactory=org.postgresql.ssl.NonValidatingFactory&ssl=true" \
      --username <username> \
      --password-file <password_file_path> \
      --table …
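
Since the same import works against HDFS, a quick sanity check (a hedged sketch, not from the original post; the bucket and key names are placeholders) is whether the instance role on the nodes that run the Sqoop mappers can actually write to the bucket, ideally run from a core node rather than only the master:

    import boto3
    from botocore.exceptions import ClientError

    bucket = "my-target-bucket"                    # placeholder
    key = "sqoop-import-test/_write_check"         # placeholder

    s3 = boto3.client("s3")
    try:
        s3.put_object(Bucket=bucket, Key=key, Body=b"write check")
        s3.delete_object(Bucket=bucket, Key=key)
        print("Instance role can write to the bucket")
    except ClientError as exc:
        # An AccessDenied here points at the EC2 instance profile, not at Sqoop
        print("S3 write failed:", exc.response["Error"]["Code"])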