amazon-emr

Cannot connect to remote MongoDB from EMR cluster with spark-shell

Submitted by 半城伤御伤魂 on 2020-06-29 12:29:06
Question: I'm trying to connect to a remote Mongo database from an EMR cluster. The following code is executed with the command spark-shell --packages com.stratio.datasource:spark-mongodb_2.10:0.11.2:

    import com.stratio.datasource.mongodb._
    import com.stratio.datasource.mongodb.config._
    import com.stratio.datasource.mongodb.config.MongodbConfig._

    val builder = MongodbConfigBuilder(Map(Host -> List("[IP.OF.REMOTE.HOST]:3001"), Database -> "meteor", Collection -> "my_target_collection", ("user", "user…
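
Since this is a connectivity failure, a useful first check is whether the EMR nodes can reach the remote host on port 3001 at all. The following is only a hedged diagnostic sketch in Python, not part of the original post; the host is a placeholder from the question, and the pymongo ping assumes pymongo is installed on the node:

    import socket

    host, port = "IP.OF.REMOTE.HOST", 3001   # placeholders taken from the question

    # Raw TCP check: fails fast if a security group or firewall blocks the port
    try:
        socket.create_connection((host, port), timeout=5).close()
        print("TCP connection OK")
    except OSError as exc:
        print("Cannot reach the MongoDB host:", exc)

    # Optional: confirm MongoDB itself answers (requires the pymongo package)
    try:
        from pymongo import MongoClient
        client = MongoClient(host, port, serverSelectionTimeoutMS=5000)
        client.admin.command("ping")
        print("MongoDB ping OK")
    except Exception as exc:
        print("MongoDB ping failed:", exc)

If the TCP check already fails from the core nodes, the problem is networking (security groups, VPC routing, or the Mongo bind address) rather than the Stratio connector itself.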

Egg/JAR equivalent for Sparklyr projects

Submitted by 隐身守侯 on 2020-06-26 14:17:26
Question: We have a sparklyr project which is set up like this:

    # load functions
    source('./a.R')
    source('./b.R')
    source('./c.R')
    ....

    # main script computations
    sc <- spark_connect(...)
    read_csv(sc, s3://path)
    ....

Running it on EMR:

    spark-submit --deploy-mode client s3://path/to/my/script.R

Running this script using spark-submit fails, since spark-submit seems to take only a single R script, but we are sourcing functions from multiple files. Is there a way we can package this as an egg/jar file with all of…
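
For comparison, the egg/zip pattern the question is asking for is what PySpark users do with --py-files: helper modules are bundled into a single archive and shipped alongside one entry-point script. This is only a hedged sketch of that Python-side workflow (the file names are hypothetical), not a sparklyr answer, but it illustrates the packaging model being asked about:

    import zipfile

    # Bundle the helper modules (the analogue of a.R, b.R, c.R) into one archive
    helpers = ["a.py", "b.py", "c.py"]          # hypothetical helper modules
    with zipfile.ZipFile("deps.zip", "w") as zf:
        for path in helpers:
            zf.write(path)

    # The archive then travels with the single entry-point script, e.g.:
    #   spark-submit --deploy-mode client --py-files deps.zip s3://path/to/main.py
    # and main.py imports the helpers normally (import a, import b, ...).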

How do you automate pyspark jobs on emr using boto3 (or otherwise)?

Submitted by 吃可爱长大的小学妹 on 2020-06-24 04:51:08
Question: I am creating a job to parse massive amounts of server data and then upload it into a Redshift database. My job flow is as follows:

1. Grab the log data from S3.
2. Use either Spark DataFrames or Spark SQL to parse the data and write it back out to S3.
3. Upload the data from S3 to Redshift.

I'm getting hung up on how to automate this, though, so that my process spins up an EMR cluster, bootstraps the correct programs for installation, and runs my Python script that will contain the code for parsing and…
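
One common way to automate the whole flow is to have boto3 create the cluster, run the PySpark script as a step, and let EMR tear itself down when the steps finish. The sketch below is a minimal, hedged example of that pattern; the region, instance types, release label, and S3 paths are all placeholders, not values from the question:

    import boto3

    emr = boto3.client("emr", region_name="us-east-1")   # placeholder region

    response = emr.run_job_flow(
        Name="parse-server-logs",
        ReleaseLabel="emr-5.30.0",                        # placeholder release
        Applications=[{"Name": "Spark"}],
        Instances={
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
            ],
            # With no keep-alive, the cluster terminates once all steps finish
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        BootstrapActions=[{
            "Name": "install-deps",
            "ScriptBootstrapAction": {"Path": "s3://my-bucket/bootstrap.sh"},   # hypothetical
        }],
        Steps=[{
            "Name": "parse-and-write",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "--deploy-mode", "cluster",
                         "s3://my-bucket/parse_logs.py"],                       # hypothetical script
            },
        }],
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )
    print("Started cluster", response["JobFlowId"])

The Redshift load can be appended as a further step (or handled afterwards with a COPY command), and the same Steps list can also be attached to an already-running cluster with add_job_flow_steps.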

AWS DAX cluster has zero cache hits and cache miss

Submitted by 淺唱寂寞╮ on 2020-06-17 09:15:15
Question: I'm using an AWS DAX cluster of 3 nodes of the dax.r4.xlarge node type. When I run my Spark application from the EMR cluster, it always fetches values from the DynamoDB table. Even if I run the same application on the same set of keys, it queries the DynamoDB table. In the DAX cluster metrics I see 0 cache hits and misses.

Answer 1: I found the mistake. Initially I was hitting DynamoDB directly and was using consistent reads by defining the get-item input parameter as:

    ConsistentRead: aws.Bool(true)

When I…
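
The snippet in the answer is from the Go SDK; the point it makes is that DAX serves only eventually consistent reads from its item cache, so a strongly consistent GetItem is passed through to DynamoDB and never counts as a hit. A hedged Python illustration of the same flag (table and key are hypothetical, and the plain boto3 client is used here only for brevity; the DAX client exposes the same GetItem parameter):

    import boto3

    dynamodb = boto3.client("dynamodb")
    key = {"pk": {"S": "some-id"}}          # hypothetical table key schema

    # Strongly consistent read: bypasses the DAX item cache entirely
    dynamodb.get_item(TableName="my_table", Key=key, ConsistentRead=True)

    # Eventually consistent read (the default): eligible for DAX caching
    dynamodb.get_item(TableName="my_table", Key=key, ConsistentRead=False)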

Duplicate partition columns on write s3

Submitted by *爱你&永不变心* on 2020-06-12 15:41:55
Question: I'm processing data and writing it to S3 using the following code:

    spark = SparkSession.builder.config('spark.sql.sources.partitionOverwriteMode', 'dynamic').getOrCreate()

    df = spark.read.parquet('s3://<some bucket>/<some path>').filter(F.col('processing_hr') == <val>)
    transformed_df = do_lots_of_transforms(df)

    # here's the important bit on how I'm writing it out
    transformed_df.write.mode('overwrite').partitionBy('processing_hr').parquet('s3://bucket_name/location')

Basically, I'm trying to…
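
Since the title suggests the partition column ends up duplicated in the written paths, one quick diagnostic (a hedged sketch, not from the original post; bucket and prefix are the question's own placeholders) is to list the output keys and count how many processing_hr= segments each one contains:

    import boto3

    s3 = boto3.client("s3")
    bucket, prefix = "bucket_name", "location/"    # placeholders from the question

    suspicious = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            # A healthy layout has exactly one processing_hr=<val> segment per key
            if obj["Key"].count("processing_hr=") > 1:
                suspicious.append(obj["Key"])

    print(f"{len(suspicious)} keys with a duplicated partition segment")
    for key in suspicious[:10]:
        print(" ", key)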

How to terminate AWS EMR Cluster automatically after some time

Submitted by 柔情痞子 on 2020-06-11 01:03:32
Question: I currently have a task at hand to terminate a long-running EMR cluster after a set period of time (based on some metric). Google Dataproc has this capability in something called "Cluster Scheduled Deletion", described here: https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/scheduled-deletion Is this something that is possible on EMR natively? Maybe using CloudWatch metrics? Or can I write a long-running jar which will sit on the EMR master node and just poll YARN for some idle…
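
One way to get this behavior is exactly the CloudWatch route the question mentions: EMR publishes an IsIdle metric per cluster, so a small script, run on a schedule from a Lambda or from the master node, can terminate the cluster once it has been idle long enough. The following is a hedged sketch; the cluster id and idle window are placeholders:

    import boto3
    from datetime import datetime, timedelta

    CLUSTER_ID = "j-XXXXXXXXXXXXX"     # placeholder cluster id
    IDLE_MINUTES = 60                  # terminate after an hour of idleness

    cloudwatch = boto3.client("cloudwatch")
    emr = boto3.client("emr")

    # EMR publishes IsIdle (1 = no running or pending work) every five minutes
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/ElasticMapReduce",
        MetricName="IsIdle",
        Dimensions=[{"Name": "JobFlowId", "Value": CLUSTER_ID}],
        StartTime=datetime.utcnow() - timedelta(minutes=IDLE_MINUTES),
        EndTime=datetime.utcnow(),
        Period=300,
        Statistics=["Minimum"],
    )

    datapoints = stats["Datapoints"]
    # If every datapoint in the window reports idle, shut the cluster down
    if datapoints and all(dp["Minimum"] == 1.0 for dp in datapoints):
        emr.terminate_job_flows(JobFlowIds=[CLUSTER_ID])
        print("Cluster was idle for the whole window; termination requested")
    else:
        print("Cluster still busy (or no datapoints yet); leaving it running")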

ETL 1.5 GB Dataframe within pyspark on AWS EMR

Submitted by 孤人 on 2020-06-01 07:39:21
Question: I'm using an EMR cluster with 1 master (m5.2xlarge) and 4 core nodes (c5.2xlarge) and running a PySpark job on it which joins 5 fact tables (150 columns and 100k rows each) and 5 small dimension tables (10 columns each, with fewer than 100 records). When I join all these tables, the resulting dataframe will have 600 columns and 420k records (approximately 1.5 GB of data). Please suggest something here; I'm from a SQL and DWH background, hence I have used a single SQL query to join all 5 facts…
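
With dimension tables this small (under 100 rows each), the standard Spark tactic is to broadcast them so that only the fact-to-fact joins shuffle data. The sketch below is a hedged outline of that approach, not the asker's actual query; the paths and the join key are hypothetical:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import broadcast

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical input locations for the 5 fact and 5 dimension tables
    facts = [spark.read.parquet(f"s3://my-bucket/facts/fact{i}/") for i in range(1, 6)]
    dims = [spark.read.parquet(f"s3://my-bucket/dims/dim{i}/") for i in range(1, 6)]

    # Fact-to-fact joins: these are the joins that genuinely need a shuffle
    joined = facts[0]
    for fact in facts[1:]:
        joined = joined.join(fact, "join_key")            # hypothetical join key

    # The tiny dimension tables are broadcast to every executor, avoiding shuffles
    for dim in dims:
        joined = joined.join(broadcast(dim), "join_key")  # hypothetical join key

    joined.write.mode("overwrite").parquet("s3://my-bucket/output/")   # placeholder path

Spark will usually broadcast tables this small on its own (spark.sql.autoBroadcastJoinThreshold), but the explicit hint keeps the plan predictable.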

Sqoop import postgres to S3 failing

Submitted by 只谈情不闲聊 on 2020-06-01 07:22:05
Question: I'm currently importing Postgres data to HDFS. I'm planning to move the storage from HDFS to S3. When I try to provide an S3 location, the Sqoop job fails. I'm running it on an EMR (emr-5.27.0) cluster, and I have read/write access to that S3 bucket from all nodes in the cluster.

    sqoop import \
      --connect "jdbc:postgresql://<machine_ip>:<port>/<database>?sslfactory=org.postgresql.ssl.NonValidatingFactory&ssl=true" \
      --username <username> \
      --password-file <password_file_path> \
      --table …
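
Since the same import works against HDFS, a quick sanity check (a hedged sketch, not from the original post; the bucket and key names are placeholders) is whether the instance role on the nodes that run the Sqoop mappers can actually write to the bucket, ideally run from a core node rather than only the master:

    import boto3
    from botocore.exceptions import ClientError

    bucket = "my-target-bucket"                    # placeholder
    key = "sqoop-import-test/_write_check"         # placeholder

    s3 = boto3.client("s3")
    try:
        s3.put_object(Bucket=bucket, Key=key, Body=b"write check")
        s3.delete_object(Bucket=bucket, Key=key)
        print("Instance role can write to the bucket")
    except ClientError as exc:
        # An AccessDenied here points at the EC2 instance profile, not at Sqoop
        print("S3 write failed:", exc.response["Error"]["Code"])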