amazon-emr

Sqoop import postgres to S3 failing

Submitted by 有些话、适合烂在心里 on 2020-06-01 07:21:07
Question: I'm currently importing Postgres data into HDFS, and I'm planning to move the storage from HDFS to S3. When I try to provide an S3 location, the Sqoop job fails. I'm running it on an EMR (emr-5.27.0) cluster, and I have read/write access to that S3 bucket from all nodes in the cluster.

sqoop import \
  --connect "jdbc:postgresql://<machine_ip>:<port>/<database>?sslfactory=org.postgresql.ssl.NonValidatingFactory&ssl=true" \
  --username <username> \
  --password-file <password_file_path> \
  --table
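A minimal sketch of one way to point the same import at S3 rather than HDFS: pass an explicit --target-dir with an s3:// URI (EMRFS). The command above is truncated, so the table name, bucket and prefix below are placeholders rather than values from the question, and the CLI is driven from Python only to keep all examples here in one language.

import subprocess

# Hypothetical Sqoop invocation writing straight to S3 via EMRFS.
cmd = [
    "sqoop", "import",
    "--connect",
    "jdbc:postgresql://<machine_ip>:<port>/<database>"
    "?sslfactory=org.postgresql.ssl.NonValidatingFactory&ssl=true",
    "--username", "<username>",
    "--password-file", "<password_file_path>",
    "--table", "<table>",                                 # placeholder table name
    "--target-dir", "s3://your-bucket/sqoop/<table>/",    # placeholder S3 location
]
subprocess.run(cmd, check=True)  # raises CalledProcessError if the import fails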

Can't apply a pandas_udf in pyspark

Submitted by 我怕爱的太早我们不能终老 on 2020-06-01 07:19:04
Question: I'm trying out some PySpark-related experiments in a Jupyter notebook attached to an AWS EMR instance. I have a Spark dataframe that reads data from S3 and then filters out some data. Printing the schema with df1.printSchema() outputs:

root
 |-- idvalue: string (nullable = true)
 |-- locationaccuracyhorizontal: float (nullable = true)
 |-- hour: integer (nullable = true)
 |-- day: integer (nullable = true)
 |-- date: date (nullable = true)
 |-- is_weekend: boolean (nullable = true)
 |--
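The schema is truncated above, but the visible columns are enough for a minimal, hedged pandas_udf sketch against df1. The new column and the 50 m threshold are illustrative only; on EMR 5.x (Spark 2.4) the scalar pandas_udf style below applies, and pyarrow must be available on every node for pandas_udf to work at all.

import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import BooleanType

@pandas_udf(BooleanType(), PandasUDFType.SCALAR)
def is_precise(accuracy: pd.Series) -> pd.Series:
    # flag rows whose horizontal accuracy is under an arbitrary 50 m threshold
    return accuracy < 50.0

df2 = df1.withColumn("precise_fix", is_precise(df1["locationaccuracyhorizontal"]))
df2.printSchema()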

How to load local resource from a python package loaded in AWS PySpark

Submitted by 余生颓废 on 2020-05-28 11:59:10
Question: I have uploaded a Python package to AWS EMR with PySpark. My Python package has a structure like the following, where I have a resource file (a sklearn joblib model) within the package:

myetllib
├── Dockerfile
├── __init__.py
├── modules
│   ├── bin
│   ├── joblib
│   ├── joblib-0.14.1.dist-info
│   ├── numpy
│   ├── numpy-1.18.4.dist-info
│   ├── numpy.libs
│   ├── scikit_learn-0.21.3.dist-info
│   ├── scipy
│   ├── scipy-1.4.1.dist-info
│   └── sklearn
├── requirements.txt
└── mysubmodule
    ├── __init__.py
    ├──
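A hedged sketch of reading such a bundled resource at runtime: pkgutil.get_data resolves files inside a package whether it sits on disk or inside a zip shipped with --py-files, and joblib.load accepts a file-like object. The resource name model.joblib and its placement directly under myetllib are assumptions, not taken from the (truncated) tree above.

import io
import pkgutil

import joblib

def load_bundled_model(package="myetllib", resource="model.joblib"):
    # Read the raw bytes of the resource from inside the (possibly zipped) package.
    raw = pkgutil.get_data(package, resource)
    if raw is None:
        raise FileNotFoundError(f"{resource} not found inside package {package}")
    return joblib.load(io.BytesIO(raw))

model = load_bundled_model()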

Resource optimization/utilization in EMR for long running job and multiple small running jobs

Submitted by 家住魔仙堡 on 2020-05-16 06:04:11
Question: My use case: we have a long-running Spark job, hereafter called LRJ, which runs once a week. We also have multiple small jobs that can arrive at any time; these have higher priority than the long-running job. To address this, we created YARN queues for resource management: Q1 for the long-running job and Q2 for the small jobs.

Config:
Q1: capacity = 50%, and it can go up to 100%; capacity on CORE nodes = 50% and maximum 100%
Q2:
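A minimal sketch of routing an application to one of those queues once they exist: spark.yarn.queue selects the YARN queue per application, and nothing here alters the capacity-scheduler setup itself. The app name is a placeholder.

from pyspark.sql import SparkSession

# Small, high-priority job goes to Q2; the weekly LRJ would set
# spark.yarn.queue=Q1 instead (e.g. via --conf on spark-submit).
spark = (
    SparkSession.builder
    .appName("small-job")
    .config("spark.yarn.queue", "Q2")
    .getOrCreate()
)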

Install com.databricks.spark.xml on an EMR cluster

Submitted by 情到浓时终转凉″ on 2020-04-30 11:43:29
Question: Does anyone know how to install the com.databricks.spark.xml package on an EMR cluster? I managed to connect to the EMR master node, but I don't know how to install packages on the cluster.

Code:
sc.install_pypi_package("com.databricks.spark.xml")

Answer 1: On the EMR master node:

cd /usr/lib/spark/jars
sudo wget https://repo1.maven.org/maven2/com/databricks/spark-xml_2.11/0.9.0/spark-xml_2.11-0.9.0.jar

Make sure to select the correct jar according to your Spark version and the guidelines provided in
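An alternative, hedged sketch: rather than copying the jar by hand, the same Maven coordinates can be pulled in through spark.jars.packages when the session is created. The S3 path and rowTag value are placeholders; the coordinates mirror the Scala 2.11 / 0.9.0 jar in the answer and must still match your Spark/Scala version.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.jars.packages", "com.databricks:spark-xml_2.11:0.9.0")
    .getOrCreate()
)

df = (
    spark.read.format("com.databricks.spark.xml")
    .option("rowTag", "record")                 # placeholder row element name
    .load("s3://your-bucket/path/data.xml")     # placeholder input path
)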

Yarn queue capacity not working as expected for CORE nodes on EMR (emr-5.26.0)

Submitted by 旧时模样 on 2020-04-18 06:09:20
Question: Use case => create two YARN queues, Q1 and Q2, with the configuration below.

[
  {
    "Classification": "capacity-scheduler",
    "Properties": {
      "yarn.scheduler.capacity.root.queues" : "Q1,Q2",
      "yarn.scheduler.capacity.root.Q1.capacity" : "60",
      "yarn.scheduler.capacity.root.Q2.capacity" : "40",
      "yarn.scheduler.capacity
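For reference, a hedged sketch of passing the same classification programmatically with boto3 run_job_flow. Only the properties visible in the truncated snippet above are reproduced; the region, instance types and counts, roles and cluster name are placeholders.

import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="queue-capacity-test",
    ReleaseLabel="emr-5.26.0",
    Applications=[{"Name": "Spark"}],
    Configurations=[
        {
            "Classification": "capacity-scheduler",
            "Properties": {
                "yarn.scheduler.capacity.root.queues": "Q1,Q2",
                "yarn.scheduler.capacity.root.Q1.capacity": "60",
                "yarn.scheduler.capacity.root.Q2.capacity": "40",
            },
        }
    ],
    Instances={
        "InstanceGroups": [
            {"Name": "master", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "core", "InstanceRole": "CORE",
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])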

AWS: EMR cluster fails with "ERROR UserData: Error encountered while try to get user data" when submitting a Spark job

Submitted by ♀尐吖头ヾ on 2020-04-07 04:00:09
Question: Successfully started an AWS EMR cluster, but any submission fails with:

19/07/30 08:37:42 ERROR UserData: Error encountered while try to get user data
java.io.IOException: File '/var/aws/emr/userData.json' cannot be read
    at com.amazon.ws.emr.hadoop.fs.shaded.org.apache.commons.io.FileUtils.openInputStream(FileUtils.java:296)
    at com.amazon.ws.emr.hadoop.fs.shaded.org.apache.commons.io.FileUtils.readFileToString(FileUtils.java:1711)
    at com.amazon.ws.emr.hadoop.fs.shaded.org.apache.commons.io
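A small, hedged diagnostic sketch for the IOException above: it only reports which user the job runs as and whether that user can read /var/aws/emr/userData.json. It does not change any permissions and is not presented as the fix.

import os
import pwd

path = "/var/aws/emr/userData.json"
st = os.stat(path)
print("running as:", pwd.getpwuid(os.getuid()).pw_name)
print("owner uid/gid:", st.st_uid, st.st_gid, "mode:", oct(st.st_mode))
print("readable by current user:", os.access(path, os.R_OK))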

Error when running Sqoop2 server on Amazon EMR with YARN

Submitted by 梦想的初衷 on 2020-03-26 08:11:23
Question: I'm trying to install Sqoop 2 (version 1.99.3) on an Amazon EMR cluster (AMI version 3.2.0 / Hadoop version 2.4.0). When I start the Sqoop server, I see this error in localhost.log:

Sep 10, 2014 4:55:56 PM org.apache.catalina.core.StandardContext listenerStart
SEVERE: Exception sending context initialized event to listener instance of class org.apache.sqoop.server.ServerInitializer
java.lang.RuntimeException: Failure in server initialization
    at org.apache.sqoop.core.SqoopServer.initialize

Create EMR 5.3.0 with EMRFS (s3 bucket) as storage

Submitted by 假如想象 on 2020-02-25 05:28:12
Question: I'm trying to create an EMR 5.3.0 cluster with EMRFS (an S3 bucket) as storage. Please provide your general guidance on this. Currently I'm using the command below to create EMR 5.3.0 with InstanceType=m4.2xlarge, which works fine, but with EMRFS as storage I'm not able to do it.

aws emr create-cluster --name "DEMAPAUR001" \
  --release-label emr-5.3.0 \
  --service-role EMR_DefaultRole_Private \
  --enable-debug \
  --log-uri 's3n://xyz/trn' \
  --ec2-attributes SubnetId=subnet-545e8823, KeyName=XXX \
  --applications Name
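As context, a hedged sketch of what "EMRFS as storage" usually amounts to inside a job: reading and writing s3:// URIs directly, which EMR's EMRFS connector handles as long as the cluster's IAM role can reach the bucket. The bucket and prefixes below are placeholders.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("emrfs-demo").getOrCreate()

df = spark.read.csv("s3://your-bucket/input/", header=True)     # read via EMRFS
df.write.mode("overwrite").parquet("s3://your-bucket/output/")  # write via EMRFS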

Spark jobs running on an EMR cluster: System.exit(0) used to complete the job gracefully, but the EMR step still fails

Submitted by 雨燕双飞 on 2020-02-24 14:42:18
Question: In a Spark job, I use System.exit(0) if the file is not found, which should complete the job gracefully. Locally it completes successfully, but when I run it on EMR, the step fails.

Answer 1: EMR uses YARN for cluster management and for launching Spark applications. So when you run a Spark app with --deploy-mode cluster on EMR, the Spark application code is not running in a JVM on its own but is executed by the ApplicationMaster class. Browsing through the ApplicationMaster code
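A hedged PySpark analogue of the situation, sketching one way to finish cleanly without calling System.exit(0)/sys.exit(0) in the driver (which, under --deploy-mode cluster, the ApplicationMaster can report as a failed step): catch the missing-input case, stop the session, and let main return normally. The input path is a placeholder.

from pyspark.sql import SparkSession
from pyspark.sql.utils import AnalysisException

spark = SparkSession.builder.appName("graceful-skip").getOrCreate()

input_path = "s3://your-bucket/input/"
try:
    df = spark.read.parquet(input_path)  # raises AnalysisException if the path is missing
except AnalysisException:
    print(f"No input at {input_path}; finishing without work.")
else:
    df.show()
finally:
    spark.stop()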