amazon-emr

Sqoop import postgres to S3 failing

Submitted by 有些话、适合烂在心里 on 2020-06-01 07:21:07
Question: I'm currently importing Postgres data into HDFS, and I'm planning to move the storage from HDFS to S3. When I try to provide an S3 location, the Sqoop job fails. I'm running it on an EMR (emr-5.27.0) cluster, and I have read/write access to that S3 bucket from all nodes in the cluster.

sqoop import \
  --connect "jdbc:postgresql://<machine_ip>:<port>/<database>?sslfactory=org.postgresql.ssl.NonValidatingFactory&ssl=true" \
  --username <username> \
  --password-file <password_file_path> \
  --table
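A minimal sketch of one way to point the same import at S3 rather than HDFS: pass an explicit --target-dir with an s3:// URI (EMRFS). The command above is truncated, so the table name, bucket and prefix below are placeholders rather than values from the question, and the CLI is driven from Python only to keep all examples here in one language.

import subprocess

# Hypothetical Sqoop invocation writing straight to S3 via EMRFS.
cmd = [
    "sqoop", "import",
    "--connect",
    "jdbc:postgresql://<machine_ip>:<port>/<database>"
    "?sslfactory=org.postgresql.ssl.NonValidatingFactory&ssl=true",
    "--username", "<username>",
    "--password-file", "<password_file_path>",
    "--table", "<table>",                                 # placeholder table name
    "--target-dir", "s3://your-bucket/sqoop/<table>/",    # placeholder S3 location
]
subprocess.run(cmd, check=True)  # raises CalledProcessError if the import fails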

Can't apply a pandas_udf in pyspark

Submitted by 我怕爱的太早我们不能终老 on 2020-06-01 07:19:04
Question: I'm trying out some PySpark-related experiments in a Jupyter notebook attached to an AWS EMR instance. I have a Spark dataframe that reads data from S3 and then filters out some data. Printing the schema with df1.printSchema() outputs:

root
 |-- idvalue: string (nullable = true)
 |-- locationaccuracyhorizontal: float (nullable = true)
 |-- hour: integer (nullable = true)
 |-- day: integer (nullable = true)
 |-- date: date (nullable = true)
 |-- is_weekend: boolean (nullable = true)
 |--
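The schema is truncated above, but the visible columns are enough for a minimal, hedged pandas_udf sketch against df1. The new column and the 50 m threshold are illustrative only; on EMR 5.x (Spark 2.4) the scalar pandas_udf style below applies, and pyarrow must be available on every node for pandas_udf to work at all.

import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import BooleanType

@pandas_udf(BooleanType(), PandasUDFType.SCALAR)
def is_precise(accuracy: pd.Series) -> pd.Series:
    # flag rows whose horizontal accuracy is under an arbitrary 50 m threshold
    return accuracy < 50.0

df2 = df1.withColumn("precise_fix", is_precise(df1["locationaccuracyhorizontal"]))
df2.printSchema()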

How to load local resource from a python package loaded in AWS PySpark

Submitted by 余生颓废 on 2020-05-28 11:59:10
Question: I have uploaded a Python package to AWS EMR with PySpark. My Python package has a structure like the following, where I have a resource file (a sklearn joblib model) within the package:

myetllib
├── Dockerfile
├── __init__.py
├── modules
│   ├── bin
│   ├── joblib
│   ├── joblib-0.14.1.dist-info
│   ├── numpy
│   ├── numpy-1.18.4.dist-info
│   ├── numpy.libs
│   ├── scikit_learn-0.21.3.dist-info
│   ├── scipy
│   ├── scipy-1.4.1.dist-info
│   └── sklearn
├── requirements.txt
└── mysubmodule
    ├── __init__.py
    ├──
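A hedged sketch of reading such a bundled resource at runtime: pkgutil.get_data resolves files inside a package whether it sits on disk or inside a zip shipped with --py-files, and joblib.load accepts a file-like object. The resource name model.joblib and its placement directly under myetllib are assumptions, not taken from the (truncated) tree above.

import io
import pkgutil

import joblib

def load_bundled_model(package="myetllib", resource="model.joblib"):
    # Read the raw bytes of the resource from inside the (possibly zipped) package.
    raw = pkgutil.get_data(package, resource)
    if raw is None:
        raise FileNotFoundError(f"{resource} not found inside package {package}")
    return joblib.load(io.BytesIO(raw))

model = load_bundled_model()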

Resource optimization/utilization in EMR for long running job and multiple small running jobs

Submitted by 家住魔仙堡 on 2020-05-16 06:04:11
Question: My use case: we have a long-running Spark job, hereafter called LRJ, which runs once a week. We also have multiple small jobs that can arrive at any time; these have higher priority than the long-running job. To address this, we created YARN queues for resource management: Q1 for the long-running job and Q2 for the small jobs.

Config:
Q1: capacity = 50%, and it can go up to 100%; capacity on CORE nodes = 50% and maximum 100%
Q2:
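A minimal sketch of routing an application to one of those queues once they exist: spark.yarn.queue selects the YARN queue per application, and nothing here alters the capacity-scheduler setup itself. The app name is a placeholder.

from pyspark.sql import SparkSession

# Small, high-priority job goes to Q2; the weekly LRJ would set
# spark.yarn.queue=Q1 instead (e.g. via --conf on spark-submit).
spark = (
    SparkSession.builder
    .appName("small-job")
    .config("spark.yarn.queue", "Q2")
    .getOrCreate()
)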

Install com.databricks.spark.xml on an EMR cluster

Submitted by 情到浓时终转凉″ on 2020-04-30 11:43:29
Question: Does anyone know how to install the com.databricks.spark.xml package on an EMR cluster? I managed to connect to the EMR master node, but I don't know how to install packages on the cluster.

Code:
sc.install_pypi_package("com.databricks.spark.xml")

Answer 1: On the EMR master node:

cd /usr/lib/spark/jars
sudo wget https://repo1.maven.org/maven2/com/databricks/spark-xml_2.11/0.9.0/spark-xml_2.11-0.9.0.jar

Make sure to select the correct jar according to your Spark version and the guidelines provided in
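An alternative, hedged sketch: rather than copying the jar by hand, the same Maven coordinates can be pulled in through spark.jars.packages when the session is created. The S3 path and rowTag value are placeholders; the coordinates mirror the Scala 2.11 / 0.9.0 jar in the answer and must still match your Spark/Scala version.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.jars.packages", "com.databricks:spark-xml_2.11:0.9.0")
    .getOrCreate()
)

df = (
    spark.read.format("com.databricks.spark.xml")
    .option("rowTag", "record")                 # placeholder row element name
    .load("s3://your-bucket/path/data.xml")     # placeholder input path
)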

Yarn queue capacity not working as expected for CORE nodes on EMR (emr-5.26.0)

Submitted by 旧时模样 on 2020-04-18 06:09:20
Question: Use case => create two YARN queues, Q1 and Q2, with the configuration below.

[
  {
    "Classification": "capacity-scheduler",
    "Properties": {
      "yarn.scheduler.capacity.root.queues" : "Q1,Q2",
      "yarn.scheduler.capacity.root.Q1.capacity" : "60",
      "yarn.scheduler.capacity.root.Q2.capacity" : "40",
      "yarn.scheduler.capacity
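For reference, a hedged sketch of passing the same classification programmatically with boto3 run_job_flow. Only the properties visible in the truncated snippet above are reproduced; the region, instance types and counts, roles and cluster name are placeholders.

import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="queue-capacity-test",
    ReleaseLabel="emr-5.26.0",
    Applications=[{"Name": "Spark"}],
    Configurations=[
        {
            "Classification": "capacity-scheduler",
            "Properties": {
                "yarn.scheduler.capacity.root.queues": "Q1,Q2",
                "yarn.scheduler.capacity.root.Q1.capacity": "60",
                "yarn.scheduler.capacity.root.Q2.capacity": "40",
            },
        }
    ],
    Instances={
        "InstanceGroups": [
            {"Name": "master", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "core", "InstanceRole": "CORE",
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])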

AWS: EMR cluster fails with "ERROR UserData: Error encountered while try to get user data" when submitting a Spark job

Submitted by ♀尐吖头ヾ on 2020-04-07 04:00:09
Question: Successfully started an AWS EMR cluster, but any submission fails with:

19/07/30 08:37:42 ERROR UserData: Error encountered while try to get user data
java.io.IOException: File '/var/aws/emr/userData.json' cannot be read
    at com.amazon.ws.emr.hadoop.fs.shaded.org.apache.commons.io.FileUtils.openInputStream(FileUtils.java:296)
    at com.amazon.ws.emr.hadoop.fs.shaded.org.apache.commons.io.FileUtils.readFileToString(FileUtils.java:1711)
    at com.amazon.ws.emr.hadoop.fs.shaded.org.apache.commons.io
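A small, hedged diagnostic sketch for the IOException above: it only reports which user the job runs as and whether that user can read /var/aws/emr/userData.json. It does not change any permissions and is not presented as the fix.

import os
import pwd

path = "/var/aws/emr/userData.json"
st = os.stat(path)
print("running as:", pwd.getpwuid(os.getuid()).pw_name)
print("owner uid/gid:", st.st_uid, st.st_gid, "mode:", oct(st.st_mode))
print("readable by current user:", os.access(path, os.R_OK))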

Error when running Sqoop2 server on Amazon EMR with YARN

Submitted by 梦想的初衷 on 2020-03-26 08:11:23
Question: I'm trying to install Sqoop 2 (version 1.99.3) on an Amazon EMR cluster (AMI version 3.2.0 / Hadoop version 2.4.0). When I start the Sqoop server, I see this error in localhost.log:

Sep 10, 2014 4:55:56 PM org.apache.catalina.core.StandardContext listenerStart
SEVERE: Exception sending context initialized event to listener instance of class org.apache.sqoop.server.ServerInitializer
java.lang.RuntimeException: Failure in server initialization
    at org.apache.sqoop.core.SqoopServer.initialize

Create EMR 5.3.0 with EMRFS (s3 bucket) as storage

Submitted by 假如想象 on 2020-02-25 05:28:12
Question: I'm trying to create an EMR 5.3.0 cluster with EMRFS (an S3 bucket) as storage. Please provide your general guidance on this. Currently I'm using the command below to create EMR 5.3.0 with InstanceType=m4.2xlarge, which works fine, but with EMRFS as storage I'm not able to do it.

aws emr create-cluster --name "DEMAPAUR001" \
  --release-label emr-5.3.0 \
  --service-role EMR_DefaultRole_Private \
  --enable-debug \
  --log-uri 's3n://xyz/trn' \
  --ec2-attributes SubnetId=subnet-545e8823, KeyName=XXX \
  --applications Name
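As context, a hedged sketch of what "EMRFS as storage" usually amounts to inside a job: reading and writing s3:// URIs directly, which EMR's EMRFS connector handles as long as the cluster's IAM role can reach the bucket. The bucket and prefixes below are placeholders.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("emrfs-demo").getOrCreate()

df = spark.read.csv("s3://your-bucket/input/", header=True)     # read via EMRFS
df.write.mode("overwrite").parquet("s3://your-bucket/output/")  # write via EMRFS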

Spark jobs running on an EMR cluster: System.exit(0) used to complete the job gracefully, but the EMR step still fails

Submitted by 雨燕双飞 on 2020-02-24 14:42:18
Question: In a Spark job, I use System.exit(0) if the file is not found, which should complete the job gracefully. Locally it completes successfully, but when I run it on EMR, the step fails.

Answer 1: EMR uses YARN for cluster management and for launching Spark applications. So when you run a Spark app with --deploy-mode cluster on EMR, the Spark application code is not running in a JVM on its own but is executed by the ApplicationMaster class. Browsing through the ApplicationMaster code
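A hedged PySpark analogue of the situation, sketching one way to finish cleanly without calling System.exit(0)/sys.exit(0) in the driver (which, under --deploy-mode cluster, the ApplicationMaster can report as a failed step): catch the missing-input case, stop the session, and let main return normally. The input path is a placeholder.

from pyspark.sql import SparkSession
from pyspark.sql.utils import AnalysisException

spark = SparkSession.builder.appName("graceful-skip").getOrCreate()

input_path = "s3://your-bucket/input/"
try:
    df = spark.read.parquet(input_path)  # raises AnalysisException if the path is missing
except AnalysisException:
    print(f"No input at {input_path}; finishing without work.")
else:
    df.show()
finally:
    spark.stop()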