emr

Configure Zeppelin's Spark Interpreter on EMR when starting a cluster

孤街浪徒 submitted on 2019-12-21 16:17:49

Question: I am creating clusters on EMR and configuring Zeppelin to read its notebooks from S3. To do that I am using a JSON object that looks like this:

    [
      {
        "Classification": "zeppelin-env",
        "Properties": {},
        "Configurations": [
          {
            "Classification": "export",
            "Properties": {
              "ZEPPELIN_NOTEBOOK_STORAGE": "org.apache.zeppelin.notebook.repo.S3NotebookRepo",
              "ZEPPELIN_NOTEBOOK_S3_BUCKET": "hs-zeppelin-notebooks",
              "ZEPPELIN_NOTEBOOK_USER": "user"
            },
            "Configurations": []
          }
        ]
      }
    ]

I am pasting this object in the
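The question is cut off above, but the same classification list can also be supplied programmatically at launch time. A minimal boto3 sketch (untested; the cluster name, release label, instance types, and role names are placeholder assumptions, only the zeppelin-env block comes from the question):

    import boto3

    emr = boto3.client("emr")

    # The classification list from the question, verbatim.
    zeppelin_config = [{
        "Classification": "zeppelin-env",
        "Properties": {},
        "Configurations": [{
            "Classification": "export",
            "Properties": {
                "ZEPPELIN_NOTEBOOK_STORAGE": "org.apache.zeppelin.notebook.repo.S3NotebookRepo",
                "ZEPPELIN_NOTEBOOK_S3_BUCKET": "hs-zeppelin-notebooks",
                "ZEPPELIN_NOTEBOOK_USER": "user",
            },
            "Configurations": [],
        }],
    }]

    emr.run_job_flow(
        Name="zeppelin-cluster",              # placeholder
        ReleaseLabel="emr-5.8.0",             # assumed release
        Applications=[{"Name": "Zeppelin"}],
        Configurations=zeppelin_config,
        Instances={
            "MasterInstanceType": "m4.large", # placeholder
            "SlaveInstanceType": "m4.large",  # placeholder
            "InstanceCount": 3,
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )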

Boto3 EMR - Hive step

百般思念 submitted on 2019-12-21 12:46:36

Question: Is it possible to carry out Hive steps using boto3? I have been doing so using the AWS CLI, but from the docs (http://boto3.readthedocs.org/en/latest/reference/services/emr.html#EMR.Client.add_job_flow_steps), it seems like only jars are accepted. If Hive steps are possible, where are the resources? Thanks

Answer 1: I was able to get this to work using boto3:

    # First create your hive command line arguments
    hive_args = "hive -v -f s3://user/hadoop/hive.hql"

    # Split the hive args to a list
    hive_args
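The answer is truncated above. A sketch of how the rest typically looks (an assumed completion, not the original answer's exact code; the job flow ID is a placeholder). The trick is that every EMR step ultimately runs a jar, so the Hive command is wrapped in command-runner.jar:

    import boto3

    # Build the Hive invocation and split it into an argument list.
    hive_args = "hive -v -f s3://user/hadoop/hive.hql"
    hive_args_list = hive_args.split()

    # Wrap the command in command-runner.jar so it runs as a "jar" step.
    step = {
        "Name": "Run Hive script",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": hive_args_list,
        },
    }

    emr = boto3.client("emr")
    emr.add_job_flow_steps(JobFlowId="j-XXXXXXXXXXXXX", Steps=[step])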

Autoscaling EMR- is it required? Should I just use EC2? Should I just use Qubole?

半腔热情 submitted on 2019-12-21 03:20:49

Question: In order to reduce provisioning time, we've decided to keep up a dedicated EMR cluster with 5 instances (we expect to need about 5). In case we need more, we think we'll need to implement some sort of autoscaling. I'm not familiar with EMR at all; does it support autoscaling? I found this in the docs: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-manage-resize.html Is that the correct place to look for autoscaling, or am I misunderstanding what they mean by
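The linked page describes manually resizing a running cluster. EMR later added native automatic scaling policies on instance groups, which can be attached with boto3's put_auto_scaling_policy. A hedged sketch (assumes an EMR release that supports autoscaling; the cluster and instance-group IDs are placeholders):

    import boto3

    emr = boto3.client("emr")

    # Add one instance when available YARN memory drops below 15%,
    # staying between the 5 baseline instances and a cap of 10.
    emr.put_auto_scaling_policy(
        ClusterId="j-XXXXXXXXXXXXX",        # placeholder
        InstanceGroupId="ig-XXXXXXXXXXXX",  # placeholder
        AutoScalingPolicy={
            "Constraints": {"MinCapacity": 5, "MaxCapacity": 10},
            "Rules": [{
                "Name": "ScaleOutOnLowMemory",
                "Action": {
                    "SimpleScalingPolicyConfiguration": {
                        "AdjustmentType": "CHANGE_IN_CAPACITY",
                        "ScalingAdjustment": 1,
                        "CoolDown": 300,
                    }
                },
                "Trigger": {
                    "CloudWatchAlarmDefinition": {
                        "ComparisonOperator": "LESS_THAN",
                        "EvaluationPeriods": 1,
                        "MetricName": "YARNMemoryAvailablePercentage",
                        "Period": 300,
                        "Threshold": 15.0,
                        "Statistic": "AVERAGE",
                    }
                },
            }],
        },
    )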

Spark Container & Executor OOMs during `reduceByKey`

荒凉一梦 submitted on 2019-12-21 02:03:45

Question: I'm running a Spark job on Amazon's EMR in client mode with YARN, using pyspark, to process data from two input files (totaling about 200 GB in size). The job joins the data together (using reduceByKey), does some maps and filters, and saves the result to S3 in Parquet format. While the job uses DataFrames for saving, all of our actual transformations and actions are performed on RDDs. Note: I've included a detailed rundown of my current configurations and the values with which I've experimented already after
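The configuration details are cut off above. As a generic illustration of the pattern being described, here is a pyspark sketch that raises the shuffle partition count for reduceByKey, one of the usual levers against container and executor OOMs (all paths, values, and the toy key extraction are illustrative assumptions, not the asker's code):

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .config("spark.yarn.executor.memoryOverhead", "4096")  # MB; pre-Spark-2.3 name
        .getOrCreate()
    )
    sc = spark.sparkContext

    a = sc.textFile("s3://bucket/input_a/")  # placeholder inputs
    b = sc.textFile("s3://bucket/input_b/")

    # Toy key extraction; stands in for the real parsing logic.
    pairs = a.union(b).map(lambda line: (line.split("\t")[0], 1))

    # More partitions keep each shuffle block small, which is what
    # usually prevents OOMs inside a large reduceByKey.
    counts = pairs.reduceByKey(lambda x, y: x + y, numPartitions=2000)

    # Save via a DataFrame, as the question describes.
    spark.createDataFrame(counts, ["key", "count"]).write.parquet("s3://bucket/output/")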

How to set up Zeppelin to work with remote EMR Yarn cluster

…衆ロ難τιáo~ submitted on 2019-12-21 02:01:07

Question: I have an Amazon EMR Hadoop v2.6 cluster with Spark 1.4.1, using Yarn as the resource manager. I want to deploy Zeppelin on a separate machine so the EMR cluster can be turned off when no jobs are running. I tried following the instructions here, https://zeppelin.incubator.apache.org/docs/install/yarn_install.html, without much success. Can somebody demystify the steps for connecting Zeppelin to an existing Yarn cluster from a different machine?

Answer 1: [1] Install Zeppelin with the proper params: git clone https:
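The answer is truncated at the clone URL. The remaining steps usually boil down to building Zeppelin against matching Spark/Hadoop versions and pointing it at the cluster's Hadoop configuration. A hedged sketch of the relevant conf/zeppelin-env.sh entries (all paths are assumptions; the core-site.xml and yarn-site.xml files must be copied from the EMR master to the Zeppelin machine):

    # conf/zeppelin-env.sh on the separate Zeppelin machine
    export HADOOP_CONF_DIR=/etc/hadoop/conf   # holds the *-site.xml files copied from EMR
    export MASTER=yarn-client                 # Spark 1.x style YARN master
    export SPARK_HOME=/opt/spark              # assumed local Spark 1.4.1 install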

YARN: What is the difference between number-of-executors and executor-cores in Spark?

∥☆過路亽.° submitted on 2019-12-20 20:11:32

Question: I am learning Spark on AWS EMR. In the process I am trying to understand the difference between the number of executors (--num-executors) and executor cores (--executor-cores). Can anyone explain? Also, when I try to submit the following job, I get an error:

    spark-submit --deploy-mode cluster --master yarn --num-executors 1 --executor-cores 5 --executor-memory 1g -–conf spark.yarn.submit.waitAppCompletion=false wordcount.py s3://test/spark-example/input/input.txt s3://test
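For the conceptual part: --num-executors sets how many executor JVMs YARN launches for the application, and --executor-cores sets how many tasks each of those JVMs runs concurrently, so the maximum number of parallel tasks is their product. An illustrative invocation (numbers and the output path are placeholders, not from the question):

    # 4 executors x 5 cores each = up to 20 tasks running at once
    spark-submit --deploy-mode cluster --master yarn \
      --num-executors 4 --executor-cores 5 --executor-memory 1g \
      --conf spark.yarn.submit.waitAppCompletion=false \
      wordcount.py s3://test/spark-example/input/input.txt s3://output-bucket/out/

As for the submission error: one thing worth double checking in the command as quoted is the dash before conf. It appears as "-–conf" (a hyphen plus an en dash) rather than "--conf" (two hyphens), and spark-submit will reject the former; whether that is the asker's typo or a copy-paste artifact is unclear from the question.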

Any Scala SDK or interface for AWS?

岁酱吖の submitted on 2019-12-20 17:36:22

Question: Does anyone know of a Scala SDK for Amazon Web Services? I am particularly interested in EMR jobs.

Answer 1: Take a look at AWScala (a simple wrapper on top of the AWS SDK for Java): https://github.com/seratch/AWScala [UPDATE from 04/07/2015]: Another very promising library, from @dwhjames: Asynchronous Scala Clients for Amazon Web Services, https://dwhjames.github.io/aws-wrap/

Answer 2: You could use the standard Java SDK directly from Scala without any problems; however, I'm not aware of any Scala

SQL query in Spark/scala Size exceeds Integer.MAX_VALUE

て烟熏妆下的殇ゞ submitted on 2019-12-20 08:19:18

Question: I am trying to run a simple SQL query over S3 events using Spark. I am loading ~30 GB of JSON files as follows:

    val d2 = spark.read.json("s3n://myData/2017/02/01/1234");
    d2.persist(org.apache.spark.storage.StorageLevel.MEMORY_AND_DISK);
    d2.registerTempTable("d2");

Then I am trying to write the result of my query to a file:

    val users_count = sql("select count(distinct data.user_id) from d2");
    users_count.write.format("com.databricks.spark.csv").option("header", "true").save("s3n://myfolder
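The entry ends before any answer. For context, "Size exceeds Integer.MAX_VALUE" almost always means a single cached or shuffle block grew past Spark's 2 GB ByteBuffer limit. A pyspark rendering of the same pipeline that repartitions before persisting (the question's code is Scala; the partition count, the s3a scheme, and the output path here are illustrative assumptions):

    from pyspark import StorageLevel
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Repartitioning keeps each cached block well under the 2 GB limit.
    d2 = spark.read.json("s3a://myData/2017/02/01/1234").repartition(256)
    d2.persist(StorageLevel.MEMORY_AND_DISK)
    d2.createOrReplaceTempView("d2")

    users_count = spark.sql("select count(distinct data.user_id) from d2")
    users_count.write.option("header", "true").csv("s3a://output-bucket/out/")  # placeholder output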

Parquet Data timestamp columns INT96 not yet implemented in Druid Overlord Hadoop task

半城伤御伤魂 submitted on 2019-12-20 03:43:25

Question: Context: I am able to submit a MapReduce job from the Druid overlord to an EMR cluster. My data source is in S3 in Parquet format. I have a timestamp column (INT96) in the Parquet data, which is not supported by the Avro schema. The error occurs while parsing the timestamp. The stack trace is:

    Error: java.lang.IllegalArgumentException: INT96 not yet implemented.
        at org.apache.parquet.avro.AvroSchemaConverter$1.convertINT96(AvroSchemaConverter.java:279)
        at org.apache.parquet.avro.AvroSchemaConverter$1.convertINT96
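The trace comes from parquet-avro, which cannot convert Spark/Hive-style INT96 timestamps. One possible workaround (an assumption on my part, not from the question) is to rewrite the Parquet data with INT64 timestamps before ingesting it into Druid, which Spark 2.3+ can do with a single config setting:

    from pyspark.sql import SparkSession

    # Write timestamps as INT64 (TIMESTAMP_MICROS) instead of legacy INT96.
    spark = (
        SparkSession.builder
        .config("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")
        .getOrCreate()
    )

    df = spark.read.parquet("s3://bucket/source/")       # placeholder paths
    df.write.parquet("s3://bucket/rewritten-int64/")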