amazon-emr

DataType interval is not supported - Spark SQL

Submitted by 放肆的年华 on 2019-12-24 18:59:28

Question: I am running a query on AWS EMR, and it errors out on this line:

```sql
to_date('1970-01-01', 'YYYY-MM-DD') + CAST(concat(mycolumn, ' seconds') AS INTERVAL) AS date_col
```

The error: DataType interval is not supported.(line 521, pos 82). Can someone help me with this?

Answer 1: I think Spark supports the interval keyword. It would be used as:

```sql
to_date('1970-01-01', 'YYYY-MM-DD') + mycolumn * interval '1 second' AS date_col
```

Source: https://stackoverflow.com/questions/59366658/datatype-interval-is-not
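A minimal PySpark sketch of the answer's approach, not the asker's exact job: the view name `t` and the epoch-seconds column `mycolumn` are stand-ins from the question. It uses to_timestamp rather than the answer's to_date so the seconds offset survives (some Spark versions reject adding a sub-day interval to a DATE), and multiplying a column by an interval literal requires a reasonably recent Spark.

```python
from pyspark.sql import SparkSession

# Build a tiny table of epoch-second offsets to stand in for `mycolumn`.
spark = SparkSession.builder.appName("interval-demo").getOrCreate()
spark.createDataFrame([(0,), (86400,), (1577836800,)], ["mycolumn"]) \
    .createOrReplaceTempView("t")

# Epoch base as a timestamp, plus N * one-second intervals.
spark.sql("""
    SELECT to_timestamp('1970-01-01') + mycolumn * interval '1 second' AS date_col
    FROM t
""").show(truncate=False)
```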

AWS EMR Spark: Error writing to S3 - IllegalArgumentException - Cannot create a path from an empty string

Submitted by 安稳与你 on 2019-12-24 07:15:08

Question: I have been trying to fix this for a long time now and have no idea why I get it. FYI, I'm running Spark on an AWS EMR cluster. I debugged and can clearly see the destination path provided, something like s3://my-bucket-name/. The Spark job creates ORC files and writes them after creating a partition, like so: date=2017-06-10. Any ideas?

```
17/07/08 22:48:31 ERROR ApplicationMaster: User class threw exception: java.lang.IllegalArgumentException: Can not create a Path from an empty string
```
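The entry cuts off before any answer. Below is a hypothetical reconstruction of the shape of the write being described, together with the workaround most often reported for this exact exception: writing under a key prefix instead of the bare bucket root. Bucket, prefix, and column names are placeholders.

```python
from pyspark.sql import SparkSession

# Stand-in DataFrame with the partition column from the question.
spark = SparkSession.builder.appName("orc-write-demo").getOrCreate()
df = spark.createDataFrame([("2017-06-10", 1)], ["date", "value"])

# Targeting "s3://my-bucket-name/" (no key prefix) is a commonly reported
# trigger for "Can not create a Path from an empty string"; writing under
# a prefix, as below, usually avoids it.
df.write.partitionBy("date").orc("s3://my-bucket-name/output/")
```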

AWS EMR bootstrap action as sudo

Submitted by 做~自己de王妃 on 2019-12-23 01:35:45

Question: I need to update /etc/hosts on all instances in my EMR cluster (EMR AMI 4.3). The whole script is nothing more than:

```bash
#!/bin/bash
echo -e 'ip1 uri1' >> /etc/hosts
echo -e 'ip2 uri2' >> /etc/hosts
...
```

This script needs to run as sudo or it fails. From https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-bootstrap.html#bootstrapUses:

"Bootstrap actions execute as the Hadoop user by default. You can execute a bootstrap action with root privileges by using sudo."

Great news... but
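The question cuts off before the actual problem. Two notes, hedged: the hadoop user on EMR instances can typically run sudo without a password, and a plain `sudo echo ... >> /etc/hosts` still fails because the redirect runs unprivileged, so inside the script the usual form is `echo 'ip1 uri1' | sudo tee -a /etc/hosts`. For reference, a boto3 sketch of registering such a script as a bootstrap action; the region, S3 path, roles, and instance settings are all placeholders.

```python
import boto3

# Attach the hosts-update script (uploaded to S3 beforehand) as a
# bootstrap action at cluster creation.
emr = boto3.client("emr", region_name="us-east-1")
emr.run_job_flow(
    Name="cluster-with-hosts-bootstrap",
    ReleaseLabel="emr-4.3.0",
    Instances={
        "MasterInstanceType": "m4.large",
        "SlaveInstanceType": "m4.large",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    BootstrapActions=[
        {
            "Name": "update-etc-hosts",
            "ScriptBootstrapAction": {"Path": "s3://my-bucket/update_hosts.sh"},
        }
    ],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
```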

Simple RDD write to DynamoDB in Spark

Submitted by 巧了我就是萌 on 2019-12-22 08:52:28

Question: I just got stuck trying to import a basic RDD dataset into DynamoDB. This is the code:

```scala
import org.apache.hadoop.mapred.JobConf

var rdd = sc.parallelize(Array(("", Map("col1" -> Map("s" -> "abc"), "col2" -> Map("n" -> "123")))))
var jobConf = new JobConf(sc.hadoopConfiguration)
jobConf.set("dynamodb.output.tableName", "table_x")
jobConf.set("mapred.output.format.class", "org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat")
rdd.saveAsHadoopDataset(jobConf)
```

And this is the error I get: 16/02

How to set spark.driver.memory for Spark/Zeppelin on EMR

Submitted by £可爱£侵袭症+ on 2019-12-22 07:00:03

Question: When using EMR (with Spark and Zeppelin), changing spark.driver.memory in Zeppelin's Spark interpreter settings doesn't work. What is the best and quickest way to set the Spark driver memory when using the EMR web interface (not the AWS CLI) to create clusters? Could a bootstrap action be a solution? If yes, can you please provide an example of what the bootstrap action file should look like?

Answer 1: You can always try to add the following configuration at job flow/cluster creation: [ { "Classification":
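The answer's JSON is truncated, but presumably it continued along the standard lines documented for EMR: a "spark-defaults" classification carrying spark.driver.memory. A sketch of that shape, with an illustrative memory value; in the console, the same JSON can be pasted into the "Edit software settings" box under advanced options when creating the cluster.

```python
# Standard EMR configuration classification for Spark defaults; the
# 8g value is illustrative, not from the original answer.
configurations = [
    {
        "Classification": "spark-defaults",
        "Properties": {"spark.driver.memory": "8g"},
    }
]
# Programmatically, this is passed as Configurations=configurations to
# boto3's emr.run_job_flow(...) — see the bootstrap-action sketch above.
```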

Automatic AWS DynamoDB to S3 export failing with “role/DataPipelineDefaultRole is invalid”

Submitted by こ雲淡風輕ζ on 2019-12-22 04:01:26

Question: Precisely following the step-by-step instructions on this page, I am trying to export the contents of one of my DynamoDB tables to an S3 bucket. I create a pipeline exactly as instructed, but it fails to run. It seems to have trouble identifying/running an EC2 resource to do the export. When I access EMR through the AWS Console, I see entries like this:

```
Cluster: df-0..._@EmrClusterForBackup_2015-03-06T00:33:04
Terminated with errors: EMR service role arn:aws:iam::...:role/DataPipelineDefaultRole is
```

Does Hive have something equivalent to DUAL?

Submitted by 不羁岁月 on 2019-12-22 01:27:31

Question: I'd like to run statements like:

```sql
SELECT date_add('2008-12-31', 1) FROM DUAL
```

Does Hive (running on Amazon EMR) have something similar?

Answer 1: Not yet: https://issues.apache.org/jira/browse/HIVE-1558

Answer 2: The best solution is not to mention a table name at all. `select 1+1;` gives the result 2. But poor Hive needs to spawn a MapReduce job to find this!

Answer 3: To create a DUAL-like table in Hive with one column and one row, you can do the following: create table dual (x int); insert into table dual select count(

Configure Zeppelin's Spark Interpreter on EMR when starting a cluster

Submitted by 孤街浪徒 on 2019-12-21 16:17:49

Question: I am creating clusters on EMR and configuring Zeppelin to read its notebooks from S3. To do that, I am using a JSON object that looks like this:

```json
[
  {
    "Classification": "zeppelin-env",
    "Properties": {},
    "Configurations": [
      {
        "Classification": "export",
        "Properties": {
          "ZEPPELIN_NOTEBOOK_STORAGE": "org.apache.zeppelin.notebook.repo.S3NotebookRepo",
          "ZEPPELIN_NOTEBOOK_S3_BUCKET": "hs-zeppelin-notebooks",
          "ZEPPELIN_NOTEBOOK_USER": "user"
        },
        "Configurations": []
      }
    ]
  }
]
```

I am pasting this object in the
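The question cuts off where it says where the object is pasted. For reference, a boto3 sketch of one way to supply exactly this structure at cluster creation; the EMR console's "Edit software settings" box accepts the same JSON. The file name and all cluster parameters here are placeholders.

```python
import json

import boto3

# Load the zeppelin-env JSON from the question, saved to a local file,
# so it can be passed to the EMR API unchanged.
with open("zeppelin_env.json") as f:
    zeppelin_env = json.load(f)

emr = boto3.client("emr", region_name="us-east-1")
emr.run_job_flow(
    Name="zeppelin-s3-notebooks",
    ReleaseLabel="emr-5.0.0",
    Applications=[{"Name": "Spark"}, {"Name": "Zeppelin"}],
    Configurations=zeppelin_env,
    Instances={
        "MasterInstanceType": "m4.large",
        "InstanceCount": 1,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
```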

Running steps of EMR in parallel

Submitted by 不想你离开。 on 2019-12-21 13:01:22

Question: I am running a Spark job on an EMR cluster. The issue I am facing is that all the EMR jobs triggered execute as steps (in a queue). Is there any way to make them run in parallel? If not, is there any alternative?

Answer 1: Elastic MapReduce comes by default with a YARN setup that is very "step" oriented, with a single CapacityScheduler queue assigned 100% of the cluster resources. Because of this configuration, any time you submit a job to an EMR cluster, YARN maximizes the cluster usage for that
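The quoted answer stops mid-sentence. To illustrate the point it is making, here is a sketch of an EMR configuration that splits the single default queue in two, so resources can be held by two jobs at once; the property names follow the Hadoop CapacityScheduler docs, while the queue name and capacity split are made up for the example. A job then has to target the extra queue, e.g. spark-submit --queue parallel.

```python
# Illustrative capacity-scheduler classification: two YARN queues instead
# of the single 100% default queue that EMR ships with.
configurations = [
    {
        "Classification": "capacity-scheduler",
        "Properties": {
            "yarn.scheduler.capacity.root.queues": "default,parallel",
            "yarn.scheduler.capacity.root.default.capacity": "50",
            "yarn.scheduler.capacity.root.parallel.capacity": "50",
        },
    }
]
# Passed as Configurations=configurations at cluster creation (see the
# boto3 run_job_flow sketches in the entries above).
```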

Emrfs file sync with s3 not working

Submitted by 不羁的心 on 2019-12-21 07:56:46

Question: After running a Spark job on an Amazon EMR cluster, I deleted the output files directly from S3 and tried to rerun the job. I received the following error when trying to write in Parquet format to S3 using sqlContext.write:

```
'bucket/folder' present in the metadata but not s3
at com.amazon.ws.emr.hadoop.fs.consistency.ConsistencyCheckerS3FileSystem.getFileStatus(ConsistencyCheckerS3FileSystem.java:455)
```

(Deleting objects directly from S3 bypasses the EMRFS consistent-view metadata, so the metadata still lists files that no longer exist.) I tried running emrfs sync s3://bucket/folder, which did not appear to resolve the