amazon-emr

DataType interval is not supported - Spark SQL

Submitted by 放肆的年华 on 2019-12-24 18:59:28

Question: I am running a query on AWS EMR, and it errors out on this line:

```sql
to_date('1970-01-01', 'YYYY-MM-DD') + CAST(concat(mycolumn, ' seconds') AS INTERVAL) AS date_col
```

The error: DataType interval is not supported.(line 521, pos 82). Can someone help me with this?

Answer 1: I think Spark supports the interval keyword. It would be used as:

```sql
to_date('1970-01-01', 'YYYY-MM-DD') + mycolumn * interval '1 second' AS date_col
```

Source: https://stackoverflow.com/questions/59366658/datatype-interval-is-not
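A minimal PySpark sketch of the answer's approach, not the asker's exact job: the view name `t` and the epoch-seconds column `mycolumn` are stand-ins from the question. It uses to_timestamp rather than the answer's to_date so the seconds offset survives (some Spark versions reject adding a sub-day interval to a DATE), and multiplying a column by an interval literal requires a reasonably recent Spark.

```python
from pyspark.sql import SparkSession

# Build a tiny table of epoch-second offsets to stand in for `mycolumn`.
spark = SparkSession.builder.appName("interval-demo").getOrCreate()
spark.createDataFrame([(0,), (86400,), (1577836800,)], ["mycolumn"]) \
    .createOrReplaceTempView("t")

# Epoch base as a timestamp, plus N * one-second intervals.
spark.sql("""
    SELECT to_timestamp('1970-01-01') + mycolumn * interval '1 second' AS date_col
    FROM t
""").show(truncate=False)
```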

AWS EMR Spark: Error writing to S3 - IllegalArgumentException - Cannot create a path from an empty string

Submitted by 安稳与你 on 2019-12-24 07:15:08

Question: I have been trying to fix this for a long time now and have no idea why I get it. FYI, I'm running Spark on an AWS EMR cluster. I debugged and can clearly see the destination path provided, something like s3://my-bucket-name/. The Spark job creates ORC files and writes them after creating a partition, like so: date=2017-06-10. Any ideas?

```
17/07/08 22:48:31 ERROR ApplicationMaster: User class threw exception: java.lang.IllegalArgumentException: Can not create a Path from an empty string
```
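The entry cuts off before any answer. Below is a hypothetical reconstruction of the shape of the write being described, together with the workaround most often reported for this exact exception: writing under a key prefix instead of the bare bucket root. Bucket, prefix, and column names are placeholders.

```python
from pyspark.sql import SparkSession

# Stand-in DataFrame with the partition column from the question.
spark = SparkSession.builder.appName("orc-write-demo").getOrCreate()
df = spark.createDataFrame([("2017-06-10", 1)], ["date", "value"])

# Targeting "s3://my-bucket-name/" (no key prefix) is a commonly reported
# trigger for "Can not create a Path from an empty string"; writing under
# a prefix, as below, usually avoids it.
df.write.partitionBy("date").orc("s3://my-bucket-name/output/")
```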

AWS EMR bootstrap action as sudo

Submitted by 做~自己de王妃 on 2019-12-23 01:35:45

Question: I need to update /etc/hosts on all instances in my EMR cluster (EMR AMI 4.3). The whole script is nothing more than:

```bash
#!/bin/bash
echo -e 'ip1 uri1' >> /etc/hosts
echo -e 'ip2 uri2' >> /etc/hosts
...
```

This script needs to run as sudo or it fails. From https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-bootstrap.html#bootstrapUses:

"Bootstrap actions execute as the Hadoop user by default. You can execute a bootstrap action with root privileges by using sudo."

Great news... but
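The question cuts off before the actual problem. Two notes, hedged: the hadoop user on EMR instances can typically run sudo without a password, and a plain `sudo echo ... >> /etc/hosts` still fails because the redirect runs unprivileged, so inside the script the usual form is `echo 'ip1 uri1' | sudo tee -a /etc/hosts`. For reference, a boto3 sketch of registering such a script as a bootstrap action; the region, S3 path, roles, and instance settings are all placeholders.

```python
import boto3

# Attach the hosts-update script (uploaded to S3 beforehand) as a
# bootstrap action at cluster creation.
emr = boto3.client("emr", region_name="us-east-1")
emr.run_job_flow(
    Name="cluster-with-hosts-bootstrap",
    ReleaseLabel="emr-4.3.0",
    Instances={
        "MasterInstanceType": "m4.large",
        "SlaveInstanceType": "m4.large",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    BootstrapActions=[
        {
            "Name": "update-etc-hosts",
            "ScriptBootstrapAction": {"Path": "s3://my-bucket/update_hosts.sh"},
        }
    ],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
```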

Simple RDD write to DynamoDB in Spark

Submitted by 巧了我就是萌 on 2019-12-22 08:52:28

Question: I just got stuck trying to import a basic RDD dataset into DynamoDB. This is the code:

```scala
import org.apache.hadoop.mapred.JobConf

var rdd = sc.parallelize(Array(("", Map("col1" -> Map("s" -> "abc"), "col2" -> Map("n" -> "123")))))
var jobConf = new JobConf(sc.hadoopConfiguration)
jobConf.set("dynamodb.output.tableName", "table_x")
jobConf.set("mapred.output.format.class", "org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat")
rdd.saveAsHadoopDataset(jobConf)
```

And this is the error I get: 16/02

How to set spark.driver.memory for Spark/Zeppelin on EMR

Submitted by £可爱£侵袭症+ on 2019-12-22 07:00:03

Question: When using EMR (with Spark and Zeppelin), changing spark.driver.memory in Zeppelin's Spark interpreter settings doesn't work. What is the best and quickest way to set the Spark driver memory when using the EMR web interface (not the AWS CLI) to create clusters? Could a bootstrap action be a solution? If yes, can you please provide an example of what the bootstrap action file should look like?

Answer 1: You can always try to add the following configuration at job flow/cluster creation: [ { "Classification":
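The answer's JSON is truncated, but presumably it continued along the standard lines documented for EMR: a "spark-defaults" classification carrying spark.driver.memory. A sketch of that shape, with an illustrative memory value; in the console, the same JSON can be pasted into the "Edit software settings" box under advanced options when creating the cluster.

```python
# Standard EMR configuration classification for Spark defaults; the
# 8g value is illustrative, not from the original answer.
configurations = [
    {
        "Classification": "spark-defaults",
        "Properties": {"spark.driver.memory": "8g"},
    }
]
# Programmatically, this is passed as Configurations=configurations to
# boto3's emr.run_job_flow(...) — see the bootstrap-action sketch above.
```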

Automatic AWS DynamoDB to S3 export failing with “role/DataPipelineDefaultRole is invalid”

Submitted by こ雲淡風輕ζ on 2019-12-22 04:01:26

Question: Precisely following the step-by-step instructions on this page, I am trying to export the contents of one of my DynamoDB tables to an S3 bucket. I create a pipeline exactly as instructed, but it fails to run. It seems to have trouble identifying/running an EC2 resource to do the export. When I access EMR through the AWS Console, I see entries like this:

```
Cluster: df-0..._@EmrClusterForBackup_2015-03-06T00:33:04
Terminated with errors: EMR service role arn:aws:iam::...:role/DataPipelineDefaultRole is
```

Does Hive have something equivalent to DUAL?

Submitted by 不羁岁月 on 2019-12-22 01:27:31

Question: I'd like to run statements like:

```sql
SELECT date_add('2008-12-31', 1) FROM DUAL
```

Does Hive (running on Amazon EMR) have something similar?

Answer 1: Not yet: https://issues.apache.org/jira/browse/HIVE-1558

Answer 2: The best solution is not to mention a table name at all. `select 1+1;` gives the result 2. But poor Hive needs to spawn a MapReduce job to find this!

Answer 3: To create a DUAL-like table in Hive with one column and one row, you can do the following: create table dual (x int); insert into table dual select count(

Configure Zeppelin's Spark Interpreter on EMR when starting a cluster

Submitted by 孤街浪徒 on 2019-12-21 16:17:49

Question: I am creating clusters on EMR and configuring Zeppelin to read its notebooks from S3. To do that, I am using a JSON object that looks like this:

```json
[
  {
    "Classification": "zeppelin-env",
    "Properties": {},
    "Configurations": [
      {
        "Classification": "export",
        "Properties": {
          "ZEPPELIN_NOTEBOOK_STORAGE": "org.apache.zeppelin.notebook.repo.S3NotebookRepo",
          "ZEPPELIN_NOTEBOOK_S3_BUCKET": "hs-zeppelin-notebooks",
          "ZEPPELIN_NOTEBOOK_USER": "user"
        },
        "Configurations": []
      }
    ]
  }
]
```

I am pasting this object in the
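The question cuts off where it says where the object is pasted. For reference, a boto3 sketch of one way to supply exactly this structure at cluster creation; the EMR console's "Edit software settings" box accepts the same JSON. The file name and all cluster parameters here are placeholders.

```python
import json

import boto3

# Load the zeppelin-env JSON from the question, saved to a local file,
# so it can be passed to the EMR API unchanged.
with open("zeppelin_env.json") as f:
    zeppelin_env = json.load(f)

emr = boto3.client("emr", region_name="us-east-1")
emr.run_job_flow(
    Name="zeppelin-s3-notebooks",
    ReleaseLabel="emr-5.0.0",
    Applications=[{"Name": "Spark"}, {"Name": "Zeppelin"}],
    Configurations=zeppelin_env,
    Instances={
        "MasterInstanceType": "m4.large",
        "InstanceCount": 1,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
```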

Running steps of EMR in parallel

Submitted by 不想你离开。 on 2019-12-21 13:01:22

Question: I am running a Spark job on an EMR cluster. The issue I am facing is that all the EMR jobs triggered execute as steps (in a queue). Is there any way to make them run in parallel? If not, is there any alternative?

Answer 1: Elastic MapReduce comes by default with a YARN setup that is very "step" oriented, with a single CapacityScheduler queue assigned 100% of the cluster resources. Because of this configuration, any time you submit a job to an EMR cluster, YARN maximizes the cluster usage for that
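The quoted answer stops mid-sentence. To illustrate the point it is making, here is a sketch of an EMR configuration that splits the single default queue in two, so resources can be held by two jobs at once; the property names follow the Hadoop CapacityScheduler docs, while the queue name and capacity split are made up for the example. A job then has to target the extra queue, e.g. spark-submit --queue parallel.

```python
# Illustrative capacity-scheduler classification: two YARN queues instead
# of the single 100% default queue that EMR ships with.
configurations = [
    {
        "Classification": "capacity-scheduler",
        "Properties": {
            "yarn.scheduler.capacity.root.queues": "default,parallel",
            "yarn.scheduler.capacity.root.default.capacity": "50",
            "yarn.scheduler.capacity.root.parallel.capacity": "50",
        },
    }
]
# Passed as Configurations=configurations at cluster creation (see the
# boto3 run_job_flow sketches in the entries above).
```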

Emrfs file sync with s3 not working

Submitted by 不羁的心 on 2019-12-21 07:56:46

Question: After running a Spark job on an Amazon EMR cluster, I deleted the output files directly from S3 and tried to rerun the job. I received the following error when trying to write in Parquet format to S3 using sqlContext.write:

```
'bucket/folder' present in the metadata but not s3
at com.amazon.ws.emr.hadoop.fs.consistency.ConsistencyCheckerS3FileSystem.getFileStatus(ConsistencyCheckerS3FileSystem.java:455)
```

(Deleting objects directly from S3 bypasses the EMRFS consistent-view metadata, so the metadata still lists files that no longer exist.) I tried running emrfs sync s3://bucket/folder, which did not appear to resolve the