amazon-emr

Folder won't delete on Amazon S3

Submitted by 馋奶兔 on 2019-12-21 03:37:40
Question: I'm trying to delete a folder created as a result of a MapReduce job. Other files in the bucket delete just fine, but this folder won't delete. When I try to delete it from the console, the progress bar next to its status just stays at 0. I have made multiple attempts, including logging out and back in between them. Answer 1: First and foremost, Amazon S3 doesn't actually have a native concept of folders/directories; rather, it is a flat storage architecture comprised of buckets and objects/keys only. The…
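Because S3 is flat, deleting a "folder" really means deleting every object that shares its key prefix. A minimal sketch of that loop using the AWS SDK for Java v1 from Scala, with a hypothetical bucket and prefix (a real job would also page through truncated listings, since listObjectsV2 returns at most 1000 keys per call):

    import com.amazonaws.services.s3.AmazonS3ClientBuilder
    import scala.collection.JavaConverters._

    object DeletePrefix extends App {
      val s3 = AmazonS3ClientBuilder.defaultClient()
      // List the objects under the "folder" prefix and delete each key;
      // once the last key is gone, the folder disappears from the console.
      val listing = s3.listObjectsV2("my-bucket", "job-output/")
      listing.getObjectSummaries.asScala.foreach { summary =>
        s3.deleteObject("my-bucket", summary.getKey)
      }
    }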

Any Scala SDK or interface for AWS?

Submitted by 岁酱吖の on 2019-12-20 17:36:22
Question: Does anyone know of a Scala SDK for Amazon Web Services? I am particularly interested in EMR jobs. Answer 1: Take a look at AWScala (it's a simple wrapper on top of the AWS SDK for Java): https://github.com/seratch/AWScala [UPDATE from 04/07/2015]: Another very promising library from @dwhjames: Asynchronous Scala Clients for Amazon Web Services, https://dwhjames.github.io/aws-wrap/ Answer 2: You could use the standard Java SDK directly from Scala without any problems; however, I'm not aware of any Scala…
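For a feel of the AWScala API, here is a minimal sketch along the lines of its README (the region, bucket name, and file are placeholders):

    import awscala._, s3._

    object AwscalaDemo extends App {
      implicit val s3 = S3.at(Region.Tokyo)

      // Buckets and objects come back as plain Scala values.
      val buckets: Seq[Bucket] = s3.buckets
      val bucket: Bucket = s3.createBucket("my-unique-bucket-name")
      bucket.put("sample.txt", new java.io.File("sample.txt"))
    }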

Can we add more Amazon Elastic MapReduce instances to an existing Amazon Elastic MapReduce cluster?

Submitted by 元气小坏坏 on 2019-12-20 17:26:05
Question: I am new to Amazon Web Services and facing some issues. Suppose I am running a job flow on Amazon Elastic MapReduce with a total of 3 instances. While running the job flow, I found that my job is taking a long time to execute, so I need to add more instances to make it finish faster. My question is: how do I add instances to an existing cluster? If we terminate the existing instances and create new ones…
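You do not have to terminate anything: a running cluster can be resized in place by modifying one of its instance groups (the console exposes this as Resize, the CLI as aws emr modify-instance-groups). A minimal sketch with the AWS SDK for Java v1 from Scala; the instance-group id is hypothetical and can be looked up from the cluster's details:

    import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClientBuilder
    import com.amazonaws.services.elasticmapreduce.model.{InstanceGroupModifyConfig, ModifyInstanceGroupsRequest}

    object ResizeCluster extends App {
      val emr = AmazonElasticMapReduceClientBuilder.defaultClient()
      // Grow the CORE instance group from 3 to 5 nodes in place;
      // the running job flow keeps executing while the new nodes join.
      val request = new ModifyInstanceGroupsRequest().withInstanceGroups(
        new InstanceGroupModifyConfig()
          .withInstanceGroupId("ig-XXXXXXXXXXXXX") // hypothetical id
          .withInstanceCount(5))
      emr.modifyInstanceGroups(request)
    }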

Spark History Server behind Load Balancer is redirecting to HTTP

Submitted by 北城余情 on 2019-12-20 05:41:35
Question: I am currently running Spark on AWS EMR, but when it sits behind a Load Balancer (AWS ELB), traffic is redirected from HTTPS to HTTP, which then ends up getting denied because I don't allow HTTP traffic through the load balancer for the given port. It appears that this might stem from YARN acting as a proxy as well, but I am not sure. Source: https://stackoverflow.com/questions/56412083/spark-history-server-behind-load-balancer-is-redirecting-to-http

AWS EMR using Spark steps in cluster mode. Application application_ finished with failed status

Submitted by 跟風遠走 on 2019-12-20 05:12:20
Question: I'm trying to launch a cluster using the AWS CLI. I use the following command: aws emr create-cluster --name "Config1" --release-label emr-5.0.0 --applications Name=Spark --use-default-role --log-uri 's3://aws-logs-813591802533-us-west-2/elasticmapreduce/' --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m1.medium InstanceGroupType=CORE,InstanceCount=2,InstanceType=m1.medium The cluster is created successfully. Then I add this command: aws emr add-steps --cluster-id ID…
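Since the excerpt cuts off before the step definition, here is a hedged sketch of what adding a Spark step in cluster mode can look like, done programmatically with the AWS SDK for Java v1 from Scala (the cluster id, main class, and jar path are hypothetical); on EMR, Spark steps run through command-runner.jar, which invokes spark-submit on the master node:

    import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClientBuilder
    import com.amazonaws.services.elasticmapreduce.model.{AddJobFlowStepsRequest, HadoopJarStepConfig, StepConfig}

    object AddSparkStep extends App {
      val emr = AmazonElasticMapReduceClientBuilder.defaultClient()
      // command-runner.jar relays its arguments to spark-submit.
      val sparkSubmit = new HadoopJarStepConfig()
        .withJar("command-runner.jar")
        .withArgs("spark-submit", "--deploy-mode", "cluster",
                  "--class", "com.example.Main", "s3://myBucket/app.jar")
      emr.addJobFlowSteps(new AddJobFlowStepsRequest()
        .withJobFlowId("j-XXXXXXXXXXXXX") // hypothetical cluster id
        .withSteps(new StepConfig("Spark step", sparkSubmit)
          .withActionOnFailure("CONTINUE")))
    }

When a step like this finishes with a failed status, the step's stdout/stderr under the configured --log-uri is usually the first place to look.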

How to avoid reading old files from S3 when appending new data?

Submitted by 允我心安 on 2019-12-19 12:06:15
Question: Once every 2 hours, a Spark job runs to convert some tgz files to Parquet. The job appends the new data to an existing Parquet dataset in S3: df.write.mode("append").partitionBy("id","day").parquet("s3://myBucket/foo.parquet") In the spark-submit output I can see that significant time is spent reading the old Parquet files, for example: 16/11/27 14:06:15 INFO S3NativeFileSystem: Opening 's3://myBucket/foo.parquet/id=123/day=2016-11-26/part-r-00003-b20752e9-5d70-43f5-b8b4-50b5b4d0c7da.snappy.parquet'…
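One common mitigation, offered as a sketch rather than the asker's actual fix: write each batch directly into its target partition directory instead of appending at the dataset root, so the writer never lists the existing files, and disable Parquet schema merging so later reads don't touch every old footer. The paths and partition values below are hypothetical:

    import org.apache.spark.sql.SparkSession

    object AppendBatch extends App {
      val spark = SparkSession.builder().appName("append-batch").getOrCreate()

      // Don't merge schemas across all existing files when reading back.
      spark.conf.set("spark.sql.parquet.mergeSchema", "false")

      // Hypothetical: the freshly converted batch for one (id, day) pair.
      val newBatch = spark.read.parquet("s3://myBucket/staging/current-batch")

      newBatch
        .drop("id", "day") // the partition values live in the path itself
        .write
        .mode("overwrite")
        // Writing to the leaf directory avoids listing s3://myBucket/foo.parquet/
        .parquet("s3://myBucket/foo.parquet/id=123/day=2016-11-26")
    }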

AWS EMR 5.11.0 - Apache Hive on Spark

Submitted by 旧街凉风 on 2019-12-19 09:44:10
Question: I am trying to set up Apache Hive on Spark on AWS EMR 5.11.0. Apache Spark version: 2.2.1. Apache Hive version: 2.3.2. The YARN logs show the error below: 18/01/28 21:55:28 ERROR ApplicationMaster: User class threw exception: java.lang.NoSuchFieldError: SPARK_RPC_SERVER_ADDRESS java.lang.NoSuchFieldError: SPARK_RPC_SERVER_ADDRESS at org.apache.hive.spark.client.rpc.RpcConfiguration.<init>(RpcConfiguration.java:47) at org.apache.hive.spark.client.RemoteDriver.<init>(RemoteDriver.java:134) at org.apache.hive.spark…