amazon-emr

file path in hdfs

Submitted by 与世无争的帅哥 on 2020-01-01 04:35:08
问题 Question: I want to read a file from the Hadoop File System. To build the correct path to the file, I need the host name and port of HDFS, so the final path will look something like Path path = new Path("hdfs://123.23.12.4344:9000/user/filename.txt"). Now I want to know how to extract the host name ("123.23.12.4344") and the port (9000). Basically, I want to access the FileSystem on Amazon EMR, but when I use FileSystem fs = FileSystem.get(getConf()); I get "You possibly called …
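
Since the excerpt is cut off before any answer, here is a minimal sketch of one way to recover the NameNode host and port on an EMR node: read fs.defaultFS from the cluster's Hadoop client configuration and parse it as a URI. The config-file path and printed output below are assumptions about a typical EMR layout, not details taken from the original post.

    import xml.etree.ElementTree as ET
    from urllib.parse import urlparse

    # Assumed location of the Hadoop client config on an EMR node.
    CORE_SITE = "/etc/hadoop/conf/core-site.xml"

    default_fs = None
    for prop in ET.parse(CORE_SITE).getroot().findall("property"):
        if prop.findtext("name") in ("fs.defaultFS", "fs.default.name"):
            default_fs = prop.findtext("value")  # e.g. "hdfs://ip-10-0-0-1.ec2.internal:8020"

    uri = urlparse(default_fs)
    # Host and port to build Path("hdfs://host:port/user/filename.txt")
    print(uri.hostname, uri.port)

Inside a running Hadoop or Spark job the same information is usually available directly from the Configuration object (for example FileSystem.getUri() in Java), so parsing the XML by hand is only needed outside the JVM.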

S3 and EMR data locality

Submitted by 為{幸葍}努か on 2020-01-01 02:31:27
问题 Question: Data locality is very important with MapReduce and HDFS (the same goes for Spark and HBase). I've been researching AWS and the two options for deploying a cluster in their cloud: EC2, or EMR + S3. The second option seems more appealing for several reasons, the most interesting being the ability to scale storage and processing separately and to shut down processing when you don't need it (more precisely, to turn it on only when needed). This is an example explaining the advantages of …

AWS CLI EMR get Master node Instance ID and tag it

Submitted by 荒凉一梦 on 2019-12-30 10:38:31
问题 Question: I want to automate running a cluster and can use tags to get attributes of an EC2 instance, such as its instance ID. The documentation at https://docs.aws.amazon.com/cli/latest/reference/emr/create-cluster.html states: --tags (list): A list of tags to associate with a cluster, which apply to each Amazon EC2 instance in the cluster. Tags are key-value pairs that consist of a required key string with a maximum of 128 characters and an optional value string with a maximum of 256 …
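
The question asks about the AWS CLI, but the same lookup is easy to illustrate with boto3. This is a hedged sketch with the cluster ID and tag values as placeholders: list the instances of the MASTER instance group, then tag the returned EC2 instance.

    import boto3

    CLUSTER_ID = "j-XXXXXXXXXXXXX"  # placeholder EMR cluster id

    emr = boto3.client("emr")
    ec2 = boto3.client("ec2")

    # Only instances belonging to the MASTER group are returned here.
    masters = emr.list_instances(ClusterId=CLUSTER_ID,
                                 InstanceGroupTypes=["MASTER"])["Instances"]
    master_ids = [i["Ec2InstanceId"] for i in masters]

    # Tag the master node(s) so later automation can find them.
    ec2.create_tags(Resources=master_ids,
                    Tags=[{"Key": "emr-role", "Value": "master"}])
    print(master_ids)

The equivalent CLI flow is roughly `aws emr list-instances --cluster-id ... --instance-group-types MASTER` followed by `aws ec2 create-tags` on the returned Ec2InstanceId.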

EMR Cluster Creation using Airflow dag run, Once task is done EMR will be terminated

Submitted by 左心房为你撑大大i on 2019-12-29 09:53:26
问题 Question: I have Airflow jobs that run fine on an EMR cluster. What I need is this: say I have 4 Airflow jobs that each need an EMR cluster for about 20 minutes to complete their task. Why can't we create an EMR cluster at DAG run time and, once the job finishes, terminate the cluster that was created? 回答1 Answer 1: Absolutely, that would be the most efficient use of resources. Let me warn you: there are a lot of details in this; I'll try to list as many as would get you going. I …
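
The answer is cut off, so as a hedged illustration of the pattern it describes (a transient cluster created at DAG run time), here is a boto3 sketch that could be wrapped in a PythonOperator: run_job_flow with KeepJobFlowAliveWhenNoSteps=False makes EMR terminate itself once the submitted steps finish. Cluster name, release, instance types, and the S3 script path are placeholders.

    import boto3

    emr = boto3.client("emr")

    response = emr.run_job_flow(
        Name="transient-airflow-cluster",          # placeholder name
        ReleaseLabel="emr-5.29.0",                 # placeholder release
        Applications=[{"Name": "Spark"}],
        Instances={
            "MasterInstanceType": "m5.xlarge",
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": 3,
            # Cluster terminates on its own once all steps are done.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        Steps=[{
            "Name": "spark-step",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "--deploy-mode", "cluster",
                         "s3://my-bucket/jobs/etl.py"],  # placeholder script
            },
        }],
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )
    print(response["JobFlowId"])

Airflow also ships EMR operators (for example EmrCreateJobFlowOperator and EmrTerminateJobFlowOperator) that wrap these same API calls, which keeps the DAG itself free of raw boto3 code.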

How to execute spark submit on amazon EMR from Lambda function?

Submitted by 强颜欢笑 on 2019-12-29 03:05:35
问题 Question: I want to execute a spark-submit job on an AWS EMR cluster based on a file-upload event on S3. I am using an AWS Lambda function to capture the event, but I have no idea how to submit a spark-submit job to the EMR cluster from the Lambda function. Most of the answers I found talked about adding a step to the EMR cluster, but I do not know whether I can add a step that fires "spark-submit --with args". 回答1 Answer 1: You can, I had to do the same thing last week! Using boto3 for Python (other languages …
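
Since the answer is truncated right after "Using boto3 for Python", here is a hedged sketch of the approach it names: a Lambda handler that adds a spark-submit step to an existing cluster when the S3 event fires. The cluster ID and job-script location are placeholders.

    import boto3

    emr = boto3.client("emr")
    CLUSTER_ID = "j-XXXXXXXXXXXXX"  # placeholder: the target EMR cluster

    def lambda_handler(event, context):
        # Use the uploaded object's location as an argument to the Spark job.
        record = event["Records"][0]["s3"]
        s3_path = f"s3://{record['bucket']['name']}/{record['object']['key']}"

        emr.add_job_flow_steps(
            JobFlowId=CLUSTER_ID,
            Steps=[{
                "Name": f"process {s3_path}",
                "ActionOnFailure": "CONTINUE",
                "HadoopJarStep": {
                    # command-runner.jar runs the given command on the master node.
                    "Jar": "command-runner.jar",
                    "Args": ["spark-submit", "--deploy-mode", "cluster",
                             "s3://my-bucket/jobs/process.py",  # placeholder script
                             s3_path],
                },
            }],
        )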

How do I scale my AWS EMR cluster with 1 master and 2 core nodes using AWS auto scaling? Is there a way?

Submitted by 不羁岁月 on 2019-12-25 11:15:24
问题 Question: I have implemented a cluster using AWS EMR, with one master node and 2 core nodes and a Hadoop bootstrap action. Now I would like to use auto scaling and adjust the cluster size dynamically based on a CPU threshold and some other constraints. But I have no idea how, as there isn't much information on the web about using Auto Scaling on an already existing cluster. Any help? 回答1 Answer 1: Currently you can't launch an EMR cluster in an Auto Scaling group, but you can achieve a very similar goal by delivering your …
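
The answer is cut off mid-sentence. As a hedged sketch of one way to resize an existing cluster without EC2 Auto Scaling groups, the EMR API lets you change the instance count of a running instance group; the scaling decision itself (for example a CloudWatch CPU alarm) is left out here, and the cluster ID is a placeholder.

    import boto3

    emr = boto3.client("emr")
    CLUSTER_ID = "j-XXXXXXXXXXXXX"  # placeholder cluster id

    # Find the CORE instance group of the running cluster.
    groups = emr.list_instance_groups(ClusterId=CLUSTER_ID)["InstanceGroups"]
    core = next(g for g in groups if g["InstanceGroupType"] == "CORE")

    # Grow (or shrink) the group; EMR adds or removes nodes to match the count.
    emr.modify_instance_groups(
        ClusterId=CLUSTER_ID,
        InstanceGroups=[{"InstanceGroupId": core["Id"],
                         "InstanceCount": core["RequestedInstanceCount"] + 2}],
    )

Newer EMR releases also offer native automatic scaling policies attached to instance groups (the put_auto_scaling_policy API), which may be closer to what the question asks for; check the current EMR documentation for availability.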

Spark AWS emr checkpoint location

Submitted by 六月ゝ 毕业季﹏ on 2019-12-25 09:10:45
问题 Question: I'm running a Spark job on EMR but need to create a checkpoint. I tried using S3 but got this error message: 17/02/24 14:34:35 ERROR ApplicationMaster: User class threw exception: java.lang.IllegalArgumentException: Wrong FS: s3://spark-jobs/checkpoint/31d57e4f-dbd8-4a50-ba60-0ab1d5b7b14d/connected-components-e3210fd6/2, expected: hdfs://ip-172-18-13-18.ec2.internal:8020 java.lang.IllegalArgumentException: Wrong FS: s3://spark-jobs/checkpoint/31d57e4f-dbd8-4a50-ba60-0ab1d5b7b14d/connected- …
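
The excerpt stops before any answer. The error itself says the checkpoint directory is expected on the cluster's default file system (HDFS) rather than S3, so a minimal PySpark sketch consistent with that message points the checkpoint directory at HDFS; the exact path below is an assumption.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("checkpoint-example").getOrCreate()

    # An HDFS path matches the "expected: hdfs://..." part of the error;
    # the directory name is just an example.
    spark.sparkContext.setCheckpointDir("hdfs:///user/spark/checkpoints")

    rdd = spark.sparkContext.parallelize(range(100))
    rdd.checkpoint()  # materialized in the HDFS directory on the next action
    print(rdd.count())

Keep in mind that HDFS on EMR lives on the cluster's instance storage, so anything that must outlive the cluster still needs to be copied to S3 after the job finishes.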

Tables not found in Spark SQL after migrating from EMR to AWS Glue

Submitted by 耗尽温柔 on 2019-12-25 01:18:40
问题 Question: I have Spark jobs on EMR, and EMR is configured to use the Glue catalog for Hive and Spark metadata. I create Hive external tables, they appear in the Glue catalog, and my Spark jobs can reference them in Spark SQL like spark.sql("select * from hive_table ..."). Now, when I try to run the same code in a Glue job, it fails with a "table not found" error. It looks like Glue jobs are not using the Glue catalog for Spark SQL the same way that Spark SQL does when running on EMR. I can work around …
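
The workaround mentioned at the end is cut off. One commonly suggested fix (an assumption here, not confirmed by the excerpt) is to make sure the Glue job's SparkSession is built with Hive support, so Spark SQL resolves tables through the Hive/Glue catalog instead of its default in-memory catalog.

    from pyspark.sql import SparkSession

    # enableHiveSupport() makes spark.sql() resolve table names through the
    # Hive metastore client, which on EMR/Glue can be backed by the Glue Data Catalog.
    spark = (SparkSession.builder
             .appName("glue-catalog-example")
             .enableHiveSupport()
             .getOrCreate())

    # "hive_table" is the question's own example table name.
    spark.sql("select * from hive_table limit 10").show()

Glue ETL jobs also expose a special job parameter, --enable-glue-datacatalog, intended to point the Hive metastore client at the Glue Data Catalog; verify that flag against the current Glue documentation before relying on it.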

Too many open files in spark aborting spark job

Submitted by Deadly on 2019-12-25 00:18:52
问题 Question: In my application I am reading 40 GB of text data spread across 188 files. I split these files and create one XML file per line in Spark using a pair RDD. For 40 GB of input this creates many millions of small XML files, and that is my requirement. Everything works fine, but when Spark saves the files to S3 it throws an error and the job fails. Here is the exception I get: Caused by: java.nio.file.FileSystemException: /mnt/s3/emrfs-2408623010549537848/0000000000: Too many open files at sun.nio.fs …
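
The excerpt ends inside the stack trace. The usual remedies are to raise the open-file limit (ulimit -n) on the nodes via a bootstrap action, or to avoid holding many output files open at once. As a hedged sketch of the second idea, and not the original poster's code, each partition can upload its records to S3 one at a time so file descriptors never accumulate; the bucket name and XML content are placeholders.

    import boto3
    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    # Tiny stand-in for the question's pair RDD of (file name, xml line) records.
    pair_rdd = sc.parallelize([("rec-1", "<row id='1'/>"), ("rec-2", "<row id='2'/>")])

    def write_partition(records):
        s3 = boto3.client("s3")  # one client per task, created on the executor
        for key, xml_body in records:
            # Each record is uploaded and the request completes before the next
            # one starts, so the task never keeps many handles open at once.
            s3.put_object(Bucket="my-output-bucket",        # placeholder bucket
                          Key=f"xml/{key}.xml",
                          Body=xml_body.encode("utf-8"))

    pair_rdd.foreachPartition(write_partition)

This bypasses EMRFS's local buffer files entirely; whether that is acceptable depends on the job, so raising the ulimit with a bootstrap action remains the less invasive option.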