amazon-emr

Facing an error while trying to create a transient cluster on AWS EMR to run a Python script

谁说胖子不能爱 submitted on 2020-08-10 19:17:38

Question: I am new to AWS and am trying to create a transient cluster on Amazon EMR to run a Python script. I just want to run the Python script that will process the file and auto-terminate the cluster after completion. I have also created a key pair and specified it. Command below: aws emr create-cluster --name "test1-cluster" --release-label emr-5.5.0 --name pyspark_analysis --ec2-attributes KeyName=k-key-pair --applications Name=Hadoop Name=Hive Name=Spark --instance-groups --use-default-roles -
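A minimal sketch of the same request as a boto3 `run_job_flow` payload, for comparison with the CLI flags. The script path and instance types are placeholders, not values from the question; the actual API call is left commented out since it needs AWS credentials. The key setting for a transient cluster is `KeepJobFlowAliveWhenNoSteps: False` (the CLI equivalent is `--auto-terminate`).

```python
# Sketch: a transient EMR cluster that runs one PySpark step, then terminates.
# Bucket/script names and instance types are illustrative placeholders.
params = {
    "Name": "test1-cluster",
    "ReleaseLabel": "emr-5.5.0",
    "Applications": [{"Name": "Hadoop"}, {"Name": "Hive"}, {"Name": "Spark"}],
    "Instances": {
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m4.large", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m4.large", "InstanceCount": 2},
        ],
        "Ec2KeyName": "k-key-pair",
        # False => cluster shuts down after the last step (transient cluster).
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    "Steps": [{
        "Name": "pyspark_analysis",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",  # EMR's generic command runner
            "Args": ["spark-submit", "s3://my-bucket/process_file.py"],  # placeholder path
        },
    }],
    "JobFlowRole": "EMR_EC2_DefaultRole",
    "ServiceRole": "EMR_DefaultRole",
}
# import boto3
# boto3.client("emr").run_job_flow(**params)
```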

Converting Spark dataframe to pandas dataframe - ImportError: Pandas >= 0.19.2 must be installed

强颜欢笑 submitted on 2020-08-10 06:12:12

Question: I am trying to convert a Spark dataframe to a pandas dataframe. I am trying this in a Jupyter notebook on EMR, and I am getting the following error. The pandas library is installed on the master node under my user, and using the Spark shell (pyspark) I am able to convert the df to a pandas df on that master node. The following command has been executed on all the master nodes: pip --no-cache-dir install pandas --user. The following works on the master node, but not from the pyspark notebook: import Pandas as pd. Error: No module named
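A common cause of this symptom (an assumption, not confirmed by the question): `pip install --user` puts pandas under one specific interpreter's user site-packages, while the PySpark notebook kernel runs a different interpreter that cannot see it. Note also that module names are case-sensitive, so `import Pandas as pd` would fail even where pandas is installed; the correct form is `import pandas as pd`. A small diagnostic sketch:

```python
# Diagnostic sketch: check which interpreter the notebook kernel uses and
# where `pip install --user` would have placed pandas for it.
import sys
import site

print(sys.executable)              # interpreter the notebook kernel actually runs
print(site.getusersitepackages())  # user site-packages this interpreter searches

try:
    import pandas  # note: lowercase; `import Pandas` fails on case-sensitive setups
    print("pandas", pandas.__version__, "at", pandas.__file__)
except ImportError:
    # Likely fix (assumption): install pandas for the interpreter printed above,
    # on every node, e.g. via an EMR bootstrap action or
    # `sudo <that interpreter> -m pip install pandas`.
    print("pandas is not importable from this interpreter")
```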

All executors dead MinHash LSH PySpark approxSimilarityJoin self-join on EMR cluster

。_饼干妹妹 submitted on 2020-08-09 13:35:23

Question: I run into problems when calling Spark MinHashLSH's approxSimilarityJoin on a dataframe of (name_id, name) combinations. A summary of the problem I am trying to solve: I have a dataframe of around 30 million unique (name_id, name) combinations for company names. Some of those names refer to the same company but are (i) misspelled and/or (ii) include additional names. Performing fuzzy string matching for every combination is not feasible. To reduce the number of fuzzy string matching

What's the difference between EMR_EC2_DefaultRole and EMR_DefaultRole?

邮差的信 submitted on 2020-08-07 06:48:31

Question: After an AWS EMR cluster has launched, I've noticed that it has an EC2 instance profile EMR_EC2_DefaultRole and an EMR role EMR_DefaultRole. They have similar permissions, so what's the difference between EMR_EC2_DefaultRole and EMR_DefaultRole? Answer 1: Per the documentation: EMR role: The EMR role defines the allowable actions for Amazon EMR when provisioning resources and performing other service-level tasks that are not performed in the context of an EC2 instance running within a cluster. The default role is
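The practical difference is who is allowed to assume each role: the service role (EMR_DefaultRole) is assumed by the EMR service itself, while the instance profile role (EMR_EC2_DefaultRole) is assumed by the EC2 instances inside the cluster, so applications running on the nodes use its credentials. A sketch of the two trust policies (the service principals are the documented ones; the surrounding JSON is the standard IAM trust-policy shape):

```python
# The two default roles differ in *who* is trusted to assume them, not just
# in their permission sets.
emr_service_role_trust = {  # EMR_DefaultRole: assumed by the EMR service
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "elasticmapreduce.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

emr_ec2_role_trust = {  # EMR_EC2_DefaultRole: assumed by the cluster's EC2 instances
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "ec2.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}
```

So EMR_DefaultRole governs what Amazon EMR may do on your behalf (e.g. launch instances), while EMR_EC2_DefaultRole governs what code running on the nodes may do (e.g. read your S3 buckets).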

How can multiple files be specified with “-files” in the CLI of Amazon for EMR?

左心房为你撑大大i submitted on 2020-07-23 04:08:08

Question: I am trying to start an Amazon cluster via the Amazon CLI, but I am a little bit confused about how I should specify multiple files. My current call is as follows: aws emr create-cluster --steps Type=STREAMING,Name='Intra country development',ActionOnFailure=CONTINUE,Args=[-files,s3://betaestimationtest/mapper.py,-files,s3://betaestimationtest/reducer.py,-mapper,mapper.py,-reducer,reducer.py,-input,s3://betaestimationtest/output_0_inter,-output,s3://betaestimationtest/output_1_intra] --ami
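The commonly cited fix (an assumption; the accepted answer is not included in this excerpt) is that Hadoop streaming expects a single `-files` option whose value is a comma-separated list, rather than the option being repeated. A sketch of the corrected step arguments, in the list form that `boto3`'s `add_job_flow_steps` takes:

```python
# Sketch: one -files flag carrying both scripts as a comma-separated value,
# instead of repeating -files per file.
files = ",".join([
    "s3://betaestimationtest/mapper.py",
    "s3://betaestimationtest/reducer.py",
])

args = [
    "-files", files,
    "-mapper", "mapper.py",
    "-reducer", "reducer.py",
    "-input", "s3://betaestimationtest/output_0_inter",
    "-output", "s3://betaestimationtest/output_1_intra",
]

# Equivalent CLI shorthand: wrap the comma-separated value in quotes inside
# Args=[...] so the inner commas are not parsed as argument separators, e.g.
#   Args=[-files,"s3://betaestimationtest/mapper.py,s3://betaestimationtest/reducer.py",-mapper,mapper.py,...]
```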

How to restart Spark service in EMR after changing conf settings?

旧城冷巷雨未停 submitted on 2020-07-04 09:00:13

Question: I am using EMR-5.9.0, and after changing some configuration files I want to restart the service to see the effect. How can I achieve this? I tried to find the name of the service using initctl list, as I saw in other answers, but no luck... Answer 1: Since Spark runs as an application on Hadoop YARN, you can try: sudo stop hadoop-yarn-resourcemanager; sudo start hadoop-yarn-resourcemanager. If you meant the Spark History Server, then you can use: sudo stop spark-history-server; sudo start spark-history

Cannot connect to remote MongoDB from EMR cluster with spark-shell

╄→гoц情女王★ submitted on 2020-06-29 12:31:55

Question: I'm trying to connect to a remote MongoDB database from an EMR cluster. The following code is executed with the command spark-shell --packages com.stratio.datasource:spark-mongodb_2.10:0.11.2: import com.stratio.datasource.mongodb._ import com.stratio.datasource.mongodb.config._ import com.stratio.datasource.mongodb.config.MongodbConfig._ val builder = MongodbConfigBuilder(Map(Host -> List("[IP.OF.REMOTE.HOST]:3001"), Database -> "meteor", Collection ->"my_target_collection", ("user", "user