amazon-emr

%matplotlib inline magic command fails to read variables from previous cells in AWS-EMR Jupyterhub Notebook

Posted by 孤者浪人 on 2020-02-02 13:34:18

Question: I'm trying to plot a Spark dataset with matplotlib after converting it to a pandas DataFrame in AWS EMR JupyterHub. I can plot inside a single cell like this:

```python
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt

df = [1, 1.6, 3, 4.2, 5, 4, 2.5, 3, 1.5]
plt.plot(df)
```

The snippet above works neatly for me. After this sample, I moved on to plotting my pandas DataFrame across new/multiple cells in AWS EMR JupyterHub, like this: -Cell 1-
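With the sparkmagic-based kernels that EMR JupyterHub provides, each cell runs on the cluster, so figure state does not carry over to the local notebook the way it does in a plain IPython kernel. One common pattern (a sketch, assuming the sparkmagic PySpark kernel; the table and column names are illustrative) is to pull the data down with `%%sql -o` and plot it in a `%%local` cell:

```
# Cell 1 — runs on the cluster; -o stores the query result in the
# local notebook as a pandas DataFrame named pdf
%%sql -o pdf
SELECT x FROM my_table

# Cell 2 — runs in the local notebook, where %matplotlib inline works
%%local
%matplotlib inline
import matplotlib.pyplot as plt
plt.plot(pdf["x"])
```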

fs.s3 configuration with two s3 account with EMR

Posted by 若如初见. on 2020-01-25 10:10:23

Question: I have a pipeline using Lambda and EMR in which I read a CSV from S3 in account A and write Parquet to S3 in account B. I created the EMR cluster in account B, and it has access to S3 in account B. I cannot grant access to account A's S3 bucket in EMR_EC2_DefaultRole (that account is the enterprise-wide data store), so I use an access key and secret key to reach account A's bucket. These are obtained through a Cognito token. METHOD 1: I am using the fs.s3 protocol to read the CSV from account A's S3 and write to S3 in account B.
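If the open-source S3A connector is an option (as opposed to EMRFS's fs.s3), Hadoop supports per-bucket configuration, which scopes the static keys to just the foreign bucket instead of making them cluster-wide. A sketch, assuming the account-A bucket is named `account-a-bucket`:

```xml
<!-- core-site.xml: these credentials apply only to s3a://account-a-bucket/ ;
     all other buckets keep using the EMR nodes' instance profile. -->
<property>
  <name>fs.s3a.bucket.account-a-bucket.access.key</name>
  <value>ACCESS_KEY_FROM_COGNITO</value>
</property>
<property>
  <name>fs.s3a.bucket.account-a-bucket.secret.key</name>
  <value>SECRET_KEY_FROM_COGNITO</value>
</property>
```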

Sending Commands from Jupyter/IPython running on EC2 to EMR cluster

Posted by ぐ巨炮叔叔 on 2020-01-25 01:25:12

Question: Can we send commands from a Jupyter/IPython notebook running on AWS EC2 to an AWS EMR cluster that holds our word-count code? I followed the following URL for installing Jupyter on EC2. There is another link that installs Jupyter on EMR and performs the word count. However, I want to separate the two, with Jupyter on EC2 and the word count executing on EMR. Is there any way this could be done? Answer 1: This AWS Big Data Blog post for running Zeppelin on an external EC2 host connected to an EMR cluster
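A common way to wire this up (a sketch, independent of the blog post above) is to install sparkmagic in the EC2-hosted Jupyter and point it at the Apache Livy server on the EMR master node, which listens on port 8998 by default; the master's DNS name below is a placeholder:

```json
{
  "kernel_python_credentials": {
    "username": "",
    "password": "",
    "url": "http://ip-10-0-0-1.ec2.internal:8998"
  }
}
```

This excerpt goes in `~/.sparkmagic/config.json`; code typed in the notebook is then shipped to the cluster over Livy's REST API rather than executed on the EC2 host.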

Creating Hive table on top of multiple parquet files in s3

Posted by 荒凉一梦 on 2020-01-24 20:51:07

Question: We have our dataset in S3 as Parquet files in the format below, with the data split into multiple files by row number:

```
data1_1000000.parquet
data1000001_2000000.parquet
data2000001_3000000.parquet
...
```

We have more than 2,000 such files, each holding a million records. All of the files have the same columns and structure, and one of the columns holds a timestamp in case we need to partition the dataset in Hive. How can we point at the dataset and create a single hive external
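The usual approach is to point one external table at the common S3 prefix; Hive then reads every Parquet file under that prefix as a single table. A sketch with a hypothetical schema (the columns `id`, `value`, `ts` and the bucket path are assumptions, not from the question):

```sql
-- LOCATION must be the shared prefix, not an individual file.
CREATE EXTERNAL TABLE my_dataset (
  id    BIGINT,
  value STRING,
  ts    TIMESTAMP
)
STORED AS PARQUET
LOCATION 's3://my-bucket/path/to/dataset/';
```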

How to use S3DistCp in java code

Posted by 别说谁变了你拦得住时间么 on 2020-01-24 09:33:50

Question: I want to copy the output of a job from an EMR cluster to Amazon S3 programmatically. How can I use S3DistCp in Java code to do this? Answer 1: Hadoop's ToolRunner can run this, since S3DistCp extends Tool. Below is a usage example:

```java
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.util.ToolRunner;
import com.amazon.external.elasticmapreduce.s3distcp.S3DistCp;

public class CustomS3DistCP {
    private static final Log log = LogFactory.getLog(CustomS3DistCP.class);

    public static void main(String[] args) throws Exception {
        // ToolRunner drives S3DistCp with the same arguments the CLI accepts.
        System.exit(ToolRunner.run(new S3DistCp(), args));
    }
}
```

AWS Glue pricing against AWS EMR

Posted by 妖精的绣舞 on 2020-01-21 03:20:29

Question: I am doing a pricing comparison between AWS Glue and AWS EMR so as to choose between them. I have considered 6 DPUs (4 vCPUs + 16 GB memory each) with the ETL job running for 10 minutes a day for 30 days. Expected crawler requests are assumed to be 1 million above the free tier, calculated at $1 for the additional 1 million requests. On EMR I have considered m3.xlarge for both EC2 and EMR (priced at $0.266 and $0.070 respectively) with 6 nodes, running for 10 minutes a day for 30 days. On calculating
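Under those assumptions the comparison is simple arithmetic. A sketch: the Glue DPU-hour rate of $0.44 is the published us-east-1 price at the time of the question and is an assumption here, and billing minimums (Glue's per-job minimum, EMR's per-second billing) are ignored:

```python
GLUE_DPU_HOUR = 0.44           # USD per DPU-hour (assumed us-east-1 rate)
EMR_NODE_HOUR = 0.266 + 0.070  # m3.xlarge EC2 price + EMR surcharge, per node

hours_total = (10 / 60) * 30   # 10 minutes a day for 30 days = 5 hours

glue_total = 6 * GLUE_DPU_HOUR * hours_total + 1.0  # 6 DPUs + $1 crawler
emr_total = 6 * EMR_NODE_HOUR * hours_total         # 6 nodes

print(f"Glue: ${glue_total:.2f}, EMR: ${emr_total:.2f}")
# → Glue: $14.20, EMR: $10.08
```

So under these (idealized) assumptions the EMR cluster comes out cheaper per month, though the gap narrows once cluster idle time between jobs is billed.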

How to integrate Ganglia for Spark 2.1 Job metrics, Spark ignoring Ganglia metrics

Posted by 让人想犯罪 __ on 2020-01-17 08:00:12

Question: I am trying to send Spark 2.1 job metrics to Ganglia. My spark-defaults.conf looks like:

```
*.sink.ganglia.class   org.apache.spark.metrics.sink.GangliaSink
*.sink.ganglia.name    Name
*.sink.ganglia.host    $MASTERIP
*.sink.ganglia.port    $PORT
*.sink.ganglia.mode    unicast
*.sink.ganglia.period  10
*.sink.ganglia.unit    seconds
```

When I submit my job I can see the warnings:

```
Warning: Ignoring non-spark config property: *.sink.ganglia.host=host
Warning: Ignoring non-spark config property: *.sink.ganglia.name
```
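The warnings themselves point at the cause: spark-defaults.conf only accepts `spark.*` keys, while metrics sink definitions belong in a separate metrics.properties file. A sketch of one way to wire that up at submit time (file paths and the job file are placeholders):

```
# metrics.properties — the *.sink.ganglia.* lines from above go here, unchanged.

# Ship the file with the job and point Spark's metrics system at it:
spark-submit \
  --files /path/to/metrics.properties \
  --conf spark.metrics.conf=metrics.properties \
  my_job.py
```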