amazon-emr

Specify minimum number of generated files from Hive insert

故事扮演 submitted on 2019-12-17 21:08:30
Question: I am using Hive on AWS EMR to insert the results of a query into a Hive table partitioned by date. Although the total output size each day is similar, the number of generated files varies, usually between 6 and 8, but some days it creates just a single big file. I reran the query a couple of times, in case the number of files was influenced by the availability of nodes in the cluster, but it seems consistent. So my questions are (a) what determines how many files are …
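
A hedged sketch, not taken from the question or an answer: Hive settings that are commonly used to influence how many files an INSERT produces, here wrapped as a prelude you could prepend to an existing HiveQL script. The exact property names and their effect depend on the Hive version and execution engine (MapReduce vs. Tez), so treat the values as illustrative.

```python
# Hedged sketch: Hive settings often used to merge small output files or pin
# the reducer (and thus file) count. Values are illustrative, not prescriptive.
HIVE_FILE_MERGE_PRELUDE = """
SET hive.merge.mapfiles=true;                 -- merge small files from map-only jobs
SET hive.merge.mapredfiles=true;              -- merge small files from map-reduce jobs
SET hive.merge.smallfiles.avgsize=134217728;  -- trigger a merge when avg file < ~128 MB
SET hive.merge.size.per.task=268435456;       -- target size of each merged file (~256 MB)
SET mapred.reduce.tasks=8;                    -- pin the reducer count (often = file count)
"""

def with_file_merge_settings(hql_script: str) -> str:
    """Prepend the merge settings to an existing HiveQL script."""
    return HIVE_FILE_MERGE_PRELUDE + "\n" + hql_script
```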

collect() or toPandas() on a large DataFrame in pyspark/EMR

徘徊边缘 submitted on 2019-12-17 14:53:28
Question: I have a single-machine EMR cluster ("c3.8xlarge"). After reading several resources, I understood that I have to allow a decent amount of off-heap memory because I am using pyspark, so I configured the cluster as follows. Executor: spark.executor.memory=6g, spark.executor.cores=10, spark.yarn.executor.memoryOverhead=4096; driver: spark.driver.memory=21g. When I cache() the DataFrame it takes about 3.6 GB of memory. Now when I call collect() or toPandas() on the DataFrame, the process crashes.
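
A hedged sketch, not the asker's code: two settings that commonly matter when collect() or toPandas() has to ship a multi-gigabyte DataFrame back to the driver, shown on a placeholder DataFrame rather than the asker's data.

```python
# Hedged sketch: options relevant when collect()/toPandas() returns a large result.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("large-toPandas")
    # Results returned to the driver are capped by this (default 1g);
    # a ~3.6 GB cached DataFrame easily exceeds it.
    .config("spark.driver.maxResultSize", "8g")
    # Arrow-based conversion (Spark 2.3+) makes toPandas() much lighter.
    .config("spark.sql.execution.arrow.enabled", "true")
    .getOrCreate()
)
# Note: spark.driver.memory / memoryOverhead generally have to be set at
# submit time (e.g. spark-submit --driver-memory 21g ...), not from inside the job.

df = spark.range(0, 10_000_000).selectExpr("id", "id * 2 AS doubled")
pdf = df.toPandas()  # the full result must still fit in driver memory
```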

Pass parameters to Hive script using the AWS PHP SDK

Deadly submitted on 2019-12-14 03:49:55
Question: I'm trying to run a Hive script on AWS EMR using the PHP SDK. How can I pass the script parameters (like input, output, and the dates to work on)? Thanks. Answer 1: If you are struggling with this as well... sample code for passing variables to a Hive script can be found in the following Amazon Forum thread. Answer 2: I've done this with the Java SDK; using the PHP SDK, essentially what you need to do is pass in the parameters you want with the add_job_flow_steps function. You need to add the parameters to the …
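
Not the PHP code the answers refer to, but an analogous sketch in Python (boto3) of the same idea: the Hive variables travel as extra arguments on the job-flow step and are read in the script as ${INPUT}, ${OUTPUT}, and so on. The cluster id, bucket paths, and variable names below are placeholders.

```python
# Analogous boto3 sketch of passing parameters to a Hive script as a job-flow step.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",  # placeholder cluster id
    Steps=[{
        "Name": "hive-with-params",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "hive-script", "--run-hive-script", "--args",
                "-f", "s3://my-bucket/scripts/daily.hql",
                # Each -d defines a variable the script can reference as ${INPUT} etc.
                "-d", "INPUT=s3://my-bucket/input/",
                "-d", "OUTPUT=s3://my-bucket/output/",
                "-d", "DATE=2019-12-14",
            ],
        },
    }],
)
```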

How to force Hadoop to unzip inputs regardless of their extension?

独自空忆成欢 submitted on 2019-12-14 02:02:18
Question: I'm running map-reduce and my inputs are gzipped, but they do not have a .gz (file name) extension. Normally, when they do have the .gz extension, Hadoop takes care of unzipping them on the fly before passing them to the mapper. However, without the extension it doesn't do so. I can't rename my files, so I need some way of "forcing" Hadoop to unzip them even though they do not have the .gz extension. I tried passing the following flags to Hadoop: step_args=[ "-jobconf", "stream.recordreader …
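
A hedged workaround sketch rather than the flag-based fix the question asks about: copy the S3 objects to a new prefix with a ".gz" suffix so Hadoop's extension-based codec detection kicks in. It assumes the inputs live in S3 and that copying (as opposed to renaming the originals in place) is acceptable; bucket and prefixes are placeholders.

```python
# Hedged workaround: copy extensionless gzip objects to a new prefix with ".gz"
# appended, so Hadoop's codec detection decompresses them automatically.
import boto3

s3 = boto3.client("s3")
bucket = "my-bucket"            # placeholder
src_prefix = "raw/2019-12-14/"  # placeholder
dst_prefix = "raw-gz/2019-12-14/"

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=src_prefix):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        if key.endswith("/"):
            continue  # skip "directory" marker objects
        new_key = dst_prefix + key[len(src_prefix):] + ".gz"
        s3.copy_object(
            Bucket=bucket,
            Key=new_key,
            CopySource={"Bucket": bucket, "Key": key},
        )
```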

Configure EMR to use s3a instead of s3 for spark.sql calls

旧巷老猫 submitted on 2019-12-13 18:41:25
Question: All my calls to spark.sql("") fail with the error in the stack trace (1) below. Update 2: I have zeroed in on the problem; it is AccessDenied for sts:AssumeRole, any leads appreciated: User: arn:aws:sts::00000000000:assumed-role/EMR_EC2_XXXXX_XXXXXX_POLICY/i-3232131232131232 is not authorized to perform: sts:AssumeRole on resource: arn:aws:iam::00000000000:role/EMR_XXXXXX_XXXXXX_POLICY. When the same location is accessed with spark.read.parquet("s3a://xxx.xxx-xxx-xx.xxxxx-xxxxx/xxx/") I was …
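
A hedged illustration of the title's question only: mapping the s3:// and s3n:// schemes onto the Hadoop s3a connector from Spark configuration. On EMR the s3:// scheme is normally served by EMRFS, and the sts:AssumeRole denial above is an IAM-policy matter, so this sketch does not claim to fix that error; bucket path is a placeholder.

```python
# Hedged sketch: point s3:// and s3n:// at the s3a connector via Spark config.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("s3a-example")
    .config("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .config("spark.hadoop.fs.s3n.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    # Use the instance-profile credentials available on EMR nodes.
    .config(
        "spark.hadoop.fs.s3a.aws.credentials.provider",
        "com.amazonaws.auth.InstanceProfileCredentialsProvider",
    )
    .getOrCreate()
)

df = spark.read.parquet("s3a://my-bucket/path/")  # placeholder path
```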

Is JSON4S compatible with Spark 2.4.0 and EMR 5.26.0?

♀尐吖头ヾ submitted on 2019-12-13 17:49:26
Question: Spark json4s [java.lang.NoSuchMethodError: org.json4s.jackson.JsonMethods$.parse(Lorg/json4s/Js]. I am getting the above error while parsing complex JSON when running a Spark Scala Structured Streaming application on AWS EMR. Answer 1: It looks like a binary compatibility error... Could you please check the dependency tree for incompatible versions of json4s artifacts? If you are not able to upgrade them to use the same version, you may be able to solve the problem by shading some of them with sbt-assembly …

Spark Catalog w/ AWS Glue: database not found

社会主义新天地 submitted on 2019-12-13 16:43:27
Question: I've created an EMR cluster with the Glue Data Catalog. When I invoke spark-shell, I am able to successfully list the tables stored within a Glue database via spark.catalog.setCurrentDatabase("test") and spark.catalog.listTables. However, when I submit a job via spark-submit I get a fatal error: ERROR ApplicationMaster: User class threw exception: org.apache.spark.sql.AnalysisException: Database 'test' does not exist.; I am creating my SparkSession within the job being submitted via spark-submit via …
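
A hedged sketch of the usual shape of the fix for this symptom: the SparkSession built inside a spark-submit job needs Hive support enabled and the Glue metastore client factory configured, otherwise the job falls back to an in-memory catalog that has no "test" database. It assumes an EMR cluster created with the Glue Data Catalog option; the factory class name is the one EMR documents.

```python
# Hedged sketch: SparkSession wired to the Glue Data Catalog inside a
# spark-submit job (assumes the EMR cluster was created with the Glue catalog).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("glue-catalog-example")
    .config(
        "hive.metastore.client.factory.class",
        "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory",
    )
    .enableHiveSupport()  # without this the job uses an in-memory catalog
    .getOrCreate()
)

spark.catalog.setCurrentDatabase("test")
for t in spark.catalog.listTables():
    print(t.name)
```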

Error starting Spark in EMR 4.0

心不动则不痛 submitted on 2019-12-13 14:29:29
Question: I created an EMR 4.0 instance in AWS with all available applications, including Spark. I did it manually, through the AWS Console. I started the cluster and SSHed to the master node when it was up. There I ran pyspark. I am getting the following error when pyspark tries to create the SparkContext: 2015-09-03 19:36:04,195 ERROR Thread-3 spark.SparkContext (Logging.scala:logError(96)) - -ec2-user, access=WRITE, inode="/user":hdfs:hadoop:drwxr-xr-x at org.apache.hadoop.hdfs.server.namenode …
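
A hedged reading of the truncated error: it looks like the common missing-HDFS-home-directory problem, where ec2-user has nowhere writable under /user on a fresh cluster. One fix, run on the master node as the hdfs superuser, is to create and chown a home directory; the sketch below just shells out to the standard hdfs commands.

```python
# Hedged sketch: create an HDFS home directory for ec2-user on the master node,
# assuming the error is the usual missing /user/<name> directory.
import subprocess

for args in (
    ["hdfs", "dfs", "-mkdir", "-p", "/user/ec2-user"],
    ["hdfs", "dfs", "-chown", "ec2-user:ec2-user", "/user/ec2-user"],
):
    subprocess.run(["sudo", "-u", "hdfs"] + args, check=True)
```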
