spark-submit

Copy files (config) from HDFS to local working directory of every spark executor

Submitted by 亡梦爱人 on 2019-12-01 13:10:30

I am looking for a way to copy a folder of resource-dependency files from HDFS to the local working directory of each Spark executor, using Java. At first I thought of using the --files FILES option of spark-submit, but it does not seem to support folders with arbitrarily nested files. So it appears I have to put this folder on a shared HDFS path and have each executor copy it into its working directory before running the job, but I have yet to find out how to do that correctly in Java code. Or zip/gzip/archive this folder, put it on a shared HDFS path, and then explode the archive to …
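A minimal sketch of one workaround, assuming YARN and a top-level conf/ folder (the archive name, alias, class, and jar names are illustrative): archive the folder and let spark-submit's --archives option extract it into every executor's working directory.

```
# Ship a nested config folder to every executor (YARN); names are hypothetical
zip -r conf.zip conf/

spark-submit \
  --class com.example.MyJob \
  --master yarn \
  --deploy-mode cluster \
  --archives conf.zip#conf \
  my-job.jar
```

Inside each executor the archive is unpacked under the alias after the #, so the job code can read paths like ./conf/app.properties relative to the working directory.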

spark-submit config through file

Submitted by 谁说我不能喝 on 2019-12-01 09:04:37

I am trying to deploy a Spark job using spark-submit, which takes a bunch of parameters, like: spark-submit --class Eventhub --master yarn --deploy-mode cluster --executor-memory 1024m --executor-cores 4 --files app.conf spark-hdfs-assembly-1.0.jar --conf "app.conf" I was looking for a way to put all these flags in a file and pass that file to spark-submit, so that my command becomes as simple as: spark-submit --class Eventhub --master yarn --deploy-mode cluster --config-file my-app.cfg --files app.conf spark-hdfs-assembly-1.0.jar --conf "app.conf" Does anyone know if this is possible? You can use --properties…
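spark-submit does accept a properties file via --properties-file, which is presumably what the truncated answer refers to; a sketch, keeping the my-app.cfg name from the question and mirroring the flags above as spark.* keys:

```
# my-app.cfg -- keys mirror the command-line flags
spark.master             yarn
spark.submit.deployMode  cluster
spark.executor.memory    1024m
spark.executor.cores     4

# the command then shrinks to:
spark-submit --class Eventhub --properties-file my-app.cfg \
  --files app.conf spark-hdfs-assembly-1.0.jar --conf "app.conf"
```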

ClassNotFoundException scala.runtime.LambdaDeserialize when spark-submit

Submitted by ≡放荡痞女 on 2019-11-29 15:09:33

I am following the Scala tutorial at https://spark.apache.org/docs/2.1.0/quick-start.html My Scala file:

/* SimpleApp.scala */
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object SimpleApp {
  def main(args: Array[String]) {
    val logFile = "/data/README.md" // Should be some file on your system
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    val logData = sc.textFile(logFile, 2).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains…
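A likely cause, not stated in the excerpt: scala.runtime.LambdaDeserialize only exists from Scala 2.12 on, while Spark 2.1.0 is built against Scala 2.11, so the error typically means the application was compiled with 2.12. A minimal build.sbt sketch that pins the Scala version to match (version numbers are illustrative):

```
// build.sbt -- compile against the Scala line Spark 2.1.0 ships with
name         := "simple-app"
scalaVersion := "2.11.12"

libraryDependencies += "org.apache.spark" %% "spark-core" % "2.1.0" % "provided"
```

Rebuilding the jar with sbt package after this change, and submitting that jar, avoids the 2.12-only LambdaDeserialize bootstrap.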

How to save a file on the cluster

Submitted by 断了今生、忘了曾经 on 2019-11-29 13:18:23

I'm connected to the cluster using ssh, and I send the program to the cluster using spark-submit --master yarn myProgram.py I want to save the result in a text file, and I tried the following lines: counts.write.json("hdfs://home/myDir/text_file.txt") counts.write.csv("hdfs://home/myDir/text_file.csv") However, neither of them works. The program finishes and I cannot find the text file in myDir. Do you have any idea how I can do this? Also, is there a way to write directly to my local machine? EDIT: I found out that the home directory doesn't exist, so now I save the result as: counts.write.json(…
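One likely culprit, assuming an absolute HDFS path was intended: in hdfs://home/myDir/... the URI parser treats home as the namenode host, not as a directory. A sketch with a fully qualified URI instead (host, port, and user directory are illustrative, not from the question):

```
# inside myProgram.py -- the output path names a directory of part files, not a single file
counts.write.json("hdfs://namenode-host:8020/user/myuser/counts_json")
# or rely on the default filesystem configured in core-site.xml:
counts.write.json("/user/myuser/counts_json")
```

To get the result onto the local machine, something like hdfs dfs -getmerge /user/myuser/counts_json counts.json (run wherever the HDFS client is configured) concatenates the part files into one local file.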

Spark standalone connection driver to worker

Submitted by 北慕城南 on 2019-11-29 12:50:55

I'm trying to host a Spark standalone cluster locally. I have two heterogeneous machines connected on a LAN. Each piece of the architecture listed below runs in Docker. I have the following configuration: a master on machine 1 (port 7077 exposed), a worker on machine 1, and a driver on machine 2. I use a test application that opens a file and counts its lines. The application works when the file is replicated on all workers and I use SparkContext.textFile(). But when the file is only present on the worker and I use SparkContext.parallelize() to access it on the workers, I get the following output: …
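A sketch of the driver-side networking settings that usually matter when the driver runs in Docker on a different machine from the master and worker (addresses, ports, and the jar name are illustrative, not taken from the post):

```
spark-submit \
  --master spark://machine1:7077 \
  --conf spark.driver.host=192.168.1.20 \
  --conf spark.driver.bindAddress=0.0.0.0 \
  --conf spark.driver.port=7078 \
  --conf spark.blockManager.port=7079 \
  test-app.jar
```

In standalone mode the executors open connections back to the driver, so these ports must also be published by the driver's Docker container and reachable from machine 1.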

How to execute spark submit on amazon EMR from Lambda function?

核能气质少年 提交于 2019-11-28 18:29:11
I want to execute spark submit job on AWS EMR cluster based on the file upload event on S3. I am using AWS Lambda function to capture the event but I have no idea how to submit spark submit job on EMR cluster from Lambda function. Most of the answers that i searched talked about adding a step in the EMR cluster. But I do not know if I can add add any step to fire "spark submit --with args" in the added step. You can, I had to same thing last week! Using boto3 for Python (other languages would definitely have a similar solution) you can either start a cluster with the defined step, or attach a
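A sketch of the boto3 route the answer alludes to, adding a spark-submit step to an already running cluster from the Lambda handler (the cluster id, region, S3 paths, and argument list are all hypothetical):

```
import boto3

def lambda_handler(event, context):
    # Add a spark-submit step to an existing EMR cluster when an S3 object lands
    emr = boto3.client('emr', region_name='us-east-1')
    uploaded_key = event['Records'][0]['s3']['object']['key']  # key from the S3 event

    response = emr.add_job_flow_steps(
        JobFlowId='j-XXXXXXXXXXXXX',          # hypothetical cluster id
        Steps=[{
            'Name': 'spark-job-from-lambda',
            'ActionOnFailure': 'CONTINUE',
            'HadoopJarStep': {
                'Jar': 'command-runner.jar',  # EMR's generic command runner
                'Args': [
                    'spark-submit', '--deploy-mode', 'cluster',
                    's3://my-bucket/jobs/my_job.py',
                    uploaded_key,             # pass the uploaded key as a job argument
                ],
            },
        }],
    )
    return response['StepIds']
```

The same client also exposes run_job_flow for the "start a cluster with the step already defined" variant mentioned in the answer.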

Add jars to a Spark Job - spark-submit

Submitted by 泄露秘密 on 2019-11-26 00:39:17

Question: True, it has been discussed quite a lot. However, there is a lot of ambiguity, and some of the answers provided … including duplicating jar references in the jars/executor/driver configuration or options. The following ambiguous, unclear, and/or omitted details should be clarified for each option: how the ClassPath is affected (driver, executor (for tasks running), both, or not at all); the separation character (comma, colon, or semicolon); whether provided files are automatically …
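For orientation, the options under discussion look roughly like this (paths are illustrative): --jars takes a comma-separated list and puts the jars on both the driver and executor classpaths, while --driver-class-path uses the usual colon-separated classpath syntax and affects the driver only.

```
# Illustrative ways to hand extra jars to a Spark job
spark-submit \
  --class com.example.Main \
  --master yarn \
  --jars /libs/dep1.jar,/libs/dep2.jar \
  --driver-class-path /libs/dep1.jar:/libs/dep2.jar \
  app.jar
```

The equivalent configuration keys (spark.jars, spark.driver.extraClassPath, spark.executor.extraClassPath) can also be set with --conf or in spark-defaults.conf, which is where the duplication the question mentions tends to creep in.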

How to stop INFO messages displaying on spark console?

Submitted by 半世苍凉 on 2019-11-26 00:27:13

Question: I'd like to stop the various messages that appear on the Spark shell. I tried to edit the log4j.properties file in order to stop these messages. Here are the contents of log4j.properties:

# Define the root logger with appender file
log4j.rootCategory=WARN, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: …
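Besides editing log4j.properties, the log level can also be lowered from the shell or the application itself once a SparkContext exists; a minimal Scala sketch (the WARN level mirrors the rootCategory above, the app name is illustrative):

```
// Silence INFO output programmatically
import org.apache.spark.{SparkConf, SparkContext}

object QuietApp {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("QuietApp"))
    sc.setLogLevel("WARN")   // valid levels include ALL, DEBUG, INFO, WARN, ERROR, OFF
    // ... job code ...
    sc.stop()
  }
}
```

In spark-shell the same call, sc.setLogLevel("WARN"), works on the pre-built sc without touching any properties file.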