I have already finished the Spark installation and executed a few test cases after setting up master and worker nodes. That said, I am quite confused about what exactly a "job" means in Spark terminology.
Hey, here's something I did before; hope it works for you:
#!/bin/bash
# Hadoop and server variables
HADOOP="hadoop fs"
HDFS_HOME="hdfs://ha-edge-group/user/max"
LOCAL_HOME="/home/max"
# Cluster variables
DRIVER_MEM="10G"
EXECUTOR_MEM="10G"
CORES="5"
EXECUTORS="15"
# Script arguments
SCRIPT="availability_report.py"   # arg[0]
APPNAME="Availability Report"     # arg[1]
DAY=$(date -d yesterday +%Y%m%d)

for HOUR in 00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23
do
    # Local file to getmerge into
    LOCAL_OUTFILE="$LOCAL_HOME/availability_report/data/$DAY/$HOUR.txt"
    # Script arguments
    HDFS_SOURCE="webhdfs://1.2.3.4:0000/data/lbs_ndc/raw_${DAY}_${HOUR}"   # arg[2]
    HDFS_CELLS="webhdfs://1.2.3.4:0000/data/cells/CELLID_$DAY.txt"         # arg[3]
    HDFS_OUT_DIR="$HDFS_HOME/availability/$DAY/$HOUR"                      # arg[4]

    spark-submit \
        --master yarn-cluster \
        --driver-memory "$DRIVER_MEM" \
        --executor-memory "$EXECUTOR_MEM" \
        --executor-cores "$CORES" \
        --num-executors "$EXECUTORS" \
        --conf spark.scheduler.mode=FAIR \
        "$SCRIPT" "$APPNAME" "$HDFS_SOURCE" "$HDFS_CELLS" "$HDFS_OUT_DIR"

    # Make sure the local target directory exists, then merge the HDFS part files into it
    mkdir -p "$(dirname "$LOCAL_OUTFILE")"
    $HADOOP -getmerge "$HDFS_OUT_DIR" "$LOCAL_OUTFILE"
done
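For context, availability_report.py itself is just a PySpark driver program that picks those positional arguments up from sys.argv. The real script isn't shown here, so the following is only a hypothetical sketch of how $APPNAME, $HDFS_SOURCE, $HDFS_CELLS and $HDFS_OUT_DIR would arrive on the Python side:

#!/usr/bin/env python
# Hypothetical sketch of availability_report.py (the real script is not shown above);
# it only illustrates how the spark-submit arguments map into the driver program.
import sys
from pyspark import SparkConf, SparkContext

if __name__ == "__main__":
    # sys.argv[0] is the script itself; 1..4 are APPNAME, HDFS_SOURCE, HDFS_CELLS, HDFS_OUT_DIR
    app_name, source_path, cells_path, out_dir = sys.argv[1:5]

    sc = SparkContext(conf=SparkConf().setAppName(app_name))

    raw = sc.textFile(source_path)      # hourly raw data
    cells = sc.textFile(cells_path)     # cell id reference data

    # ... the actual availability logic (joins, filters, aggregations) would go here ...

    raw.saveAsTextFile(out_dir)         # the save action is what triggers a Spark job

    sc.stop()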
Well, terminology can always be difficult since it depends on context. In many cases, you may be used to "submitting a job to a cluster", which for Spark would mean submitting a driver program.
That said, Spark has its own definition of "job", taken directly from its glossary:
Job: A parallel computation consisting of multiple tasks that gets spawned in response to a Spark action (e.g. save, collect); you'll see this term used in the driver's logs.
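To make that concrete, here is a minimal PySpark sketch (my own assumed example, not code from the question): only the two actions at the end trigger jobs, so this one driver program produces exactly two jobs in the driver's logs and in the web UI.

from pyspark import SparkContext

sc = SparkContext(appName="job-vs-action-demo")

rdd = sc.parallelize(range(1, 1001))        # no job yet: just defines the data source
evens = rdd.filter(lambda x: x % 2 == 0)    # still no job: filter is a lazy transformation

print(evens.count())                        # action #1 -> "Job 0" in the driver's logs
evens.saveAsTextFile("/tmp/evens_out")      # action #2 -> "Job 1" (path must not already exist)

sc.stop()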
So in this context, let's say you need to do the following:
So,
Hope it makes things clearer ;-)