I've encountered several examples of SparkAction jobs in Oozie, and most of them are in Java. I edited one a little and ran the example on Cloudera CDH Quickstart 5.4.0 (with Spark v
I too struggled a lot with the spark-action in Oozie. I set up the sharelib properly and tried to pass the appropriate jars using the --jars option within the <spark-opts></spark-opts> tags, but to no avail.
I always ended up with one error or another. The most I could do was run all Java/Python Spark jobs in local mode through the spark-action.
However, I got all my Spark jobs running in Oozie, in all modes of execution, using the shell action. The major problem with the shell action is that shell jobs are deployed as the 'yarn' user. If you deploy your Oozie Spark job from a user account other than yarn, you'll end up with a Permission Denied error (because that user cannot access the Spark assembly jar copied into the /user/yarn/.SparkStaging directory). The way to solve this is to set the HADOOP_USER_NAME environment variable to the user account through which you deploy your Oozie workflow.
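In plain spark-submit terms, the fix boils down to exporting the variable before submitting. This is only a sketch: the user name and paths mirror the example workflow and are illustrative, and it assumes a non-Kerberized cluster (where HADOOP_USER_NAME simply overrides the identity used for HDFS operations).

```shell
# Run HDFS operations as the submitting user rather than 'yarn', so Spark
# stages its assembly jar under that user's own staging directory.
# (Illustrative user name and paths; non-Kerberized cluster assumed.)
export HADOOP_USER_NAME=ambari-qa
/usr/hdp/current/spark-client/bin/spark-submit --master yarn-cluster wordcount.py
```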
Below is a workflow that illustrates this configuration. I deploy my Oozie workflows as the ambari-qa user.
<workflow-app xmlns="uri:oozie:workflow:0.4" name="sparkjob">
    <start to="spark-shell-node"/>
    <action name="spark-shell-node">
        <shell xmlns="uri:oozie:shell-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>oozie.launcher.mapred.job.queue.name</name>
                    <value>launcher2</value>
                </property>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>default</value>
                </property>
                <property>
                    <name>oozie.hive.defaults</name>
                    <value>/user/ambari-qa/sparkActionPython/hive-site.xml</value>
                </property>
            </configuration>
            <exec>/usr/hdp/current/spark-client/bin/spark-submit</exec>
            <argument>--master</argument>
            <argument>yarn-cluster</argument>
            <argument>wordcount.py</argument>
            <env-var>HADOOP_USER_NAME=ambari-qa</env-var>
            <file>/user/ambari-qa/sparkActionPython/wordcount.py#wordcount.py</file>
            <capture-output/>
        </shell>
        <ok to="end"/>
        <error to="spark-fail"/>
    </action>
    <kill name="spark-fail">
        <message>Shell action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
Hope this helps!
You should try configuring the Oozie Spark action to bring the needed files to the node locally. You can do this with a <file> tag:
<spark xmlns="uri:oozie:spark-action:0.1">
    <job-tracker>${resourceManager}</job-tracker>
    <name-node>${nameNode}</name-node>
    <master>local[2]</master>
    <mode>client</mode>
    <name>${name}</name>
    <jar>my_pyspark_job.py</jar>
    <file>{path to your file on hdfs}/my_pyspark_job.py#my_pyspark_job.py</file>
</spark>
Explanation: an Oozie action runs inside a YARN container, which YARN allocates on a node with available resources. Before running the action (which is effectively your "driver" code), it copies all the needed files (jars, for example) locally to that node, into the folder allocated for the YARN container's resources. So by adding the <file> tag to your Oozie action, you're telling it to bring my_pyspark_job.py locally to the node of execution.
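As a minimal illustration of the `#` syntax (paths are made up): the part before `#` is the HDFS source, and the part after it becomes the symlink name inside the container's working directory, the same split you could do in plain shell:

```shell
# A <file> value has the shape "hdfs-path#local-name": the file is downloaded
# from hdfs-path and symlinked as local-name in the container working dir.
src="/user/me/app/my_pyspark_job.py#my_pyspark_job.py"
hdfs_path="${src%%#*}"    # part before '#': where the file lives on HDFS
local_name="${src##*#}"   # part after '#': the name the action sees locally
echo "$hdfs_path"
echo "$local_name"
```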
In my case I want to run a bash script (run-hive-partitioner.bash) which in turn runs a Python script (hive-generic-partitioner.py), so I need all the files locally accessible on the node:
<action name="repair_hive_partitions">
    <shell xmlns="uri:oozie:shell-action:0.1">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <exec>${appPath}/run-hive-partitioner.bash</exec>
        <argument>${db}</argument>
        <argument>${tables}</argument>
        <argument>${base_working_dir}</argument>
        <file>${appPath}/run-hive-partitioner.bash#run-hive-partitioner.bash</file>
        <file>${appPath}/hive-generic-partitioner.py#hive-generic-partitioner.py</file>
        <file>${appPath}/util.py#util.py</file>
    </shell>
    <ok to="end"/>
    <error to="kill"/>
</action>
where ${appPath} is hdfs://ci-base.com:8020/app/oozie/util/wf-repair_hive_partitions
so this is what I get in my job:
Files in current dir:/hadoop/yarn/local/usercache/hdfs/appcache/application_1440506439954_3906/container_1440506439954_3906_01_000002/
======================
File: hive-generic-partitioner.py
File: util.py
File: run-hive-partitioner.bash
...
File: job.xml
File: json-simple-1.1.jar
File: oozie-sharelib-oozie-4.1.0.2.2.4.2-2.jar
File: launch_container.sh
File: oozie-hadoop-utils-2.6.0.2.2.4.2-2.oozie-4.1.0.2.2.4.2-2.jar
As you can see, Oozie (or actually YARN, I think) shipped all the needed files locally to the temp folder, and now it's able to run them.
I was able to "fix" this issue, although it led to another issue. Nonetheless, I'll still post it.
In stderr of the Oozie container logs, it shows:
Error: Only local python files are supported
And I found a solution here
This is my initial workflow.xml:
<spark xmlns="uri:oozie:spark-action:0.1">
    <job-tracker>${resourceManager}</job-tracker>
    <name-node>${nameNode}</name-node>
    <master>local[2]</master>
    <mode>client</mode>
    <name>${name}</name>
    <jar>my_pyspark_job.py</jar>
</spark>
What I did initially was copy the Python script I wanted to run as a spark-submit job to HDFS. It turns out that it expects the .py script on the local file system, so what I did was refer to the absolute local file system path of my script:
<jar>/<absolute-local-path>/my_pyspark_job.py</jar>
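One way to satisfy that requirement is to pull the script down from HDFS onto the local file system first and point the <jar> element at that copy. The paths below are hypothetical examples, not the ones from my setup:

```shell
# Example paths only: copy the script from HDFS to the local file system,
# then reference this absolute local path in the <jar> element.
hadoop fs -get /user/me/sparkActionPython/my_pyspark_job.py \
  /home/me/jobs/my_pyspark_job.py
```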
We were getting the same error. If you drop the spark-assembly jar from '/path/to/spark-install/lib/spark-assembly*.jar' (the exact location depends on your distribution) into your oozie.wf.application.path/lib directory alongside your application jar, it should work.
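Copying the assembly jar into the workflow's lib/ directory is a single HDFS command. The sketch below uses made-up paths; substitute your distribution's Spark lib directory and your own workflow application path:

```shell
# Example paths only; adjust for your distribution and workflow location.
# The lib/ dir under the workflow app path is picked up by Oozie automatically.
hadoop fs -put /path/to/spark-install/lib/spark-assembly*.jar \
  /user/me/my-oozie-app/lib/
```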