Oozie job won't run if using PySpark in SparkAction

陌清茗 2021-02-11 09:21

I've encountered several examples of SparkAction jobs in Oozie, and most of them are in Java. I edited one a little and ran the example in Cloudera CDH Quickstart 5.4.0 (with Spark v

4 Answers
  •  遥遥无期
    2021-02-11 09:51

You should try configuring the Oozie Spark action to bring the needed files locally. You can do that with a `<file>` tag:

    
    ```xml
    <spark xmlns="uri:oozie:spark-action:0.1">
        <job-tracker>${resourceManager}</job-tracker>
        <name-node>${nameNode}</name-node>
        <master>local[2]</master>
        <mode>client</mode>
        <name>${name}</name>
        <jar>my_pyspark_job.py</jar>
        <file>{path to your file on hdfs}/my_pyspark_job.py#my_pyspark_job.py</file>
    </spark>
    ```

    Explanation: an Oozie action runs inside a YARN container, which YARN allocates on a node that has available resources. Before running the action (which is actually "driver" code), YARN copies all needed files (jars, for example) locally to that node, into the folder allocated for the container's resources. So by adding a `<file>` tag to the Oozie action you "tell" it to bring `my_pyspark_job.py` locally to the node of execution.
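    The `source#link_name` convention in the `<file>` value can be sketched with a small helper (this is a hypothetical illustration of how the Hadoop distributed cache names the localized copy, not part of the Oozie API; `localized_name` is my own name):

    ```python
    def localized_name(file_entry):
        """Split an Oozie <file> value of the form ``hdfs_path#link_name``.

        Returns (source_path, link_name). When no ``#`` fragment is given,
        the link name defaults to the source file's base name, which is how
        the localized copy ends up addressable by a plain relative name.
        """
        source, sep, link = file_entry.partition("#")
        if not sep:
            # No explicit fragment: the local name is just the base name.
            link = source.rsplit("/", 1)[-1]
        return source, link
    ```

    So `{hdfs path}/my_pyspark_job.py#my_pyspark_job.py` is localized as plain `my_pyspark_job.py` in the container's working directory, which is why the `<jar>` element can refer to it by that relative name.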

    In my case I want to run a bash script (`run-hive-partitioner.bash`) which in turn runs a Python script (`hive-generic-partitioner.py`), so I need all of the files locally accessible on the node:

    
      
    ```xml
    <shell xmlns="uri:oozie:shell-action:0.1">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <exec>${appPath}/run-hive-partitioner.bash</exec>
        <argument>${db}</argument>
        <argument>${tables}</argument>
        <argument>${base_working_dir}</argument>
        <file>${appPath}/run-hive-partitioner.bash#run-hive-partitioner.bash</file>
        <file>${appPath}/hive-generic-partitioner.py#hive-generic-partitioner.py</file>
        <file>${appPath}/util.py#util.py</file>
    </shell>
    ```

    where `${appPath}` is `hdfs://ci-base.com:8020/app/oozie/util/wf-repair_hive_partitions`,

    so this is what I get in my job:

    ```
    Files in current dir:/hadoop/yarn/local/usercache/hdfs/appcache/application_1440506439954_3906/container_1440506439954_3906_01_000002/
    ======================
    File: hive-generic-partitioner.py
    File: util.py
    File: run-hive-partitioner.bash
    ...
    File: job.xml
    File: json-simple-1.1.jar
    File: oozie-sharelib-oozie-4.1.0.2.2.4.2-2.jar
    File: launch_container.sh
    File: oozie-hadoop-utils-2.6.0.2.2.4.2-2.oozie-4.1.0.2.2.4.2-2.jar
    ```

    As you can see, Oozie (or actually YARN, I think) shipped all the needed files locally to the container's temp folder, and now it's able to run them.
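    Because everything is localized into the container's working directory, the scripts can rely on plain relative names. A minimal sketch of that check, as the top of `hive-generic-partitioner.py` might do it (`missing_localized_files` is my own hypothetical helper, not anything Oozie provides):

    ```python
    import os

    def missing_localized_files(required, workdir="."):
        """Return the subset of `required` file names not present in `workdir`.

        After YARN localizes the <file> entries, each one appears under its
        link name directly in the container's working directory, so a check
        against bare relative names is sufficient.
        """
        return [name for name in required
                if not os.path.exists(os.path.join(workdir, name))]
    ```

    Calling `missing_localized_files(["util.py", "run-hive-partitioner.bash"])` inside the container should return an empty list when the `<file>` tags are set up correctly, and naming whatever was forgotten otherwise.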
