Running Hadoop jar using Luigi python

谁都会走 提交于 2019-12-10 15:21:09

问题


I need to run a Hadoop jar job using Luigi from python. I searched and found examples of writing mapper and reducer in Luigi but nothing to directly run a Hadoop jar.

I need to run a Hadoop jar compiled directly. How can I do it?


回答1:


You need to use the luigi.contrib.hadoop_jar package (code).

In particular, you need to extend HadoopJarJobTask. For example, like that:

from luigi.contrib.hadoop_jar import HadoopJarJobTask
from luigi.contrib.hdfs.target import HdfsTarget

class TextExtractorTask(HadoopJarJobTask):
    def output(self):
        return HdfsTarget('data/processed/')

    def jar(self):
        return 'jobfile.jar'

    def main(self):
        return 'com.ololo.HadoopJob'

    def args(self):
        return ['--param1', '1', '--param2', '2']

You can also include building a jar file with maven to the workflow:

import luigi
from luigi.contrib.hadoop_jar import HadoopJarJobTask
from luigi.contrib.hdfs.target import HdfsTarget
from luigi.file import LocalTarget

import subprocess
import os

class BuildJobTask(luigi.Task):
    def output(self):
        return LocalTarget('target/jobfile.jar')

    def run(self):
        subprocess.call(['mvn', 'clean', 'package', '-DskipTests'])

class YourHadoopTask(HadoopJarJobTask):
    def output(self):
        return HdfsTarget('data/processed/')

    def jar(self):
        return self.input().fn

    def main(self):
        return 'com.ololo.HadoopJob'

    def args(self):
        return ['--param1', '1', '--param2', '2']

    def requires(self):
        return BuildJobTask()


来源:https://stackoverflow.com/questions/29801968/running-hadoop-jar-using-luigi-python

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!