How to reset luigi task status?

清歌不尽 2021-02-19 20:36

Currently, I have a bunch of Luigi tasks queued together, with a simple dependency chain (a -> b -> c -> d). d gets executed first, and

4 answers
  • 2021-02-19 21:04

    I typically do this by overriding complete():

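    import luigi


    class BaseTask(luigi.Task):

        # When True, complete() deletes any existing output so the task reruns.
        force = luigi.BoolParameter()

        def complete(self):
            outputs = luigi.task.flatten(self.output())
            for output in outputs:
                # Side effect: with force set, existing outputs are removed here.
                if self.force and output.exists():
                    output.remove()
            return all(output.exists() for output in outputs)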
    
    
    class MyTask(BaseTask):
        def output(self):
            return luigi.LocalTarget("path/to/done/file.txt")
    
        def run(self):
            with self.output().open('w') as out_file:
                out_file.write('Complete')
    

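    Running the task creates the output file as expected. A new instance created with force=True leaves the file in place until complete() is called; that call deletes the file and returns False, so the task counts as not done and will run again. Note that complete() deletes the output on every call while force=True, not only on the first one.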

    task = MyTask()
    task.run()
    task.complete()
    # True
    
    new_task = MyTask(force=True)
    new_task.output().exists()
    # True
    new_task.complete()
    # False
    new_task.output().exists()
    # False
    
  • 2021-02-19 21:08

    First, a comment: Luigi tasks are meant to be idempotent. If you run a task with the same parameter values, it must always produce the same outputs, no matter how many times you run it, so running it more than once gains nothing. This is what makes Luigi powerful. If one big task does many things, takes a long time, and fails somewhere, you have to run it again from the beginning. If you split it into smaller tasks and one of them fails, you only have to rerun the tasks that have not completed yet.

    When you run a task, Luigi checks whether that task's outputs exist. If they don't, Luigi checks the outputs of the tasks it depends on. If those exist, Luigi runs only the current task and generates its output Target. If the dependencies' outputs don't exist, Luigi runs those tasks as well.
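The check described above can be modelled in a few lines. This is a simplified sketch of the decision, not Luigi's actual scheduler code: `tasks_to_run` is a hypothetical helper, and it relies only on the `complete()` and `deps()` methods that Luigi tasks expose, so it works on any object that has those two methods.

```python
def tasks_to_run(task, _seen=None):
    """Return the tasks that would need to run, dependencies first.

    Simplified model (an assumption, not Luigi's scheduler code): a task
    whose complete() returns True is skipped, and its dependencies are
    not inspected at all.
    """
    seen = set() if _seen is None else _seen
    if id(task) in seen:
        return []
    seen.add(id(task))
    if task.complete():
        return []
    order = []
    for dep in task.deps():
        # Incomplete dependencies have to run before the task itself.
        order.extend(tasks_to_run(dep, seen))
    order.append(task)
    return order
```

Called on a task whose outputs already exist, the helper returns an empty list, which is why rerunning a finished task does nothing.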

    So, if you want to rerun a task, you must delete its output Targets. And if you want to rerun the whole pipeline, you must delete, in cascade, the outputs of every task it depends on.

    There's an ongoing discussion in this issue in the Luigi repository. Take a look at this comment, since it points to some scripts for getting the output targets of a given task and removing them.
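The scripts mentioned there are not reproduced here. As a rough sketch of the same idea, the hypothetical helper below walks a task and its dependencies and removes every output that exists. It assumes each target offers `exists()` and `remove()`, as `luigi.LocalTarget` does; other target types may not support removal. `_flatten` is a small stand-in for `luigi.task.flatten`, so the block needs no imports.

```python
def _flatten(struct):
    """Flatten None, a single target, or nested lists/tuples/dicts of targets."""
    if struct is None:
        return []
    if isinstance(struct, dict):
        struct = list(struct.values())
    if isinstance(struct, (list, tuple)):
        return [item for part in struct for item in _flatten(part)]
    return [struct]


def remove_outputs(task, cascade=True, _seen=None):
    """Delete the existing outputs of task and, if cascade, of its dependencies.

    Returns the targets that were removed. Hypothetical helper, not part
    of Luigi.
    """
    seen = set() if _seen is None else _seen
    if id(task) in seen:
        return []
    seen.add(id(task))
    removed = []
    for target in _flatten(task.output()):
        if target.exists():
            target.remove()
            removed.append(target)
    if cascade:
        for dep in task.deps():
            removed.extend(remove_outputs(dep, cascade, seen))
    return removed
```

Pass `cascade=False` to reset a single task. Be careful with dependencies that represent external input data (for example a `luigi.ExternalTask`), because this sketch would delete those files as well.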

  • 2021-02-19 21:09

    d6tflow lets you reset tasks and force them to rerun; see the details at https://d6tflow.readthedocs.io/en/latest/control.html#manually-forcing-task-reset-and-rerun.

    # force execution including downstream tasks
    d6tflow.run([TaskTrain()],force=[TaskGetData()])
    
    # reset single task
    TaskGetData().invalidate()
    
    # reset all downstream task output
    d6tflow.invalidate_downstream(TaskGetData(), TaskTrain())
    
    # reset all upstream task input
    d6tflow.invalidate_upstream(TaskTrain())
    

    Caveat: this only works for d6tflow tasks and targets, which are modified local targets, not for all Luigi targets. It should still take you a long way, and it is optimized for data science workflows. It works well with a local worker; I haven't tested it on a central server.

  • 2021-02-19 21:15

    I use this to forcibly regenerate output without needing to remove it first; it also lets you select which task families to regenerate. In our use case, we want the old generated files to continue to exist until they are rewritten with fresh versions.

    # generation.py
    import pathlib

    import luigi


    class ForcibleTask(luigi.Task):
        # Task families to regenerate even when their output already exists.
        force_task_families = luigi.ListParameter(
            positional=False, significant=False, default=[]
        )

        def complete(self):
            # Assumes output() returns a single local target with a .path.
            print("{}: check {}".format(self.get_task_family(), self.output().path))
            if not self.output().exists():
                self.oldinode = 0  # so any new file is considered complete
                return False
            curino = pathlib.Path(self.output().path).stat().st_ino
            # Remember the inode seen the first time an existing file is checked.
            if not hasattr(self, "oldinode"):
                self.oldinode = curino

            if self.get_task_family() in self.force_task_families:
                # only done when file has been overwritten with new file
                return self.oldinode != curino

            return self.output().exists()
    

    Example usage

    class Generate(ForcibleTask):
        date = luigi.DateParameter()
        def output(self):
            return luigi.LocalTarget(
                self.date.strftime("generated-%Y-%m-%d")
            )
    

    Invocation

    luigi --module generation Generate '--Generate-force-task-families=["Generate"]'
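The inode comparison in complete() only detects a rewrite if the output is replaced by a new file rather than modified in place. The standard-library sketch below illustrates that effect. It assumes the writer creates a temporary file and renames it over the target, which is how Luigi's LocalTarget is understood to behave when opened for writing; `replace_file` and `demo` are hypothetical names used only for this illustration.

```python
import os
import pathlib
import tempfile


def replace_file(path, text):
    """Write text to a temporary sibling file, then rename it over path."""
    path = pathlib.Path(path)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent)
    with os.fdopen(fd, "w") as tmp:
        tmp.write(text)
    # The old file stays readable until this rename swaps in the new one.
    os.replace(tmp_name, path)


def demo():
    """Return (inode before, inode after) for a file that gets replaced."""
    with tempfile.TemporaryDirectory() as workdir:
        target = pathlib.Path(workdir) / "generated.txt"
        target.write_text("old")
        before = target.stat().st_ino
        replace_file(target, "new")
        after = target.stat().st_ino
        return before, after
```

Because the old and the new file exist side by side until the rename, they cannot share an inode, so the two values returned by `demo()` differ. A writer that truncates and rewrites the existing file would keep the same inode, and the forced task would never be reported as complete.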
    