Question
This is my second attempt at understanding how to pass arguments to dependencies in Luigi. The first one was here.
The idea is: I have TaskC, which depends on TaskB, which depends on TaskA, which depends on Task0. I want this whole sequence to always be exactly the same, except that I want to be able to control which file Task0 reads from; let's call it path. Luigi's philosophy is normally that each task should only know about the tasks it depends on, and their parameters. The problem with this is that TaskC, TaskB, and TaskA would all have to accept the variable path for the sole purpose of passing it on to Task0.
The solution that Luigi provides for this is called Configuration Classes.
Here's some example code:
    from pathlib import Path
    import luigi
    from luigi import Task, TaskParameter, IntParameter, LocalTarget, Parameter

    class config(luigi.Config):
        path = Parameter(default="defaultpath.txt")

    class Task0(Task):
        path = Parameter(default=config.path)
        arg = IntParameter(default=0)
        def run(self):
            print(f"READING FROM {self.path}")
            Path(self.output().path).touch()
        def output(self): return LocalTarget(f"task0{self.arg}.txt")

    class TaskA(Task):
        arg = IntParameter(default=0)
        def requires(self): return Task0(arg=self.arg)
        def run(self): Path(self.output().path).touch()
        def output(self): return LocalTarget(f"taskA{self.arg}.txt")

    class TaskB(Task):
        arg = IntParameter(default=0)
        def requires(self): return TaskA(arg=self.arg)
        def run(self): Path(self.output().path).touch()
        def output(self): return LocalTarget(f"taskB{self.arg}.txt")

    class TaskC(Task):
        arg = IntParameter(default=0)
        def requires(self): return TaskB(arg=self.arg)
        def run(self): Path(self.output().path).touch()
        def output(self): return LocalTarget(f"taskC{self.arg}.txt")
(Ignore all the output and run stuff. They're just there so the example runs successfully.)
The point of the above example is controlling the line print(f"READING FROM {self.path}") without having tasks A, B, C depend on path.
Indeed, with Configuration Classes I can control the Task0 argument. If Task0 is not passed a path parameter, it takes its default value, which is config().path.
My problem now is that this appears to work only at "build time", when the interpreter first loads the code, but not at run time (the details aren't clear to me).
So neither of these works:
A)
    if __name__ == "__main__":
        for i in range(3):
            config.path = f"newpath_{i}"
            luigi.build([TaskC(arg=i)], log_level="INFO")

    ===== Luigi Execution Summary =====
    Scheduled 4 tasks of which:
    * 4 ran successfully:
        - 1 Task0(path=defaultpath.txt, arg=2)
        - 1 TaskA(arg=2)
        - 1 TaskB(arg=2)
        - 1 TaskC(arg=2)
    This progress looks :) because there were no failed tasks or missing dependencies
    ===== Luigi Execution Summary =====
I'm not sure why this doesn't work.
B)
    if __name__ == "__main__":
        for i in range(3):
            luigi.build([TaskC(arg=i), config(path=f"newpath_{i}")], log_level="INFO")

    ===== Luigi Execution Summary =====
    Scheduled 5 tasks of which:
    * 5 ran successfully:
        - 1 Task0(path=defaultpath.txt, arg=2)
        - 1 TaskA(arg=2)
        - 1 TaskB(arg=2)
        - 1 TaskC(arg=2)
        - 1 config(path=newpath_2)
    This progress looks :) because there were no failed tasks or missing dependencies
    ===== Luigi Execution Summary =====
This actually makes sense. There are two config instances, and I only managed to change the path of one of them.
Help?
EDIT: Of course, having path reference a global variable works, but then it's not a Parameter in the usual Luigi sense.
EDIT2: I tried point 1) of the answer below.
config has the same definition:

    class config(luigi.Config):
        path = Parameter(default="defaultpath.txt")

I fixed the mistake that was pointed out, i.e. Task0 is now:

    class Task0(Task):
        path = Parameter(default=config().path)
        arg = IntParameter(default=0)
        def run(self):
            print(f"READING FROM {self.path}")
            Path(self.output().path).touch()
        def output(self): return LocalTarget(f"task0{self.arg}.txt")

and finally I did:

    if __name__ == "__main__":
        for i in range(3):
            config.path = Parameter(f"file_{i}")
            luigi.build([TaskC(arg=i)], log_level="WARNING")

This doesn't work: Task0 still gets path="defaultpath.txt".
Answer 1:
So what you're trying to do is create tasks with params without passing these params to the parent class. That is completely understandable, and I have been annoyed at times in trying to handle this.
Firstly, you are using the config class incorrectly. When using a Config class, as noted in https://luigi.readthedocs.io/en/stable/configuration.html#configuration-classes, you need to instantiate the object. So, instead of:

    class Task0(Task):
        path = Parameter(default=config.path)
        ...

you would use:

    class Task0(Task):
        path = Parameter(default=config().path)
        ...
While this now ensures you are using a value and not a Parameter object, it still does not solve your problem. config().path is evaluated once, when the class Task0 is created, so path is not bound to a reference to config().path but to the value it had at that moment (which will always be defaultpath.txt). When using the class in the correct manner, Luigi will construct a Task object with only luigi.Parameter attributes as the attribute names on the new instance, as seen here: https://github.com/spotify/luigi/blob/master/luigi/task.py#L436
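The class-creation-time evaluation described above can be reproduced without Luigi. The following is a minimal sketch in plain Python; FakeParameter, Settings, and Leaf are hypothetical stand-ins invented for this illustration, not Luigi classes.

```python
class FakeParameter:
    """Hypothetical stand-in for a parameter object: it only stores a default."""

    def __init__(self, default):
        self.default = default


class Settings:
    """Hypothetical stand-in for a config holder."""

    path = "defaultpath.txt"


class Leaf:
    # Settings.path is read once, while this class body executes.
    path = FakeParameter(default=Settings.path)


# Changing the setting afterwards does not touch the default captured above.
Settings.path = "newpath_0"
print(Leaf.path.default)  # prints "defaultpath.txt"
```

The default stored on Leaf is a plain string copied at definition time, which is why reassigning the attribute on the config class later has no effect on the task.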
So, I see a few possible paths forward.
1.) The first is to set the config path at runtime like you had, except set it to be a Parameter object like this:

    config.path = luigi.Parameter(f"newpath_{i}")

However, it would take a lot of work to get your tasks that use config.path working, as they would now need to take in their parameters differently (the defaults can't be evaluated when the class is created).
2.) The much easier way is to simply specify the arguments for your classes in the config file. If you look at https://github.com/spotify/luigi/blob/master/luigi/task.py#L825, you'll see that the Config class in Luigi is actually just a Task class, so you can do anything with it that you could do with a Task, and vice versa. Therefore, you could just have this in your config file:

    [Task0]
    path = newpath_1
    ...
3.) But, since you seem to want to run multiple tasks with different arguments for each, I would just recommend passing in args through the parents, as Luigi encourages you to do. Then you could run everything with:

    luigi.build([TaskC(arg=i) for i in range(3)])
4.) Finally, if you really need to get rid of passing dependencies, you can create a ParameterizedTaskParameter that extends luigi.ObjectParameter and uses the pickle of a task instance as the object.
Of the above solutions, I highly suggest either 2 or 3. 1 would be difficult to program around, and 4 would create some very ugly parameters and is a bit more advanced.
Edit: Solutions 1 and 2 are more hacks than anything, and the recommended approach is simply to bundle the parameters in a DictParameter.
Source: https://stackoverflow.com/questions/64958830/luigi-how-to-pass-different-arguments-to-leaf-tasks