ducttape sometimes-skip task: cross-product error

Submitted by 大憨熊 on 2019-12-11 03:00:32


I'm trying a variant of sometimes-skip tasks for ducttape, based on the tutorial here: http://nschneid.github.io/ducttape-crash-course/tutorial5.html

([ducttape][1] is a Bash/Scala based workflow management tool.)

I'm trying to do a cross-product to execute task1 on "clean" data and "dirty" data. The idea is to traverse the same path, but without preprocessing in some cases. To do this, I need to do a cross-product of tasks.

task cleanup < in=(Dirty: a=data/a b=data/b) > out {
    prefix=$(cat $in)
    echo "$prefix-clean" > $out
}

global {
    data=(Data: dirty=(Dirty: a=data/a b=data/b) clean=(Clean: a=$out@cleanup b=$out@cleanup))
}

task task1 < in=$data > out {
    cat $in > $out
}

plan FinalTasks {
    reach task1 via (Dirty: *) * (Data: *) * (Clean: *)
}

Here is the execution plan. I would expect 6 tasks, but I have two duplicate tasks being executed.

$ ducttape skip.tape
ducttape 0.3
by Jonathan Clark
Loading workflow version history...
Have 7 previous workflow versions
Finding hyperpaths contained in plan...
Found 8 vertices implied by realization plan FinalTasks
Union of all planned vertices has size 8
Checking for completed tasks from versions 1 through 7...
Finding packages...
Found 0 packages
Checking for already built packages (if this takes a long time, consider switching to a local-disk git clone instead of a remote repository)...
Checking inputs...
Work plan (depth-first traversal):
RUN: /nfsmnt/hltfs0/data/nicruiz/slt/IWSLT13/analysis/workflow/tmp/./cleanup/Baseline.baseline (Dirty.a)
RUN: /nfsmnt/hltfs0/data/nicruiz/slt/IWSLT13/analysis/workflow/tmp/./cleanup/Dirty.b (Dirty.b)
RUN: /nfsmnt/hltfs0/data/nicruiz/slt/IWSLT13/analysis/workflow/tmp/./task1/Baseline.baseline (Data.dirty+Dirty.a)
RUN: /nfsmnt/hltfs0/data/nicruiz/slt/IWSLT13/analysis/workflow/tmp/./task1/Dirty.b (Data.dirty+Dirty.b)
RUN: /nfsmnt/hltfs0/data/nicruiz/slt/IWSLT13/analysis/workflow/tmp/./task1/Clean.b+Data.clean+Dirty.b (Clean.b+Data.clean+Dirty.b)
RUN: /nfsmnt/hltfs0/data/nicruiz/slt/IWSLT13/analysis/workflow/tmp/./task1/Data.clean+Dirty.b (Clean.a+Data.clean+Dirty.b)
RUN: /nfsmnt/hltfs0/data/nicruiz/slt/IWSLT13/analysis/workflow/tmp/./task1/Data.clean (Clean.a+Data.clean+Dirty.a)
RUN: /nfsmnt/hltfs0/data/nicruiz/slt/IWSLT13/analysis/workflow/tmp/./task1/Clean.b+Data.clean (Clean.b+Data.clean+Dirty.a)
Are you sure you want to run these 8 tasks? [y/n] 

Removing the symlinks from the output below, my duplicates are here:

$ head task1/*/out
==> Baseline.baseline/out <==

==> Clean.b+Data.clean/out <==
==> Data.clean/out <==

==> Clean.b+Data.clean+Dirty.b/out <==
==> Data.clean+Dirty.b/out <==

==> Dirty.b/out <==

Could someone with experience with ducttape assist me in finding my cross-product problem?

  [1]: https://github.com/jhclark/ducttape


So why do we have 4 realizations involving the branch point Clean at task1 instead of just two?

The answer is that in ducttape, branch points are always propagated through all transitive dependencies of a task. So the branch point "Dirty" from the task "cleanup" is propagated through clean=(Clean: a=$out@cleanup b=$out@cleanup). At this point the variable "clean" contains the cross product of the original "Dirty" and the newly introduced "Clean" branch point: 2 x 2 = 4 realizations, where only 2 were intended.
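To make the counting concrete, here is a toy model of that propagation in Python. It is only a sketch of the idea, not ducttape's actual implementation: the `branch` helper and the representation of a realization as a set of (branch point, branch) pairs are assumptions made up for this illustration.

```python
# Toy model (NOT ducttape internals): a realization is a frozenset of
# (branch_point, branch) pairs inherited from everything upstream.

def branch(name, options):
    """Build the realizations of a branch point.

    options maps each branch name to the list of upstream realizations
    that feed it; every upstream realization is carried along.
    """
    out = set()
    for b, upstream in options.items():
        for r in upstream:
            out.add(frozenset(r | {(name, b)}))
    return out

LITERAL = [frozenset()]  # a plain file path carries no branch points

# in=(Dirty: a=data/a b=data/b)
dirty = branch("Dirty", {"a": LITERAL, "b": LITERAL})

# cleanup runs once per realization of its input, so $out@cleanup
# still carries the Dirty branch point.
cleanup_out = dirty

# clean=(Clean: a=$out@cleanup b=$out@cleanup): both branches inherit Dirty.
clean = branch("Clean", {"a": cleanup_out, "b": cleanup_out})

# data=(Data: dirty=... clean=...)
data = branch("Data", {"dirty": dirty, "clean": clean})

print(len(cleanup_out), len(clean), len(data))  # prints: 2 4 6
```

Under this model, cleanup has 2 realizations and task1 has 6, which adds up to the 8 tasks in the work plan from the question. The 4 "clean" realizations are where the duplicates come from, since Clean.a and Clean.b point at the same cleanup output.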

The minimal change is to replace

clean=(Clean: a=$out@cleanup b=$out@cleanup)

with

clean=$out@cleanup

This would give you the desired number of realizations, but it's a bit confusing to use the branch point name "Dirty" just to control which input data set you're using -- with only this minimal change, the two realizations of the task "cleanup" would be (Dirty: a b).

It may make your workflow even more grokkable to refactor it like this:

global {
    raw_data=(DataSet: a=data/a b=data/b)
}

task cleanup < in=$raw_data > out {
    prefix=$(cat $in)
    echo "$prefix-clean" > $out
}

global {
    ready_data=(DoCleanup: no=$raw_data yes=$out@cleanup)
}

task task1 < in=$ready_data > out {
    cat $in > $out
}

plan FinalTasks {
    reach task1 via (DataSet: *) * (DoCleanup: *)
}
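
As a quick sanity check on the refactored workflow, the expected realizations can be enumerated as a plain cross product of the two independent branch points. This is a hedged sketch in Python, not ducttape output: the `realizations` helper is hypothetical, and the joined names are illustrative only (ducttape's real directory names differ, for example it shortens baseline branches).

```python
from itertools import product

# Branch points as declared in the refactored workflow.
branch_points = {"DataSet": ["a", "b"], "DoCleanup": ["no", "yes"]}

def realizations(names):
    """Cross product of the given branch points, one label per combination."""
    values = [[f"{n}.{b}" for b in branch_points[n]] for n in names]
    return ["+".join(combo) for combo in product(*values)]

# cleanup only sees DataSet; task1 sees DataSet (via either branch of
# ready_data) and DoCleanup.
cleanup = realizations(["DataSet"])
task1 = realizations(["DataSet", "DoCleanup"])

print(cleanup)
print(task1)
print(len(cleanup) + len(task1))  # prints: 6
```

That is 2 cleanup tasks plus 4 task1 tasks, the 6 tasks the question expected, with no branch point counted twice.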

