Automate data processing workflow with Makefile on cluster
Recently I automated a few data processing workflows using Makefile. One complication is that my jobs run through a job control system without blocking calls. This means

- running `make` once is not enough, since it exits at the first non-blocking call
- running `make` repeatedly is wasteful, since any unfinished job will be re-submitted

Employing sentinel files is my solution. Overall it works, and takes little time to write. The maintenance cost is yet to be seen.
Figure 1: A generic workflow. Each node is a file, and the arrows originating from (or terminating at) the same node denote the same script call (same job).
pattern 1: output is well defined
This is the simplest situation, for example the `a->b` or `c->d` layers in Figure 1. In this case, we can use an empty target as a sentinel: only run the job if `output` doesn't exist. The empty `output` signals either a running job or a failed job, preventing re-submission.
```make
output : input
	[ -f $@ -o ! -s $< ] || { run $< && touch $@; }
```

Note that `-s $<` checks that `input` is not merely a sentinel (i.e. not an empty file); omit it if `input` is guaranteed to exist.
Equivalently, one can use a shell `if` statement:

```make
output : input
	if [ ! -f $@ -a -s $< ]; then \
		run $< && touch $@; fi
```
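To see pattern 1 in action, here is a minimal runnable sketch. `echo` stands in for the post's hypothetical non-blocking `run` submission command, and the Makefile is written with `printf` so the recipe line gets a real tab:

```shell
set -e
dir=$(mktemp -d); cd "$dir"

# pattern 1 rule, with `echo` standing in for the job submission
{
  printf 'output : input\n'
  printf '\t@[ -f $@ -o ! -s $< ] || { echo "submit $<" && touch $@; }\n'
} > Makefile

printf 'data\n' > input   # a real, non-empty input
make output               # submits the job and creates the empty sentinel
make output               # sentinel exists: nothing is re-submitted
```

The second `make` sees the (empty but up-to-date) `output` and skips the recipe, which is exactly the sentinel behavior described above.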
If the script gives rise to multiple well-defined output files, for example the `f->g` layer in Figure 1 (1 input to 2 outputs), we can use one of them as `output`. This shortcut ignores the edge case where the job dies right after generating `output`. One workaround is to use the last file written to disk as `output`, if possible. Otherwise use pattern 2 in the next section.
A related trick for pattern 1 is the Makefile pattern rule:

```make
%output : %input
```

so we don't need to specify the dependencies explicitly. Make sure to declare the matched files `.PRECIOUS` so that make doesn't delete them as intermediate files; see Chains of Implicit Rules in the GNU make manual.
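A runnable sketch of the pattern-rule variant, assuming files named like `a.input`/`a.output` (my naming, not from the post), with `echo` again standing in for the job submission:

```shell
set -e
dir=$(mktemp -d); cd "$dir"

# one generic rule covers every *.input -> *.output pair;
# .PRECIOUS keeps make from deleting the sentinels as intermediate files
{
  printf '.PRECIOUS : %%.output\n'
  printf '%%.output : %%.input\n'
  printf '\t@[ -f $@ -o ! -s $< ] || { echo "submit $<" && touch $@; }\n'
} > Makefile

printf 'data\n' > a.input
printf 'data\n' > b.input
make a.output b.output    # submits both jobs, once each
```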
The `find` command (invoked via the `shell` function) is useful to distinguish finished targets from sentinels:

```make
done:=$(shell find . -mindepth 1 -maxdepth 1 -size +0 -name '*_my_pattern' -type f)
```
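For example, with one finished (non-empty) target and one empty sentinel on disk, the `-size +0` test keeps only the finished one (file names here are hypothetical):

```shell
set -e
dir=$(mktemp -d); cd "$dir"
printf 'real data\n' > finished_my_pattern   # a finished target
touch running_my_pattern                     # an empty sentinel
# -size +0 matches only files larger than zero bytes, i.e. finished targets
find . -mindepth 1 -maxdepth 1 -size +0 -name '*_my_pattern' -type f
# prints ./finished_my_pattern
```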
pattern 2: output is not well defined
Sometimes the number of output files is not pre-determined, see the `d->e` layer in Figure 1 for example. In this case, we can employ 2 sentinels: an `input.running` created at job submission, and an `input.done` created at job completion (the `run $<` script writes it to disk).
```make
input.done : input
	[ -f $@ -o -f $<.running -o ! -s $< ] || { run $< && touch $<.running; }
```
Suppose there is a series of such inputs; it may be useful to know if all of them have finished:

```make
inputs_done:=$(wildcard *.done)
ifeq ($(words $(inputs)),$(words $(inputs_done)))
...
endif
```
We can further figure out which ones are not done:

```make
inputs_done_expect:=$(addsuffix .done,$(inputs))
not_done:=$(filter-out $(inputs_done),$(inputs_done_expect))
```
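Putting the two checks together in a runnable sketch (the input list `a b c` and the `status` target are hypothetical):

```shell
set -e
dir=$(mktemp -d); cd "$dir"
{
  printf 'inputs:=a b c\n'
  printf 'inputs_done:=$(wildcard *.done)\n'
  printf 'inputs_done_expect:=$(addsuffix .done,$(inputs))\n'
  printf 'not_done:=$(filter-out $(inputs_done),$(inputs_done_expect))\n'
  printf 'status:\n'
  printf 'ifeq ($(words $(inputs)),$(words $(inputs_done)))\n'
  printf '\t@echo all done\n'
  printf 'else\n'
  printf '\t@echo waiting for: $(not_done)\n'
  printf 'endif\n'
} > Makefile
touch a.done b.done   # c has not finished yet
make status           # prints "waiting for: c.done"
touch c.done
make status           # prints "all done"
```

Note that `wildcard` is evaluated when the Makefile is read, so each `make` invocation sees the current set of `.done` files.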
A sentinel for job completion is useful even when the output is well defined. Suppose the output file is large and takes time to write to disk, or is written in chunks; then a non-zero size doesn't guarantee job completion.
pattern 3: untouchable target
Sometimes the output is well-defined but it cannot be touched, making pattern 1 unsuitable. For example, some of my scripts use a directory as their output, and they don’t run if the directory pre-exists. In this case, we can use a running sentinel:
```make
output : input
	if [ ! -f $@.running -a -s $< ]; then \
		run $< && touch $@.running; fi
```
It can be tricky to know whether the job truly completes in this case. If in doubt, use pattern 2.
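A sketch of pattern 3, with a hypothetical directory target `outdir` that the job itself would create, and `echo` standing in for `run`:

```shell
set -e
dir=$(mktemp -d); cd "$dir"
{
  printf 'outdir : input\n'
  printf '\t@if [ ! -f $@.running -a -s $< ]; then \\\n'
  printf '\techo "submit $<" && touch $@.running; fi\n'
} > Makefile
printf 'data\n' > input
make outdir    # submits once and creates outdir.running
make outdir    # sentinel exists: no re-submission, outdir stays untouched
```

The target itself is never touched, so a pre-existing directory can't break the job; only the `.running` sentinel records the submission.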
pattern 4: waiting for all prerequisites
When multiple prerequisites exist, for example the `e->f` or `g->h` layers in Figure 1, we need to wait for all of them to be ready. Suppose `$(inputs)` contains all the prerequisites; we must ensure that none of them is a sentinel:

```make
output : $(inputs)
	if (for x in $^; do [ -s $$x ] || exit 1; done); then \
		run $< && touch $@; fi
```
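A runnable sketch of this rule with two hypothetical prerequisites `in1` and `in2`; `echo` again stands in for `run`:

```shell
set -e
dir=$(mktemp -d); cd "$dir"
{
  printf 'inputs:=in1 in2\n'
  printf 'output : $(inputs)\n'
  printf '\t@if (for x in $^; do [ -s $$x ] || exit 1; done); then \\\n'
  printf '\techo "submit all" && touch $@; fi\n'
} > Makefile
printf 'ready\n' > in1
touch in2      # in2 is still an empty sentinel
make output    # one prerequisite not ready: nothing is submitted
printf 'ready\n' > in2
make output    # all prerequisites ready: submits and creates the sentinel
```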
However, this rule only works if `inputs` is statically determined. If the prerequisites are dynamically determined, say from

```make
inputs:=$(wildcard *_my_pattern.abc)
```

an extra check is needed. The minimal check is

```make
ifdef inputs
...
endif
```

which ensures `$(inputs)` is not empty. A better check should examine the completion of all the scripts that generate the prerequisites; see the examples in the pattern 2 section.
clean up
To remove all the sentinels, use

```make
clean:
	find . -empty -delete
```

which deletes all empty files and directories in all subdirectories. This triggers a rerun for pattern 2 rules. For rules following patterns 1 and 3, we need to clean up (`rm -rf`) the targets as well.
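For instance (with hypothetical file names), `-delete` processes children before parents, so a directory that becomes empty is removed too:

```shell
set -e
dir=$(mktemp -d); cd "$dir"
mkdir sub
touch sentinel.running sub/other.running   # empty sentinels
printf 'data\n' > result                   # a real, finished target
find . -empty -delete   # removes the empty files and the now-empty sub/
ls                      # prints "result"
```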
appendix
- A sentinel file without a running job indicates a failed job. To save human debugging effort, we must also create a `make` phony target to display the job/workflow status.
- If all jobs can run in blocking mode, all the sentinel tricks in this post are unnecessary, and it becomes trivial to hook up the workflow.