Automate data processing workflow with Makefile on cluster
Recently I automated a few data processing workflows using Makefile. One complication is that my jobs run through a job control system without blocking calls. This means

- running `make` once is not enough, since it exits at the first non-blocking call
- running `make` repeatedly is wasteful, since any unfinished job will be re-submitted

Employing sentinel files is my solution. Overall it works, and takes little time to write. The maintenance cost is yet to be seen.
Figure 1: A generic workflow. Each node is a file, and the arrows originating from (or terminating at) the same node denote the same script call (same job).
pattern 1: output is well defined
This is the simplest situation, for example the `a->b` or `c->d` layers in Figure 1. In this case, we can use an empty target as a sentinel: only run the job if `output` doesn't exist. The empty `output` signals either a running job or a failed job, preventing re-submission.
```make
output : input
	[ -f $@ -o ! -s $< ] || { run $< && touch $@; }
```

Note that `-s $<` checks that `input` is not merely a sentinel (i.e. not an empty file); omit it if `input` is guaranteed to exist.
Equivalently, one can use a shell `if` statement:

```make
output : input
	if [ ! -f $@ -a -s $< ]; then \
		run $< && touch $@; fi
```
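To see pattern 1 in action, here is a minimal runnable sketch. `echo` stands in for the post's hypothetical non-blocking `run` submission command, and the Makefile is written with `printf` so the recipe line gets a real tab:

```shell
set -e
dir=$(mktemp -d); cd "$dir"

# pattern 1 rule, with `echo` standing in for the job submission
{
  printf 'output : input\n'
  printf '\t@[ -f $@ -o ! -s $< ] || { echo "submit $<" && touch $@; }\n'
} > Makefile

printf 'data\n' > input   # a real, non-empty input
make output               # submits the job and creates the empty sentinel
make output               # sentinel exists: nothing is re-submitted
```

The second `make` sees the (empty but up-to-date) `output` and skips the recipe, which is exactly the sentinel behavior described above.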
If the script gives rise to multiple well-defined output files, for example the `f->g` layer in Figure 1 (1 input to 2 outputs), we can use one of them as `output`. This shortcut ignores the edge case where the job dies right after generating `output`. One workaround is to use the last file written to disk as `output`, if possible. Otherwise use pattern 2 in the next section.
A related trick for pattern 1 is the Makefile pattern rule:

```make
%output : %input
```

so we don't need to specify the dependencies explicitly. Make sure to declare the matched files `.PRECIOUS` so that make doesn't delete them as intermediate files; see Chains of Implicit Rules in the GNU make manual.
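A runnable sketch of the pattern-rule variant, assuming files named like `a.input`/`a.output` (my naming, not from the post), with `echo` again standing in for the job submission:

```shell
set -e
dir=$(mktemp -d); cd "$dir"

# one generic rule covers every *.input -> *.output pair;
# .PRECIOUS keeps make from deleting the sentinels as intermediate files
{
  printf '.PRECIOUS : %%.output\n'
  printf '%%.output : %%.input\n'
  printf '\t@[ -f $@ -o ! -s $< ] || { echo "submit $<" && touch $@; }\n'
} > Makefile

printf 'data\n' > a.input
printf 'data\n' > b.input
make a.output b.output    # submits both jobs, once each
```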
The `find` command (invoked via the `shell` function) is useful to distinguish finished targets from sentinels:

```make
done:=$(shell find . -mindepth 1 -maxdepth 1 -size +0 -name '*_my_pattern' -type f)
```
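For example, with one finished (non-empty) target and one empty sentinel on disk, the `-size +0` test keeps only the finished one (file names here are hypothetical):

```shell
set -e
dir=$(mktemp -d); cd "$dir"
printf 'real data\n' > finished_my_pattern   # a finished target
touch running_my_pattern                     # an empty sentinel
# -size +0 matches only files larger than zero bytes, i.e. finished targets
find . -mindepth 1 -maxdepth 1 -size +0 -name '*_my_pattern' -type f
# prints ./finished_my_pattern
```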
pattern 2: output is not well defined
Sometimes the number of output files is not pre-determined, see the `d->e` layer in Figure 1 for example. In this case, we can employ 2 sentinels: an `input.running` created at job submission, and an `input.done` created at job completion (the `run $<` script writes it to disk).
```make
input.done : input
	[ -f $@ -o -f $<.running -o ! -s $< ] || { run $< && touch $<.running; }
```
Suppose there is a series of such inputs; it may be useful to know if all of them have finished:

```make
inputs_done:=$(wildcard *.done)
ifeq ($(words $(inputs)),$(words $(inputs_done)))
...
endif
```
We can further figure out which ones are not done:

```make
inputs_done_expect:=$(addsuffix .done,$(inputs))
not_done:=$(filter-out $(inputs_done),$(inputs_done_expect))
```
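Putting the two checks together in a runnable sketch (the input list `a b c` and the `status` target are hypothetical):

```shell
set -e
dir=$(mktemp -d); cd "$dir"
{
  printf 'inputs:=a b c\n'
  printf 'inputs_done:=$(wildcard *.done)\n'
  printf 'inputs_done_expect:=$(addsuffix .done,$(inputs))\n'
  printf 'not_done:=$(filter-out $(inputs_done),$(inputs_done_expect))\n'
  printf 'status:\n'
  printf 'ifeq ($(words $(inputs)),$(words $(inputs_done)))\n'
  printf '\t@echo all done\n'
  printf 'else\n'
  printf '\t@echo waiting for: $(not_done)\n'
  printf 'endif\n'
} > Makefile
touch a.done b.done   # c has not finished yet
make status           # prints "waiting for: c.done"
touch c.done
make status           # prints "all done"
```

Note that `wildcard` is evaluated when the Makefile is read, so each `make` invocation sees the current set of `.done` files.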
A sentinel for job completion is useful even when the output is well defined. Suppose the output file is large and takes time to write to disk, or is written in chunks; then a non-zero size doesn't guarantee job completion.
pattern 3: untouchable target
Sometimes the output is well-defined but it cannot be touched, making pattern 1 unsuitable. For example, some of my scripts use a directory as their output, and they don’t run if the directory pre-exists. In this case, we can use a running sentinel:
```make
output : input
	if [ ! -f $@.running -a -s $< ]; then \
		run $< && touch $@.running; fi
```
It can be tricky to know whether the job truly completes in this case. If in doubt, use pattern 2.
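A sketch of pattern 3, with a hypothetical directory target `outdir` that the job itself would create, and `echo` standing in for `run`:

```shell
set -e
dir=$(mktemp -d); cd "$dir"
{
  printf 'outdir : input\n'
  printf '\t@if [ ! -f $@.running -a -s $< ]; then \\\n'
  printf '\techo "submit $<" && touch $@.running; fi\n'
} > Makefile
printf 'data\n' > input
make outdir    # submits once and creates outdir.running
make outdir    # sentinel exists: no re-submission, outdir stays untouched
```

The target itself is never touched, so a pre-existing directory can't break the job; only the `.running` sentinel records the submission.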
pattern 4: waiting for all prerequisites
When multiple prerequisites exist, for example the `e->f` or `g->h` layers in Figure 1, we need to wait for all of them to be ready. Suppose `$(inputs)` contains all the prerequisites; we must ensure that none of them is a sentinel:

```make
output : $(inputs)
	if (for x in $^; do [ -s $$x ] || exit 1; done); then \
		run $< && touch $@; fi
```
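A runnable sketch of this rule with two hypothetical prerequisites `in1` and `in2`; `echo` again stands in for `run`:

```shell
set -e
dir=$(mktemp -d); cd "$dir"
{
  printf 'inputs:=in1 in2\n'
  printf 'output : $(inputs)\n'
  printf '\t@if (for x in $^; do [ -s $$x ] || exit 1; done); then \\\n'
  printf '\techo "submit all" && touch $@; fi\n'
} > Makefile
printf 'ready\n' > in1
touch in2      # in2 is still an empty sentinel
make output    # one prerequisite not ready: nothing is submitted
printf 'ready\n' > in2
make output    # all prerequisites ready: submits and creates the sentinel
```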
However, this rule only works if `inputs` is statically determined. If the prerequisites are dynamically determined, say from

```make
inputs:=$(wildcard *_my_pattern.abc)
```

an extra check is needed. The minimal check is

```make
ifdef inputs
...
endif
```

which ensures `$(inputs)` is not empty. A better check should examine the completion of all the scripts that generate the prerequisites; see the examples in the pattern 2 section.
clean up
To remove all the sentinels, use

```make
clean:
	find . -empty -delete
```

which deletes all empty files and directories in all subdirectories. This triggers a rerun for pattern 2 rules. For rules following patterns 1 and 3, we need to clean up (`rm -rf`) the targets as well.
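For instance (with hypothetical file names), `-delete` processes children before parents, so a directory that becomes empty is removed too:

```shell
set -e
dir=$(mktemp -d); cd "$dir"
mkdir sub
touch sentinel.running sub/other.running   # empty sentinels
printf 'data\n' > result                   # a real, finished target
find . -empty -delete   # removes the empty files and the now-empty sub/
ls                      # prints "result"
```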
appendix
- A sentinel file without a running job indicates a failed job. To save human debugging effort, we must also create a `make` phony target to display the job/workflow status.
- If all jobs can run in blocking mode, all the sentinel tricks in this post are unnecessary, and it becomes trivial to hook up the workflow.