Abusing Make to Create Shell Pipelines on Steroids
Make is a well-known tool for transforming input files, like source code, into “targets,” like runnable programs, based on a set of rules defined in a Makefile. Typically, Make is used to transform a predefined set of inputs into a target. This post presents an alternative way of using (or, really, abusing) Make to define and execute arbitrarily long sequences of data transformations. We’ll do this by essentially interpreting Make targets as functions that can be chained together, similar to how shell pipelines chain together commands. Compared to shell pipelines, work can be shared across repeated pipeline executions by taking advantage of the fact that Make doesn’t rebuild existing intermediate files unnecessarily.
Background: Make and Make Rules
A typical Makefile for generating a binary from source code has a very simple structure: first, the Makefile describes how to create object files from each source file; then it describes how to combine all the object files into the final binary. These descriptions are in the form of pattern-based rules like the following:
%.o: %.c
	clang -c $< -o $@
This rule says: “to create the object file X.o, first ensure X.c exists, then run clang -c X.c -o X.o.” In the rule, the variables $@ and $< stand for file names constructed by pattern matching: $@ stands for the target to create (X.o) and $< stands for the rule’s input (the file named X.c, which is formed by matching the rule’s target, %.o, against X.o, so % is replaced by X). Note that the “ensure X.c exists” step will require running additional rules from the Makefile if X.c is itself the output of another transformation, like the application of a parser generator to a grammar specification.
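For concreteness, here is a minimal sketch of such a Makefile, assuming a hypothetical program prog built from two source files, main.c and util.c (the automatic variable $^ expands to all of a rule’s prerequisites):

# Link all the object files into the final binary.
prog: main.o util.o
	clang $^ -o $@

# Create an object file from each source file.
%.o: %.c
	clang -c $< -o $@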
Defining and Composing Functions with Make
A Makefile rule describing how to produce a .o file from a .c file defines a function from .c files to .o files. This function can be explicitly invoked by passing the target to the make command:
make object.o
In this function invocation, the input to the function, object.c, is implicit, since it’s determined automatically by the rule for producing object.o.
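Assuming object.c exists alongside the Makefile, the invocation looks like this (Make echoes each command it runs):

$ make object.o
clang -c object.c -o object.o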
We can define Make rules for applying arbitrary functions to arbitrary inputs. For example, this rule takes any file f and produces a file f.words consisting of all the words in f:
%.words: %
	< $< tr -d '[:punct:]' | tr '\n\r' ' ' | tr -s '[:space:]' | \
		tr '[:space:]' '\n' > $@
The tr commands remove punctuation, replace newlines and carriage returns with spaces, collapse multiple spaces into one, and then translate the remaining spaces into newlines.
If we have the text of Moby Dick in the file moby-dick, we can output all the words in the text, one per line, in the order they appear, into the file moby-dick.words with:
make moby-dick.words
That is, we apply the words function to the input moby-dick to produce moby-dick.words.
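For example, the novel’s famous opening sentence, “Call me Ishmael.”, becomes these three lines in the output (the first tr strips the period):

Call
me
Ishmael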
We can chain function invocations together by adding more suffixes. Let’s define a function, as a Make rule, that sorts the lines of a file:
%.sorted: %
	< $< sort > $@
We can now compose the words and sorted functions together to produce a sorted list of the words in moby-dick in moby-dick.words.sorted:
make moby-dick.words.sorted
We can add a unique function to compute the unique words from an already-sorted input:
%.sorted.unique: %.sorted
	< $< uniq > $@
Note the sorted on the prerequisite side of the rule: the uniq command requires its input to be sorted.
Let’s define a few more useful functions:
# Lowercase every character.
%.lowercase: %
	< $< tr '[:upper:]' '[:lower:]' > $@

# Count the occurrences of each line in an already-sorted input.
%.sorted.unique.count: %.sorted
	< $< uniq -c > $@

# Sort lines numerically, in descending order.
%.descending-numeric-sorted: %
	< $< sort -n -r > $@
With these additional function definitions, we can get the list of all unique words in Moby Dick, after lowercasing, in descending order of frequency, with the following:
make moby-dick.words.lowercase.sorted.unique.count.descending-numeric-sorted
Note that when more than one pattern rule matches a target, Make chooses the most specific one (in GNU Make, the rule with the shortest stem), so the sorted.unique.count rule takes precedence over the sorted.unique rule; this ensures that we apply the right uniq command to get the unique words in the sorted word list, along with their frequencies.
(Not to spoil the book for anybody, but “whale” is pretty high up there.)
Function Memoization and Ad-Hoc Data Updates
One of the defining attributes of Make is that it doesn’t repeat any work; that is, it doesn’t recreate files that are up to date. If you already created moby-dick.words.sorted, and neither of the input files moby-dick nor moby-dick.words has changed since moby-dick.words.sorted was created, then creating moby-dick.words.sorted.unique reuses the existing moby-dick.words.sorted without recreating any files. That is, Make can memoize our functions.
However, since Make deletes intermediate files after each run, you don’t get memoization automatically; instead, you have to explicitly ask for it, either by creating the outputs you want to memoize directly (by running the appropriate make command) or by declaring the outputs you want to memoize with, say, the .SECONDARY or .PRECIOUS special targets, assuming you’re using GNU Make.
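For example, with GNU Make, a .SECONDARY special target with no prerequisites marks every file as secondary, so Make never deletes any file merely for being intermediate:

# Keep all intermediate files across runs (GNU Make only).
.SECONDARY: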
Comparison to Shell Pipelines
What I’ve described above is basically shell pipelines: we have a number of operations we compose together in ad-hoc ways to transform data. Further, shell pipelines have the advantages that we don’t need to define anything ahead of time, all operations in a pipeline run in parallel, and data stay in memory rather than being written to disk.
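For comparison, the full word-frequency computation above corresponds to roughly this ordinary shell pipeline:

< moby-dick tr -d '[:punct:]' | tr '\n\r' ' ' | tr -s '[:space:]' | \
	tr '[:space:]' '\n' | tr '[:upper:]' '[:lower:]' | \
	sort | uniq -c | sort -n -r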
Given the many advantages of ordinary shell pipelines, is there any reason to use Make for essentially the same purpose, or is it just a neat trick?
I think there are a few cases where this might be practically useful:
- When you’re incrementally creating a pipeline, the memoizing aspect of Make makes it convenient to explore without repeating expensive computations.
- When you need to do ad-hoc cleaning of data in the middle of a pipeline, it’s easy to just chop off the last few functions from the Make invocation, do the cleanup on the resulting output, and then re-run the full, original pipeline, which will resume from the point where you made the cleanup (see the sketch after this list).
- When you’re generating multiple related final outputs from the same intermediate data, Make can help you avoid repeating expensive steps.
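As a sketch, the ad-hoc cleanup workflow might look like this (the hand edits being whatever your data needs):

$ make moby-dick.words.lowercase.sorted
$ vi moby-dick.words.lowercase.sorted    # hand-fix any mis-parsed words
$ make moby-dick.words.lowercase.sorted.unique.count.descending-numeric-sorted

Because the edited file is newer than its inputs, the final invocation rebuilds only the stages downstream of it.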
Thanks to Daniel B. Smith for valuable comments and suggestions on a draft of this post.