Abusing Make to Create Shell Pipelines on Steroids
Make is a well-known tool for transforming input files, like source code, into “targets,” like runnable programs, based on a set of rules defined in a Makefile. Typically, Make is used to transform a predefined set of inputs into a target. This post presents an alternative way of using (or, really, abusing) Make to define and execute arbitrarily long sequences of data transformations. We’ll do this by essentially interpreting Make targets as functions that can be chained together, similar to how shell pipelines chain together commands. Compared to shell pipelines, work can be shared across repeated pipeline executions by taking advantage of the fact that Make doesn’t rebuild existing intermediate files unnecessarily.
Background: Make and Make Rules
A typical Makefile for generating a binary from source code has a very simple structure: first, the Makefile describes how to create object files from each source file; then it describes how to combine all the object files into the final binary. These descriptions are in the form of pattern-based rules like the following:
%.o: %.c
	clang -c $< -o $@
This rule says: “to create the object file X.o, first ensure X.c exists, then run clang -c X.c -o X.o.” In the rule, the variables $@ and $< stand for file names constructed by pattern matching: $@ stands for the target to create (X.o) and $< stands for the rule’s input (the file named X.c, which is formed by matching the rule’s target, %.o, against X.o, so % is replaced by X). Note that the “ensure X.c exists” step will require running additional rules from the Makefile if X.c is itself the output of another transformation, like the application of a parser generator to a grammar specification.
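For concreteness, here is a minimal sketch of such a Makefile, assuming a hypothetical program prog built from two source files, main.c and util.c (the automatic variable $^ expands to all of a rule’s prerequisites):

# Link all the object files into the final binary.
prog: main.o util.o
	clang $^ -o $@

# Create an object file from each source file.
%.o: %.c
	clang -c $< -o $@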
Defining and Composing Functions with Make
A Makefile rule describing how to produce a .o file from a .c file defines a function from .c files to .o files. This function can be explicitly invoked by passing the target to the make command:
make object.o
In this function invocation, the input to the function, object.c, is implicit, since it’s determined automatically by the rule for producing object.o.
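Assuming object.c exists alongside the Makefile, the invocation looks like this (Make echoes each command it runs):

$ make object.o
clang -c object.c -o object.o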
We can define Make rules for applying arbitrary functions to arbitrary inputs. For example, this rule takes any file f and produces a file f.words consisting of all the words in f:
%.words: %
	< $< tr -d '[:punct:]' | tr '\n\r' ' ' | tr -s '[:space:]' | \
		tr '[:space:]' '\n' > $@
The tr commands remove punctuation, replace newlines and carriage returns with spaces, collapse multiple spaces into one, and then translate the remaining spaces into newlines.
If we have the text of Moby Dick in the file moby-dick, we can output all the words in the text, one per line, in the order they appear, into the file moby-dick.words with:
make moby-dick.words
That is, we apply the words function to the input moby-dick to produce moby-dick.words.
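For example, the novel’s famous opening sentence, “Call me Ishmael.”, becomes these three lines in the output (the first tr strips the period):

Call
me
Ishmael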
We can chain function invocations together by adding more suffixes. Let’s define a function, as a Make rule, that sorts the lines of a file:
%.sorted: %
	< $< sort > $@
We can now compose the words and sorted functions together to produce a sorted list of the words in moby-dick in moby-dick.words.sorted:
make moby-dick.words.sorted
We can add a unique function to compute the unique words from an already-sorted input:
%.sorted.unique: %.sorted
	< $< uniq > $@
Note the sorted on the prerequisite side of the rule: the uniq command requires its input to be sorted.
Let’s define a few more useful functions:
# Lowercase every character.
%.lowercase: %
	< $< tr '[:upper:]' '[:lower:]' > $@

# Count the occurrences of each line in an already-sorted input.
%.sorted.unique.count: %.sorted
	< $< uniq -c > $@

# Sort lines numerically, in descending order.
%.descending-numeric-sorted: %
	< $< sort -n -r > $@
With these additional function definitions, we can get the list of all unique words in Moby Dick, after lowercasing, in descending order of frequency, with the following:
make moby-dick.words.lowercase.sorted.unique.count.descending-numeric-sorted
Note that when more than one pattern rule matches a target, Make chooses the most specific one (in GNU Make, the rule with the shortest stem), so the sorted.unique.count rule takes precedence over the sorted.unique rule; this ensures that we apply the right uniq command to get the unique words in the sorted word list, along with their frequencies.
(Not to spoil the book for anybody, but “whale” is pretty high up there.)
Function Memoization and Ad-Hoc Data Updates
One of the defining attributes of Make is that it doesn’t repeat any work; that is, it doesn’t recreate files that are up to date. If you already created moby-dick.words.sorted, and neither of the input files moby-dick nor moby-dick.words has changed since moby-dick.words.sorted was created, then creating moby-dick.words.sorted.unique reuses the existing moby-dick.words.sorted without recreating any files. That is, Make can memoize our functions.
However, since Make deletes intermediate files after each run, you don’t get memoization automatically; instead, you have to explicitly ask for it, either by creating the outputs you want to memoize directly (by running the appropriate make command) or by declaring the outputs you want to memoize with, say, the .SECONDARY or .PRECIOUS special targets, assuming you’re using GNU Make.
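For example, with GNU Make, a .SECONDARY special target with no prerequisites marks every file as secondary, so Make never deletes any file merely for being intermediate:

# Keep all intermediate files across runs (GNU Make only).
.SECONDARY: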
Comparison to Shell Pipelines
What I’ve described above is basically shell pipelines: we have a number of operations we compose together in ad-hoc ways to transform data. Further, shell pipelines have the advantages that we don’t need to define anything ahead of time, all operations in a pipeline run in parallel, and data stay in memory rather than being written to disk.
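For comparison, the full word-frequency computation above corresponds to roughly this ordinary shell pipeline:

< moby-dick tr -d '[:punct:]' | tr '\n\r' ' ' | tr -s '[:space:]' | \
	tr '[:space:]' '\n' | tr '[:upper:]' '[:lower:]' | \
	sort | uniq -c | sort -n -r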
Given the many advantages of ordinary shell pipelines, is there any reason to use Make for essentially the same purpose, or is it just a neat trick?
I think there are a few cases where this might be practically useful:
- When you’re incrementally creating a pipeline, the memoizing aspect of Make makes it convenient to explore without repeating expensive computations.
- When you need to do ad-hoc cleaning of data in the middle of a pipeline, it’s easy to just chop off the last few functions from the Make invocation, do the cleanup on the resulting output, and then re-run the full, original pipeline, which will resume from the point where you made the cleanup (see the sketch after this list).
- When you’re generating multiple related final outputs from the same intermediate data, Make can help you avoid repeating expensive steps.
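As a sketch, the ad-hoc cleanup workflow might look like this (the hand edits being whatever your data needs):

$ make moby-dick.words.lowercase.sorted
$ vi moby-dick.words.lowercase.sorted    # hand-fix any mis-parsed words
$ make moby-dick.words.lowercase.sorted.unique.count.descending-numeric-sorted

Because the edited file is newer than its inputs, the final invocation rebuilds only the stages downstream of it.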
Thanks to Daniel B. Smith for valuable comments and suggestions on a draft of this post.