Diagrams for Composing Compilers

T-diagrams (or tombstone diagrams) are widely used by teachers to explain how compilers and interpreters can be composed together to build and execute software. In this blog post, Paul Brunet and I revisit these diagrams, and show how they can be redesigned for better readability. We demonstrate how they can be applied to explain compiler concepts including bootstrapping and cross-compilation. We provide a semantics for our redesigned diagrams, based on binary trees. Finally, we suggest how our diagrams could be used to analyse the performance of a compilation system.

Introduction

In introductory courses on compilers across the globe, students are taught about the interactions between compilers using ‘tombstone diagrams’ or ‘T-diagrams’. In this formalism, a compiler is represented as a ‘T-piece’. A T-piece, as shown below, is characterised by three labels: the source language of the compiler, the target language of the compiler, and the language in which the compiler is implemented.

A compiler from source language ‘src’ to target language ‘tgt’ implemented in language ‘imp’.

Complex networks of compilers can be represented by composing these basic pieces in two ways. Horizontal composition means that the output of the first compiler is fed in as the input to the second. Diagonal composition means that the first compiler is itself compiled using the second compiler. The pictures below give examples of both forms.

Left: Horizontal composition of T-pieces. Haskell is compiled to C using a compiler written in Java, and thence to ARM assembly using another compiler that is also written in Java. Right: Diagonal composition of T-pieces. An OCaml-to-x86 compiler written in C is being compiled using a C-to-x86 compiler written in x86 assembly.
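To make the matching rules concrete, here is a minimal sketch in Python (our own illustration, not part of the original formalism) that models a compiler as a triple of languages and checks the two composition modes:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Compiler:
    src: str  # source language
    imp: str  # implementation language
    tgt: str  # target language

def compose_horizontal(first: Compiler, second: Compiler) -> None:
    """Pipeline check: the first compiler's output language must be
    the second compiler's input language."""
    if first.tgt != second.src:
        raise ValueError(f"cannot pipeline: {first.tgt} is not {second.src}")

def compose_diagonal(subject: Compiler, by: Compiler) -> Compiler:
    """Compile `subject` using `by`: only the subject's implementation
    language must match the compiling compiler's source language."""
    if subject.imp != by.src:
        raise ValueError(f"cannot compile: {subject.imp} is not {by.src}")
    # The result is the same compiler, re-implemented in `by`'s target.
    return Compiler(src=subject.src, imp=by.tgt, tgt=subject.tgt)

# The diagonal example from the figure: an OCaml-to-x86 compiler written
# in C, compiled by a C-to-x86 compiler written in x86 assembly.
result = compose_diagonal(Compiler("OCaml", "C", "x86"),
                          Compiler("C", "x86", "x86"))
print(result)  # an OCaml-to-x86 compiler, now implemented in x86
```

Note that `compose_diagonal` checks only one interface, which is exactly the point of Problem 1 below.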

In this blog post, we investigate the foundations of these diagrams. What are the rules that govern how they can be composed? Can we redesign them to make them more understandable? What do they mean in general? And what do they tell us about the compilers they depict?

Background

The usefulness of diagrams for explaining the connectivity between compilers can be traced back at least as far as the UNCOL project in the 1950s, which was an early effort to provide a ‘Universal Computer-Oriented Language’ that could interface with many different front-ends and back-ends. Diagrams such as the one reproduced below were used to show how compilers producing UNCOL could be composed with those that consume it.

A precursor to the T-diagram. An UNCOL-to-704 compiler (top left) running on an IBM 704 machine (middle) is used to compile an OTN-to-UNCOL compiler implemented in UNCOL (bottom left), which results in an OTN-to-UNCOL compiler implemented in 704 machine code (right).

Shortly thereafter, T-diagrams were proposed by Bratman as an improvement on the UNCOL diagrams. As an example, the diagram below shows Bratman’s reimagining of the diagram above.

The first T-diagram. This T-diagram depicts the same information as the previous diagram.

Three problems with T-diagrams are immediately apparent.

  1. When two T-pieces are diagonally composed, they meet at two interfaces, and it is not clear that only one of these interfaces is required to have matching languages. Diagonal composition only requires the ‘implementation’ language of the left piece to match the ‘source’ language of the right piece.
  2. T-diagrams typically do not distinguish the operands to the composition from the result of the composition. For example, in Bratman’s picture above, the right T-piece is actually the result of composing the left T-piece with the middle T-piece, but this relationship is not made clear by the diagram – all three pieces appear on an equal footing. (Of course, the obvious way to avoid this problem is simply not to show the result of the composition in the same diagram.)
  3. The symmetric shape of the T-pieces invites a third mode of composition, pictured below, that is not actually meaningful. (Note that this form of composition does seem to appear in Bratman’s picture above, but this is only because the operands and the result of composition are being conflated.)
Only one of the two joining interfaces requires its languages to match (Problem 1).
This form of composition looks legal but is actually not (Problem 3).

Over the next couple of decades, several other diagrammatic systems were proposed, as pictured below. For example, Burkhardt proposed replacing Bratman’s T-pieces with X-shaped pieces, Sklansky et al. suggested D-shaped pieces, Rosin used pieces with a variety of shapes, and Earley and Sturgis extended Bratman’s T-pieces with I-pieces that represent interpreters. However, none of them satisfactorily resolved all three problems identified above.

A Burkhardt diagram of a compiler that accepts programs in language ‘N’, generates programs in language ‘SPS’, and is implemented in IBM 709 machine code.
A diagram by Sklansky et al. Step 1 shows an algorithm q written in language L (q/L) being compiled by an L-to-M compiler implemented in M (L/MM) on a machine M, to produce an implementation of q in M (q/MM). In Step 2, this implementation is executed on machine M with data δ, to produce the result q(δ).
A Rosin diagram. The input program is written in the ‘PI’ language; this is compiled to the ‘PO’ language by a compiler implemented in IBM 360 machine code. That compiler is running in an interpreter for 360 machine code that is implemented in 2065 machine code, which itself is running on an IBM 2065 machine.
An Earley–Sturgis diagram. The f function is written in Lisp, then translated to machine language (ML) by a compiler written in Lisp. That compiler is running inside a Lisp interpreter written in ML. The dashed line seems to be an attempt to clarify that the right-hand piece is the result of the composition (cf. Problem 2).

Our proposal: J-diagrams

We propose a redesign in which the basic piece has the shape of a backwards ‘J’.

This design makes it clear that there are two (and only two) modes of composition, as shown in the pictures below. Thus Problem 1 and Problem 3 are immediately addressed. To address Problem 2, we simply propose that the J-piece resulting from a composition is not shown in the same diagram as its operands.

Left: Horizontal composition of J-pieces. Right: Diagonal composition of J-pieces.

We also propose three special cases of the J-piece that omit one or more interfaces. An I-piece omits the ‘target’ interface; it represents an interpreter, following Earley and Sturgis. A ‘tube’ omits the ‘implementation’ interface; it represents a compilation step that does not require an implementation, such as compiling Clight [4] to its superset, C. A ‘stopper’ omits both the ‘target’ and ‘implementation’ interfaces; it represents a machine that can directly execute programs in the ‘source’ language.

Left: An I-piece. Middle: A tube. Right: A stopper.

Next, we illustrate J-diagrams using several examples: Java compilation, verified C compilation using CompCert, a compiler testing framework, cross-compilation, and the history of the XPL programming language. In each case, we argue that the J-diagram is intuitive and clear, especially when compared to an equivalent T-diagram or ad-hoc diagram.

Java compilation as a J-diagram. A Java program is compiled to Java Bytecode using the javac compiler, which is written in Java. The JVM is an interpreter for Java Bytecode written in C++. The JVM itself is compiled using a C++ compiler such as gcc. Both javac and gcc must themselves be compiled somehow, but this diagram elects to leave those steps unspecified.
A J-diagram showing the high-level architecture of the CompCert compiler. The initial translation from C into the Clight sublanguage is not verified, but the rest of the compiler is written in Coq. The Coq tool is used to extract executable OCaml code.
A J-diagram showing how compilers written by students are assessed. Students submit a C-to-MIPS compiler implemented in C++, which is built using g++. The output of this compiler is tested by running it on the qemu MIPS emulator, which is implemented in C and built using gcc. In this diagram, we assume that g++ and gcc have both been compiled for an x86 machine.
Explaining cross-compilation using (top) an ad-hoc diagram from Wikipedia and (bottom) a J-diagram. The aim here is to use one machine (say, running Windows on an IA-32 processor) to build a compiler that can execute on another machine (say, running Mac OS on an x86_64 processor) and whose output can be executed on a third machine (say, running Android on an ARM processor). In both cases, the sequence is as follows. We first use the native Microsoft Visual C++ (msvc) compiler on the Windows machine to build the native g++ compiler for that machine. We then use this compiler to build a g++ compiler that targets the Mac. This compiler, in turn, is used to build the final compiler, which will execute on the Mac but will generate programs that execute on the Android device.
Explaining the history of the XPL language using (top) a T-diagram and (bottom) a J-diagram. The T-diagram, which is due to McKeeman, is hard to understand because of the large number of T-pieces and because of the ambiguity about which connections are meaningful. As a J-diagram, the history of the XPL language becomes much easier to read. The first XPL compiler (XCOM I) targeted IBM 360 machine code, and was written in ALGOL, which itself was compiled and executed on a Burroughs B5500 computer (far right). The second XPL compiler (XCOM II) was written in XPL, and was compiled using XCOM I, via a bootstrapping process. Similarly, the third version (XCOM III) was compiled using XCOM II. At the far left of the diagram, we see XPL being used to produce a compiler for a new user.

Interpreting J-diagrams

Our J-diagrams can be interpreted as vertex-labelled binary trees. As shown below, each J-piece represents a tree with three vertices, with each vertex labelled either with a language or with the distinguished ‘•’ symbol.

Interpreting J-pieces, I-pieces, tubes, and stoppers as vertex-labelled binary trees.

Composition of pieces corresponds to gluing trees together so that a leaf of the first tree has the same label as the root of the second tree. The picture provides an example of a tree obtained in this way.

The J-diagram from earlier in this blog post, interpreted as a tree.

We claim that interpreting a J-diagram as a tree in this way is natural. At the root of the tree is the ‘source’ language we wish to execute. At the leaves of the tree are all the languages that we need to be able to execute in order to be able to execute the language at the root of the tree. There is actually no particular need to distinguish between the ‘implementation’ language and the ‘target’ language – both are simply languages that we need to be able to execute. In other words, a compiler can be seen as a scheme for reducing the problem of executing programs in its ‘source’ language into two different problems:

  • the problem of executing programs in its ‘target’ language (the compiler’s output) and
  • the problem of executing programs in its ‘implementation’ language (the compiler itself).

For example, the J-diagram above can be thought of as a method for executing C programs that depends upon a method for executing x86 programs.
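As an illustration of this problem-reduction reading, here is a small hypothetical Python encoding (our own sketch, not the formal semantics): a piece is a (src, tgt, imp) triple, ‘•’ marks an omitted interface, and a leaf language may be refined into a further piece whose root carries the same label. Collecting the leaves tells us which languages we must be able to execute.

```python
def leaves(tree):
    """Collect the leaf languages of a diagram-as-tree: everything we
    must be able to execute in order to execute the root language."""
    if not isinstance(tree, tuple):
        return [] if tree == '•' else [tree]
    _src, tgt, imp = tree
    return leaves(tgt) + leaves(imp)

# An OCaml-to-x86 compiler written in C, whose C implementation is in
# turn compiled by a C-to-x86 compiler written in x86 assembly:
diagram = ('OCaml', 'x86', ('C', 'x86', 'x86'))
print(leaves(diagram))  # every leaf is x86: executing OCaml reduces
                        # to the single problem of executing x86

# An I-piece (interpreter) omits the target interface:
jvm = ('Java Bytecode', '•', 'C++')
print(leaves(jvm))      # executing Bytecode reduces to executing C++
```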

Using J-diagrams to analyse compilers

We now suggest how J-diagrams could be used to aid the analysis and comparison of compilation strategies.

The picture below shows a graph of possible compilation strategies from LISP to JavaScript. This is the kind of graph generated by Akram’s tool for discovering all ways – whether direct or indirect – of compiling between any two given languages. Each vertex represents a language, and each edge represents a compiler from one language to another.

A graph of possible compilation strategies from LISP to JavaScript.

What is missing from this graph is information about how each compiler is implemented, and hence which further compilers may be needed in order to build them. To address this shortcoming, the picture below shows both possible compilation strategies as a pair of J-diagrams. By using this representation, it becomes clear that if we compile via Java Bytecode we shall also need the ability to execute Java code, and if we compile via LLVM IR we shall also need the ability to execute C++ code. Thus, more information is made available for the designer to make an informed choice.

Designing a compilation strategy from LISP to JavaScript with the help of J-diagrams. Left: LISP is compiled to Javascript via Java Bytecode using ABCL and TeaVM, both of which are implemented in Java. Right: LISP is compiled to Javascript via LLVM IR using Clasp and Emscripten, both implemented in C++.
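To sketch how such a graph could be searched mechanically, here is a hypothetical Python fragment (the compiler table and the search are our own illustration, not Akram’s tool) that enumerates compilation paths together with the implementation languages each path additionally requires:

```python
from collections import deque

# Hypothetical edge list: each edge is (source, target, implementation).
compilers = [
    ("LISP", "Java Bytecode", "Java"),        # ABCL
    ("Java Bytecode", "JavaScript", "Java"),  # TeaVM
    ("LISP", "LLVM IR", "C++"),               # Clasp
    ("LLVM IR", "JavaScript", "C++"),         # Emscripten
]

def strategies(start, goal):
    """Breadth-first enumeration of compilation paths from `start` to
    `goal`, recording the implementation languages needed on the way."""
    results = []
    queue = deque([(start, [], set())])
    while queue:
        lang, path, imps = queue.popleft()
        if lang == goal:
            results.append((path, imps))
            continue
        for src, tgt, imp in compilers:
            # avoid revisiting a language already on this path
            if src == lang and tgt not in [step[1] for step in path]:
                queue.append((tgt, path + [(src, tgt)], imps | {imp}))
    return results

for path, imps in strategies("LISP", "JavaScript"):
    print(path, "additionally requires executing", sorted(imps))
```

Run on this table, the search reports the two strategies from the figure: via Java Bytecode (requiring the ability to execute Java) and via LLVM IR (requiring the ability to execute C++).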

If we have additional information about each J-piece beyond simply the languages involved, then we can use J-diagrams as an aid for reasoning about the performance of compilation strategies, as outlined next.

For any single J-piece, it is natural to ask:

  • How good is the generated output? Or, to coin a phrase, how well-targeted is the compiler? As a simple example, a compiler that inserts unnecessary sleep() instructions into its generated output is probably not well-targeted.
  • How long does it take to run? Or, to coin another phrase, how well-implemented is the compiler? As a simple example, a compiler whose source code includes unnecessary sleep() instructions is probably not well-implemented.

Then, for a J-diagram with source language S, it is natural to ask: if I use this compilation strategy to execute programs in language S, how quickly will I get the result?

A straightforward answer is that it depends upon how well-targeted and well-implemented all the compilers in the network are, since all of the compilers must be run before the final result can be obtained. For instance, in the J-diagram of the CompCert compiler above, in order to calculate the result of running the input C program, we need to run Coq to obtain an OCaml implementation of the verified compiler, then we need to run both the unverified compiler and the verified compiler. If the Coq compiler is well-targeted, then the verified compiler will be well-implemented.

A more nuanced answer involves distinguishing the time taken to obtain an executable (“compilation time”) from the time taken to run that executable (“running time”). In fact, there are several compilation times here: the time taken to compile the source program, the time taken to compile its compiler, the time taken to compile that compiler’s compiler, and so on.

Traditionally, one is more concerned with improving the running time than the compilation time, since an executable can be compiled once and then run many times. In turn, the time taken to compile the executable tends to be more important than the time taken to compile the compiler, since a compiler can be compiled once and then used many times.

These various compilation times correspond to the different rows of a J-diagram. The running time depends upon how well-targeted the top row of J-pieces is, and how efficient the machine or interpreter that runs the executable is. The time taken to obtain this executable, on the other hand, depends upon how well-implemented the top row of J-pieces is, which in turn depends upon how well-targeted the J-pieces in the second row (which compile them) are. The time taken to obtain the compiler that obtains this executable depends upon how well-implemented the second row of J-pieces is, which in turn depends upon how well-targeted the J-pieces in the third row are, and so on.

For instance, the time taken to obtain the CompCert compiler depends upon how well-implemented the Coq compiler is. The time taken to use CompCert to obtain an executable from a user-provided C program depends upon how well-implemented the unverified and verified compilers are, and how well-targeted the Coq compiler is. The running time of this executable depends upon how well-targeted the unverified and verified compilers are.

We can formalise this using the runningTime and compileTime functions defined below. As an example, we can see by unfolding those definitions that the ‘compile time’ for the J-diagram about student compilers depends upon the well-implementedness of the student’s compiler and the well-targetedness of g++, and that the ‘running time’ for that J-diagram depends upon the well-targetedness of the student’s compiler, the well-implementedness of qemu, and the well-targetedness of gcc, all of which is as expected.

How the running time and compile time of a program depend upon the well-implementedness and well-targetedness of the various compilers involved. Following a Prolog-like notation, we write ‘:−’ to mean ‘depends upon’.
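Under the same reading, the ‘depends upon’ rules can be sketched as recursive Python functions (a hypothetical encoding of our own; the node shapes and names are illustrative, not the notation from the figure). An executor is either a compiler node, an interpreter node, or a bare machine name:

```python
# ('J', name, target_executor, implementation_executor): a compiler.
# ('I', name, implementation_executor): an interpreter.
# A bare string: a machine that executes its language directly.

def run_deps(node):
    """What the running time of a program depends upon, when it is
    executed via this chain of compilers and interpreters."""
    if isinstance(node, str):          # a machine: no further dependency
        return set()
    if node[0] == 'J':                 # a compiler: quality of its output
        _, name, tgt_exec, _imp = node
        return {f"wellTargeted({name})"} | run_deps(tgt_exec)
    _, name, imp_exec = node           # an interpreter: its own speed
    return {f"wellImplemented({name})"} | run_deps(imp_exec)

def compile_deps(node):
    """What the compilation time depends upon: the top compiler's own
    speed, i.e. how well it is written and how it was itself compiled."""
    _, name, _tgt, imp_exec = node
    return {f"wellImplemented({name})"} | run_deps(imp_exec)

# The student-compiler diagram from earlier (g++ and gcc assumed to
# have been compiled already for an x86 machine):
gpp = ('J', 'g++', 'x86', None)
gcc = ('J', 'gcc', 'x86', None)
qemu = ('I', 'qemu', gcc)
student = ('J', 'student', qemu, gpp)

print(sorted(run_deps(student)))      # qemu's well-implementedness plus
                                      # the student's and gcc's well-targetedness
print(sorted(compile_deps(student)))  # the student's well-implementedness
                                      # plus g++'s well-targetedness
```

Unfolding these definitions on the student-compiler diagram reproduces the dependencies claimed in the text above.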

Conclusion

We have investigated the foundations of diagrams that describe how compilers can be composed. To conclude, let us return to the four questions posed in the introduction.

  • First, we explained that existing graphical systems, chiefly T-diagrams, are unsatisfactory because the rules about how compilers can be composed are not intuitive; for instance, one form of diagonal composition is legal but the symmetric one is not.
  • Second, we presented a new visual language, based on ‘backwards J’-shaped pieces, that addressed the shortcomings of previous systems.
  • Third, in answer to the question of what these diagrams ‘mean’, we showed how they can be interpreted quite straightforwardly as binary trees. A J-diagram can thus be thought of as a strategy for problem-reduction: it describes how the problem of executing programs in the language at its root vertex can be reduced to the problems of executing programs in all the languages at its leaf vertices.
  • Fourth, in answer to the question of what these diagrams tell us about the compilers they depict, we suggested how they could be a useful aid when comparing different compilation strategies, or when analysing which parts of the compilation process are on the ‘critical path’ for compilation time or running time.

Ultimately, we expect J-diagrams will prove most valuable in teaching settings, because they are – we believe – an improvement upon the T-diagrams that are already in widespread use.

The authors would like to thank Diana Costa, Alastair Donaldson, Roser Pujadas, Will Venters, and Matt Windsor for helpful discussions.

6 thoughts on “Diagrams for Composing Compilers”

  1. Hi John. A superb post — thanks so much for doing this!

    I was only exposed to T diagrams recently when they were included in the lecture materials I inherited. I found them a readability disaster, at least to my own brain, and made some ad-hoc changes for my lectures (including I-pieces almost as you have them). But for future years I will definitely revise our materials to use your notation.

    When I wrote my own notes on T-diagram problems, I noticed four main issues. Two are ones you tackled with the J and I pieces: the “horizontal” and “diagonal” composition notations are not at all clear or self-suggesting, and T-pieces don’t capture interpreters.

    My other two were (1) they do not distinguish source artifacts from built artifacts, even though this is critical in understanding bootstrapping etc., and (2) they do not provide a means to show “source provenance”, i.e. which compiler artifacts are derived from the same source, which matters for the same reason. Any thoughts on good ways to solve these? I think “use colour” is probably sufficient, but would be interested if you had other ideas.

    1. Thanks for your kind words, Stephen! Sounds like we’ve had similar reactions to T-diagrams then!

      Regarding your point (1), perhaps you can distinguish source artifacts from built artifacts by changing the implementation language from a source language to a machine language? For instance, “GCC source” could be {src=C, imp=C++, tgt=x86}, while “GCC built on an x86 machine” could be {src=C, imp=x86, tgt=x86}. Would that satisfy you?

      As for your second point, yeah, I don’t see a particularly elegant way of clarifying when two artifacts are derived from the same source. In my own lecturing material I used the same solution you describe — colour!

      1. Yes, I think it’s a fine “80% solution” simply to look at the source vs machine distinction for the lower prong. The missing 20% is for cases like a quick-and-dirty compiler written in assembly language, which has historically been a common-ish part of bootstrapping recipes. Maybe “light colour vs dark colour” is the way to go for source vs built, or something like that.

  2. Because the J is backwards, how about you call it a “tau diagram”?

    Also, it might be useful to include a machine architecture alongside the language on the source side, to distinguish between single-target compilers like GCC (where the target is chosen at compile time) and multi-target compilers (where it is chosen at run-time). This could provide a bit more insight into the cross-compilation examples.

    1. Thanks for your comment! Yeah I thought about “tau diagram”, but when I initially wrote the post in LaTeX, I realised that the tau that LaTeX produces doesn’t actually look much like the diagrams — it curves in a different place! Plus, “J” is a more familiar letter.
