T-diagrams (or tombstone diagrams) are widely used by teachers to explain how compilers and interpreters can be composed together to build and execute software. In this blog post, Paul Brunet and I revisit these diagrams, and show how they can be redesigned for better readability. We demonstrate how they can be applied to explain compiler concepts including bootstrapping and cross-compilation. We provide a semantics for our redesigned diagrams, based on binary trees. Finally, we suggest how our diagrams could be used to analyse the performance of a compilation system.
In introductory courses on compilers across the globe, students are taught about the interactions between compilers using ‘tombstone diagrams’ or ‘T-diagrams’. In this formalism, a compiler is represented as a ‘T-piece’. A T-piece, as shown below, is characterised by three labels: the source language of the compiler, the target language of the compiler, and the language in which the compiler is implemented.
Complex networks of compilers can be represented by composing these basic pieces in two ways. Horizontal composition means that the output of the first compiler is fed in as the input to the second. Diagonal composition means that the first compiler is itself compiled using the second compiler. The pictures below give examples of both forms.
In this blog post, we investigate the foundations of these diagrams. What are the rules that govern how they can be composed? Can we redesign them to make them more understandable? What do they mean in general? And what do they tell us about the compilers they depict?
The usefulness of diagrams for explaining the connectivity between compilers can be traced back at least as far as the UNCOL project in the 1950s, which was an early effort to provide a ‘Universal Computer-Oriented Language’ that could interface with many different front-ends and back-ends. Diagrams such as the one reproduced below were used to show how compilers producing UNCOL could be composed with those that consume it.
Shortly thereafter, T-diagrams were proposed by Bratman as an improvement on the UNCOL diagrams. As an example, the diagram below shows Bratman’s reimagining of the diagram above.
Three problems with T-diagrams are immediately apparent.
- When two T-pieces are diagonally composed, they meet at two interfaces, and it is not clear that only one of these interfaces is required to have matching languages. Diagonal composition only requires the ‘implementation’ language of the left piece to match the ‘source’ language of the right piece.
- T-diagrams typically do not distinguish the operands to the composition from the result of the composition. For example, in Bratman’s picture above, the right T-piece is actually the result of composing the left T-piece with the middle T-piece, but this relationship is not made clear by the diagram – all three pieces appear on an equal footing. (Of course, the obvious way to avoid this problem is simply not to show the result of the composition in the same diagram.)
- The symmetric shape of the T-pieces invites a third mode of composition, pictured below, that is not actually meaningful. (Note that this form of composition does seem to appear in Bratman’s picture above, but this is only because the operands and the result of composition are being conflated.)
Over the next couple of decades, several other diagrammatic systems were proposed, as pictured below. For example, Burkhardt proposed replacing Bratman’s T-pieces with X-shaped pieces, Sklansky et al. suggested D-shaped pieces, Rosin used pieces with a variety of shapes, and Earley and Sturgis extended Bratman’s T-pieces with I-pieces that represent interpreters. However, none of them satisfactorily resolved all three problems identified above.
Our proposal: J-diagrams
We propose a redesign in which the basic piece has the shape of a backwards ‘J’.
This design makes it clear that there are two (and only two) modes of composition, as shown in the pictures below. Thus Problem 1 and Problem 3 are immediately addressed. To address Problem 2, we simply propose that J-pieces that are the result of the composition are not shown in the same diagram.
We also propose two special cases of the J-piece that omit one interface. An I-piece omits the ‘target’ interface; it represents an interpreter, following Earley and Sturgis. A ‘tube’ omits the ‘implementation’ interface; it represents a compilation step that does not require an implementation, such as compiling Clight  to its superset, C. A ‘stopper’ omits both the ‘target’ and ‘implementation’ interface; it represents a machine that can directly execute programs in the ‘source’ language.
Next, we illustrate J-diagrams using several examples: Java compilation, verified C compilation using CompCert, a compiler testing framework, cross-compilation, and the history of the XPL programming language. In each case, we argue that the J-diagram is intuitive and clear, especially when compared to an equivalent T-diagram or ad-hoc diagram.
Our J-diagrams can be interpreted as vertex-labelled binary trees. As shown below, each J-piece represents a tree with three vertices, with each vertex labelled either with a language or with the distinguished ‘•’ symbol.
Composition of pieces corresponds to gluing trees together so that a leaf of the first tree has the same label as the root of the second tree. The picture provides an example of a tree obtained in this way.
We claim that interpreting a J-diagram as a tree in this way is natural. At the root of the tree is the ‘source’ language we wish to execute. At the leaves of the tree are all the languages that we need to be able to execute in order to be able to execute the language at the root of the tree. There is actually no particular need to distinguish between the ‘implementation’ language and the ‘target’ language – both are simply languages that we need to be able to execute. In other words, a compiler can be seen as a scheme for reducing the problem of executing programs in its ‘source’ language into two different problems:
- the problem of executing programs in its ‘target’ language (the compiler’s output) and
- the problem of executing programs in its ‘implementation’ language (the compiler itself). For example, the J-diagram above can be thought of as a method for executing C programs that depends upon a method for executing x86 programs.
Using J-diagrams to analyse compilers
We now suggest how J-diagrams could be used to aid the analysis and comparison of compilation strategies.
What is missing from this graph is information about how each compiler is implemented, and hence which further compilers may be needed in order to build them. To address this shortcoming, the picture below shows both possible compilation strategies as a pair of J-diagrams. By using this representation, it becomes clear that if we compile via Java Bytecode we shall also need the ability to execute Java code, and if we compile via LLVM IR we shall also need the ability to execute C++ code. Thus, more information is made available for the designer to make an informed choice.
If we have additional information about each J-piece beyond simply the languages involved, then we can use J-diagrams as an aid for reasoning about the performance of compilation strategies, as outlined next.
For any single J-piece, it is natural to ask:
- How good is the generated output? Or, to coin a phrase, how well-targeted is the compiler? As a simple example, a compiler that inserts unnecessary sleep() instructions into its generated output is probably not well-targeted.
- How long does it take to run? Or, to coin another phrase, how well-implemented is the compiler? As a simple example, a compiler whose source code includes unnecessary sleep() instructions is probably not well-implemented.
Then, for a J-diagram with source language S, it is natural to ask: if I use this compilation strategy to execute programs in language S, how quickly will I get the result?
A straightforward answer is that it depends upon how well-targeted and well-implemented all the compilers in the network are, since all of the compilers must be run before the final result can be obtained. For instance, in the J-diagram of the CompCert compiler above, in order to calculate the result of running the input C program, we need to run Coq to obtain an OCaml implementation of the verified compiler, then we need to run both the unverified compiler and the verified compiler. If the Coq compiler is well-targeted, then the verified compiler will be well-implemented.
A more nuanced answer involves distinguishing the time taken to obtain an executable (“compilation time”) from the time taken to run that executable (“running time”). In fact, there are several compilation times here: the time taken to compile the source program, the time taken to compile that compiler, the time taken to compile that compiler, and so on.
Traditionally, one is more concerned with improving the running time than the compilation time, since an executable can be compiled once and then run many times. In turn, the time taken to compile the executable tends to be more important than the time taken to compile the compiler, since a compiler can be compiled once and then used many times.
These various compilation times correspond to the different rows of a J-diagram. The running time depends upon how well-targeted the top row of J-pieces are, and how efficient is the machine or interpreter that runs this executable. The time taken to obtain this executable, on the other hand, depends upon how well-implemented the top row of J-pieces are, which in turn depends upon how well-targeted are the J-pieces (in the second row) that compile them. The time taken to obtain the compiler that obtains this executable depends upon how well-implemented the second row of J-pieces are, which in turn depends upon how well-targeted are the J-pieces in the third row, and so on.
For instance, the time taken to obtain the CompCert compiler depends upon how well-implemented the Coq compiler is. The time taken to use CompCert obtain an executable from a user-provided C program depends upon how well-implemented the unverified and verified compilers are, and how well-targeted the Coq compiler is. The running time of this executable depends upon how well-targeted the unverified and verified compilers are.
We can formalise this using the runningTime and compileTime functions defined below. As an example, we can see by unfolding those definitions that the ‘compile time’ for the J-diagram about student compilers depends upon the well–implementedness of the student’s compiler and the well–targetedness of g++, and that the ‘running time’ for that J-diagram depends upon the well–targetedness of the student’s compiler, the well–implementedness of qemu, and the well–targetedness of gcc, all of which is as expected.
We have investigated the foundations of diagrams that describe how compilers can be composed. To conclude, let us return to the four questions posed in the introduction.
- First, we explained that existing graphical systems, chiefly T-diagrams, are unsatisfactory because the rules about how compilers can be composed are not intuitive; for instance, one form of diagonal composition is legal but the symmetric one is not.
- Second, we presented a new visual language, based on ‘backwards J’-shaped pieces, that addressed the shortcomings of previous systems.
- Third, in answer to the question of what these diagrams ‘mean’, we showed how they can be interpreted quite straightforwardly as binary trees. A J-diagram can thus be thought of as a strategy for problem-reduction: it describes how the problem of executing programs in the language at its root vertex can be reduced to the problems of executing programs in all the languages at its leaf vertices.
- Fourth, in answer to the question of what these diagrams tell us about the compilers they depict, we suggested how they could be a useful aid when comparing different compilation strategies, or when analysing which parts of the compilation process are on the ‘critical path’ regarding compiling time or running time.
Ultimately, we expect J-diagrams will prove most valuable in teaching settings, because they are – we believe – an improvement upon the T-diagrams that are already in widespread use.
The authors would like to thank Diana Costa, Alastair Donaldson, Roser Pujadas, Will Venters, and Matt Windsor for helpful discussions.