# Memory Consistency Models, and how to compare them automatically

My POPL 2017 paper with Mark Batty, Tyler Sorensen, and George Constantinides is all about memory consistency models, and how we can use an automatic constraint solver to compare them. In this blog post, I will discuss:

• What are memory consistency models, and why do we want to compare them?
• Why is comparing memory consistency models difficult, and how do we get around those difficulties in our work?
• What can we achieve, now that we can compare memory consistency models automatically?

# What is a memory consistency model?

Modern computers achieve blistering performance by having their computational tasks divided between many ‘mini-computers’ that run simultaneously. These mini-computers all have access to a collection of shared memory locations, through which they can communicate with one another.

It is tempting to assume that when one of the mini-computers stores some data into one of these shared memory locations, that data will immediately become visible to any other mini-computer that subsequently loads from that location. This assumption, together with the assumption that each mini-computer performs its stores and loads in the same order as they were written by the programmer, encodes the simplest memory consistency model. The model is called sequential consistency, and was first formalised by Leslie Lamport in 1979.

However, modern computers break several of the assumptions that sequential consistency makes. On an x86 computer, for instance, one mini-computer’s store to memory is not immediately visible to all the other mini-computers, because the data to be stored is held in a ‘write buffer‘ before reaching the main memory. Other computers, such as the Power multiprocessors made by IBM, or the graphics processors made by NVIDIA or AMD, incorporate further architectural optimisations that confound sequential consistency. Also responsible for breaking the assumptions of sequential consistency are compilers. Compilers take the programmer’s code, which may be written in a language like C, C++, Java, or OpenCL, and translate it into code that computers can run. They routinely reorder instructions when they judge it efficient (and safe) to do so, and this can result in loads and stores being performed in a different order from that specified by the programmer.

Accordingly, modern computers (like x86 and Power multiprocessors, and NVIDIA and AMD graphics processors) and modern programming languages (like C and OpenCL) define memory consistency models that are more permissive (i.e. allow more behaviours) than sequential consistency. As an example, consider the following program, which involves two threads separated by a double vertical line (each thread is carried out by a different mini-computer) and two shared memory locations $\tt x$ and $\tt y$.

The left-hand thread stores $1$ in location ${\tt x}$ and then loads the value of ${\tt y}$ into the variable ${\tt r0}$. Symmetrically, the right-hand thread stores to ${\tt y}$ and then loads from ${\tt x}$. If this program were executed on a computer that conformed to sequential consistency, only three outcomes would be possible: either the left-hand thread finishes before the right-hand thread starts (which results in ${\tt r0}=0$ and ${\tt r1}=1$), or the right-hand thread finishes before the left-hand thread starts (which results in ${\tt r0}=1$ and ${\tt r1}=0$), or both stores finish before either load starts (which results in ${\tt r0}=1$ and ${\tt r1}=1$). The x86 memory consistency model, however, permits a fourth possibility (which is experimentally observable on actual x86 machines): ${\tt r0}=0$ and ${\tt r1}=0$. This surprising outcome can be explained by both stores being performed simultaneously but their effects on main memory being delayed by the aforementioned write buffer, leaving the loads able to observe the initial values for both locations. An alternative explanation for this behaviour is in terms of compilation: if an optimising compiler judges that the two instructions in each thread are independent of each other and can be reordered, then both ${\tt r0}$ and ${\tt r1}$ can be assigned $0$.

# How are memory consistency models defined?

Beyond specific examples such as the one above, how does one determine in general which behaviours are allowed or disallowed when running a program under a given memory consistency model? One of the main approaches is the axiomatic approach. The axiomatic approach to determining allowed program behaviours has two phases. In the first phase, the program is translated into its set of candidate executions, each of which represents one way the program might behave when it is run. For instance, the program drawn above has the following four candidate executions.

Each execution is composed of a set of events joined with arrows. Each event represents a memory-related action being performed by the computer. Store instructions give rise to write (W ) events, and load instructions give rise to read (R) events. Each event is tagged with the location it accesses (e.g. ${\tt x}$ or ${\tt y}$) and the value it reads or writes (e.g. $0$ or $1$). The dotted rectangles group events into different threads. The black ‘sb’ arrows (which stands for sequenced before) capture the order of the instructions in the original program. For instance, the “${\tt x=1}$” instruction is ordered before the “${\tt r0=y}$” instruction in the original program, so each of the candidate executions above places an ‘sb’ arrow from the write of ${\tt x}$ to the read of ${\tt y}$. The red ‘rf’ arrows (which stands for reads from) capture how data moves between the events of the execution: each read event either is pointed to by an ‘rf’ arrow that originates at the write event from which the read event obtains its value, or has no ‘rf’ arrow pointing to it, which means that the read event observes the default value of $0$. The candidate executions above cover all the possible ways to place the ‘rf’ arrows. Finally, the purple ‘fr’ arrows (which stands for from reads) link each read event to all the write events that overwrite the data that the read event observed. For instance, in the bottom-left execution above, the right-hand thread’s read of ${\tt x}$ observes the initial value, $0$. This value is overwritten when the left-hand thread writes ${\tt x}$ to $1$, so we draw an ‘fr’ edge from the read to the write.

In the second phase, the set of candidate executions is pruned, to leave just the executions that the memory consistency model allows. The pruning process is done on an execution-by-execution basis. Each candidate execution is checked for certain patterns, and as long as it does not exhibit any patterns that the memory consistency model deems inconsistent, it is allowed. For instance, sequential consistency states that it is forbidden for ‘fr’ and ‘sb’ arrows to form loops, as they do in the bottom-right execution above. It is for this reason that sequential consistency forbids the outcome ${\tt r0}=0$ and ${\tt r1}=0$. The x86 memory consistency model, on the other hand, does not impose this particular rule, which results in that outcome being allowed.

Actually, the situation is a little more complicated than this. In programming languages like C or OpenCL, if any of a program’s consistent candidate executions is found to contain a data race (which roughly means that one mini-computer can be accessing a memory location at the same time as another mini-computer is storing a value to that location) then all executions become allowed. The idea is that if programmers write code that is liable to cause a data race, then no guarantees can be made about what will happen when that code is run. For an example of this, consider the C program below.

The left-hand thread stores $1$ to location ${\tt x}$, and then performs a special releasing store to location ${\tt y}$. Meanwhile, the right-hand thread performs a special acquiring load from ${\tt y}$ and then an ordinary load from ${\tt x}$. (NB: Releasing and acquiring accesses by different threads are allowed to proceed simultaneously, without causing a data race.) The four candidate executions of this program are given below.

According to C’s memory consistency model, all of these executions are consistent except for the bottom-left one. The reason for this has to do with a mechanism called release/acquire synchronisation. In C, if an acquiring load observes the value written by a releasing store, then all the events that happened prior to the releasing store are guaranteed to be visible to all the events that happen after the acquiring load. The bottom-left execution above is deemed inconsistent because even though the release/acquire synchronisation on location ${\tt y}$ has been successful (as can be seen by the presence of the ‘rf’ arrow), the right-hand thread does not see that ${\tt x}$ has been set to $1$.

However, a more worrying aspect of this program is that the top-left consistent execution contains a data race. In this execution, the release/acquire synchronisation fails because the acquiring load does not see the data from the releasing store. Nonetheless, the right-hand thread goes ahead and loads from ${\tt x}$. In the absence of the release/acquire synchronisation, there is nothing to stop the load and the store on ${\tt x}$ coinciding, and since these accesses are not special acquiring or releasing instructions, they cause a data race. This means that the program above can exhibit any behaviour at all – even the behaviour that we just deemed inconsistent!

# Why do we want to compare memory consistency models?

There are three main reasons for wanting to compare memory consistency models.

The first reason has to do with the fact that memory consistency models are continuing to evolve. For instance, several researchers have proposed amendments to the C memory consistency model, either to fix quirks in the model, or to give it nicer mathematical properties, or just to make it a bit simpler for programmers to understand. Each time a modification is proposed, it is important to compare it to the original model, to understand exactly which program behaviours become newly-allowed or newly-disallowed.

The second reason goes beyond just comparing two variants of the same memory consistency model, and instead looks at models that exist at different levels of abstraction. For instance, a comparison between the C memory consistency model and the x86 memory consistency model sheds light on how the code written by a C programmer should be compiled down to code that can be run by an x86 computer. A fundamental tenet of compilation is that a correct compiler never introduces new behaviours. In other words, if the C memory consistency model says that a certain behaviour of a program is not allowed, but the x86 memory consistency model says that the compiled program can exhibit that behaviour, then the compiler is wrong. In general, by comparing what a language-level memory consistency model says about programs with what the computer-level memory consistency model says about compiled programs, we have the opportunity to discover errors in compilers. These errors are difficult to find by experimentally testing because they may only show up intermittently, and they are difficult to find by manually inspecting the compiler because the memory consistency models involved are so complex and counterintuitive. Nonetheless, such errors are prevalent in real-world compilers; for instance, the standard scheme for compiling C to Power multiprocessors has recently been shown to be flawed, and a paper of mine (with Mark Batty, Brad Beckmann, and Ally Donaldson) identified two flaws in an compiler from OpenCL to AMD’s graphics processors.

The third reason involves comparing a memory consistency model to itself. As discussed above, we can check that a compiler is correctly mapping high-level language instructions to low-level computer instructions by comparing what the language-level memory consistency model says about programs with what the computer-level memory consistency model says about compiled programs. But before compilers perform the final mapping, they tweak the input program in an attempt to make it more efficient. Just like the final mapping, these tweaks (which are commonly called compiler optimisations) should never introduce new behaviours. In other words, if a memory consistency model says that a certain behaviour of a program is not allowed, but that this behaviour becomes allowed once the program is optimised, then the optimisation is wrong. Thus, by comparing which behaviours a memory consistency model allows before and after optimising, we have the opportunity to identify flaws in compiler optimisations. As with the compiler mappings discussed above, flaws in compiler optimisations are tricky to find, but are certainly prevalent. For instance, Viktor Vafeiadis and colleagues demonstrated last year that a whole host of optimisations routinely used by C compilers are actually not permitted by the C memory consistency model, because they can introduce new behaviours when performed on some (extremely contrived!) programs.

# Why is comparing memory consistency models difficult?

Comparing two memory consistency models boils down to finding a litmus test that can tell them apart. A litmus test is just a program (typically a rather small program) together with an outcome. We say that the test passes if when the program is run, the specified outcome is obtained. Litmus tests typically do not behave in the same way each time they are run, so we talk of tests that can never pass, or sometimes pass, or always pass. For instance, the first program we showed above, together with the outcome “${\tt r0}=0$ and ${\tt r1}=0$” , forms a litmus test – a litmus test that never passes under sequential consistency, and sometimes passes under the x86 memory consistency model.

To compare two variants of a memory consistency model, say $M$ and $N$, we search for a litmus test that never passes under $M$ but sometimes passes under $N$. If we are successful in this search, we have shown that $M$ is no more permissive than $N$. To check a compiler optimisation, we search for a litmus test that never passes, but which can pass after being optimised. And to check a compiler mapping, we search for a litmus test that never passes (under the language-level memory consistency model), but which can pass (under the computer-level memory consistency model) after the compiler mapping is applied.

In all three cases, the problem has the same ‘shape’: we are always looking for programs $P$ and $Q$, and a final outcome $\sigma$, such that $P$ and $Q$ are related in some way (either they are the same, or $P$ optimises to $Q$, or $P$ is compiler-mapped to $Q$), the litmus test $(P,\sigma)$ never passes, and the litmus test $(Q,\sigma)$ sometimes passes. In the diagram below, we use a black triangle to represent the appropriate relationship between $P$ and $Q$.

We can plug these requirements into an automatic constraint solver and ask it to come up with suitable litmus tests. However, it turns out that these requirements are extremely hard for automatic constraint solvers to handle. Showing that $P$ and $Q$ are related is easy, and showing that the litmus test $(Q,\sigma)$ sometimes passes is also easy: one just needs to demonstrate one execution of $Q$ that ends up with the desired outcome $\sigma$. But showing that $(P,\sigma)$ never passes is difficult, because one needs to demonstrate that every execution of $P$  – of which there may be a great number – does not end up with the outcome $\sigma$. In our initial experiments, we found it infeasible to compare even the simplest of memory consistency models using this approach.

# How do we get around this difficulty in our work?

One of the key ideas in our work is to flip the diagram above on its head. Rather than asking the constraint solver to find $P$ and $Q$ directly, we start by asking it to find executions $X$ and $Y$. We require that $X$ is inconsistent under memory consistency model $M$, that $Y$ is consistent under $N$, and that $X$ and $Y$ bear a certain relation to each other that we will explain in a moment. If the solver can find such a pair of executions, then we reverse-engineer those executions to form a pair of litmus tests. That is, from $X$ we construct a litmus test $(P,\sigma)$ that has $X$ as one of its candidate executions, and from $Y$ we construct a litmus test $(Q,\sigma)$ that has $Y$ as one of its candidate executions. This construction process is fairly straightforward: read events in the execution become load instructions in the litmus test, write events become stores, sequencing arrows in the execution determine the order of instructions in the litmus test, and reads-from arrows are reflected in the litmus test’s final outcome. We mentioned above that $X$ and $Y$ must bear a certain relation to each other; this relation, which we draw as a white triangle in the diagram below, serves to ensure that the constructed litmus tests have the same final outcome $\sigma$, and that the constructed litmus tests are related to each other in the required way (e.g. they are the same, or one optimises to the other, or one is compiler-mapped to the other).

At this point, we have obtained, as desired, a litmus test $(Q,\sigma)$ that sometimes passes  – we know this because the test has $Y$ as one of its candidate executions and $Y$ is consistent. We have also obtained a litmus test $(P,\sigma)$ that sometimes fails – we know this because the test has $X$ as one of its candidate executions and $X$ is inconsistent. But we required that $(P,\sigma)$ always fails! As things stand, it may be that $(P,\sigma)$ can pass by following a different execution, say $X'$, that is allowed.

We now give an example of exactly this situation. Suppose our constraint solver finds the execution below, which is inconsistent according to the C memory consistency model, and which we shall call $X$.

This execution we have seen already: it is one of the candidate executions of our message-passing program. We go on to convert $X$ into a litmus test in the manner described above, and this results in the following.

This litmus test we have also seen already: it is the message-passing program from earlier. And as we remarked earlier, the program exhibits a data race, and therefore all outcomes are allowed. For this reason, we have certainly not found a litmus test that always fails.

We are able to resolve this situation simply by imposing extra constraints when we generate $X$. One of the constraints we impose is that if an execution averts a data race via the release/acquire synchronisation mechanism, then one of the events in the potential data race must be conditional on the acquiring load successfully observing the value written by the releasing store. An example of an execution that meets this additional constraint is given below.

The extra ‘cd’ arrow represents a control dependency. It expresses that the read of ${\tt x}$ only happened because the read of ${\tt y}$ obtained the value that it did. When we transform executions into litmus tests, these arrows give rise to if-statements, as shown below.

This litmus test does not give rise to any data races, because the right-hand thread’s load of ${\tt x}$, which could potentially race with the left-hand thread’s store to ${\tt x}$, only happens if the release/acquire synchronisation on ${\tt y}$ is successful, and when this release/acquire synchronisation is successful it enforces a race-preventing ‘barrier’ between the accesses to ${\tt x}$.

Besides this constraint that governs release/acquire synchronisation, we impose a few more to deal with similar problematic situations (about which more details are available in the full paper). Collectively, we say that these additional constraints enforce that the execution $X$ is dead. We use this slightly overdramatic term to indicate that not only is the execution $X$ forbidden, but so is every other execution of the litmus test obtained from $X$.

It is possible that by adding the deadness constraint, we miss some litmus tests that would distinguish two memory consistency models. We have made every effort to build the deadness constraint so that it removes only those executions that are genuinely problematic, but we cannot be sure that it is not a little overenthusiastic in rejecting executions. This caveat notwithstanding, the approach we have described – which involves finding individual executions (one of which is dead and inconsistent, the other of which is consistent) and then constructing litmus tests from them – yields a fast and useful way to compare memory consistency models automatically.

# What can we achieve, now that we can compare memory consistency models automatically?

## Comparing variants of a model

The first thing we can do is we can automatically compare two variants of the same memory consistency model. Several researchers have proposed variants of the C memory consistency model, and each proposal is illustrated with a litmus test that can pass or can fail as a result of the change. The shortest litmus test that is able to distinguish the variants is often the most informative. We have used our approach to compare the original C memory consistency model against three recently-proposed variants, and found that in two of these cases, our automatic approach was able to distinguish them with shorter litmus tests than the authors of the proposals were able to come up with manually.

As an example, here is an execution (and accompanying litmus test) that I came up with, by dint of manual effort, for distinguishing the original C memory consistency model from a variant proposed by Batty, Donaldson and myself in a paper presented earlier this year. For various technical reasons that I will not go into here, the litmus test can pass under the original model, but cannot under the revised model.

And here is the execution (and accompanying litmus test) that our automatic approach came up with for distinguishing the same memory consistency models. Note that the automatically-found litmus test requires just five instructions where the manually-found one needs seven.

## Checking compiler optimisations

The second thing we can do is check compiler optimisations. As mentioned above, Viktor Vafeiadis and colleagues demonstrated that several optimisations routinely employed in C compilers are invalid according to the C memory consistency model. We have subjected these optimisations to an automatic check using our approach, and were able to generate many of the same results that Vafeiadis and colleagues discovered through manual effort. And as before, our automatically-found litmus tests are often simpler than the manually-found originals.

To illustrate this, here is an example given by Vafeiadis and colleagues that shows that it is not valid for a C compiler to replace a ‘relaxed’ store (a relaxed store is another special kind of store instruction that we have not mentioned before) with a releasing store. The litmus test given below always fails under the C memory consistency model, but if the highlighted relaxed store is replaced with a releasing store, then it sometimes passes.

And below is our automatically-generated version. This time, the relaxed-to-releasing replacement is done on a special fence instruction rather than a store, but the spirit of the optimisation is the same. Note that it needs just seven instructions (rather than eight), three memory locations (rather than four), and two threads (rather than three).

## Checking compiler mappings

The third thing we can do is check compiler mappings. Here, we go beyond recreating (and simplifying) known results, and use our approach to guide the development of a new memory consistency model. We used our approach to check a compiler mapping for translating instructions in the OpenCL programming language into instructions that can be understood by NVIDIA’s graphics processors. My work with Mark Batty and Ally Donaldson provides a precise OpenCL memory consistency model, and my earlier work (with Jade Alglave, Mark Batty, Ally Donaldson, Ganesh Gopalakrishnan, Jeroen Ketema, Daniel Poetzl and Tyler Sorensen) provides a memory consistency model for NVIDIA graphics processors. Through an automatic comparison of these two memory consistency models, we were able to discover an error in the mapping. We blamed the error not on the mapping itself, but on the memory consistency model for NVIDIA graphics processors being unnecessarily permissive. That is: the model is experimentally sound with respect to the processors (because all the behaviours that were actually observed were indeed allowed by the model), but it also allows some behaviours that could never be observed. So, we devised a slightly more restrictive memory consistency model for NVIDIA graphics processors, and re-checked the compiler mapping. Happily, we were unable to discover any errors this time. It remained to be shown that the more restrictive model remains experimentally sound with respect to the actual graphics processors. To this end, we used our approach to compare the original NVIDIA memory consistency model with our revised one; this time asking not for a single litmus test that can distinguish them, but for all such tests (up to a reasonable size). We ran these tests on actual NVIDIA graphics processors (in the same way as we did when devising the original model), and were unable to find any examples of unsoundness in our revised model. We thus argue that our revised memory consistency model, developed with the aid of our automatic comparison approach, is both permissive enough to capture all the observable behaviours of NVIDIA graphics processors and restrictive enough to support an efficient compiler mapping from the OpenCL programming language.

# Next steps

Our full paper contains more examples of comparing variants of memory consistency models, more examples of checking compiler optimisations, and more examples of checking compiler mappings. It also provides a full definition of the ‘dead’ constraint, and the exact process we use for transforming executions into litmus tests.

The automatic constraint solver that we use is called Alloy. All of the memory consistency models, compiler optimisations, and compiler mappings that we have mentioned above have been written out in the Alloy format, and are all available to download from a GitHub repository.

## 3 thoughts on “Memory Consistency Models, and how to compare them automatically”

1. […] carry out an interesting analysis of our Arm model. We fed it into an automatic tool we developed in previous work, and discovered a bug. More precisely, we found that the way Arm processors do out-of-order […]

2. […] hardware. But our work also tackles the correctness issue, because we encoded our analysis in our Memalloy modelling tool, and checked that, for all programs of a bounded size, the schedule that our analysis generates is […]