In their POPL’17 paper, Shaked Flur, Susmit Sarkar, Christopher Pulte, Kyndylan Nienhuis, Luc Maranget, Kathy Gray, Ali Sezgin, Mark Batty, and Peter Sewell describe a semantics of weakly-consistent memory that copes (for the first time) with mixed-size memory accesses. In this blog post, I describe how their semantics can be explained rather nicely with some graphical ‘ribbon’ diagrams.

The Flur et al. semantics deals with programs that access individual bytes of memory as well as whole words (four bytes). For instance, given a program that writes a word starting at memory location {\tt x},

{\tt stw~0x03020100~0(x)}

then writes one byte starting at {\tt x}+2,

{\tt stb~0x11~2(x)}

and then reads the word starting at {\tt x},

{\tt lwz~r~0(x)}

their semantics can deduce that {\tt r} must end up holding {\tt 0x03021100}.
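
(For the record, that value follows from the Power architecture being big-endian: the word store puts its most significant byte at the lowest address, the byte store then overwrites the byte at {\tt x}+2, and the word load reassembles the four bytes. Here is a small single-threaded C sketch of my own, not from the paper, that replays the three accesses byte by byte and prints the deduced value.)

#include <stdio.h>
#include <stdint.h>

/* Illustrative sketch (mine, not the paper's): replay the three accesses on
   a four-byte buffer, using Power's big-endian layout in which the most
   significant byte of a word sits at the lowest address. */
int main(void) {
  uint8_t mem[4];

  /* stw 0x03020100 0(x): big-endian word store */
  mem[0] = 0x03; mem[1] = 0x02; mem[2] = 0x01; mem[3] = 0x00;

  /* stb 0x11 2(x): single-byte store at x+2 */
  mem[2] = 0x11;

  /* lwz r 0(x): big-endian word load */
  uint32_t r = ((uint32_t)mem[0] << 24) | ((uint32_t)mem[1] << 16)
             | ((uint32_t)mem[2] << 8)  |  (uint32_t)mem[3];

  printf("r = 0x%08x\n", (unsigned) r);   /* prints r = 0x03021100 */
  return 0;
}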

Like much work on weakly-consistent memory, the semantics is based on execution graphs. An execution graph is like a trace of a program, in that it comprises a set of runtime events; unlike a normal trace, the events do not occur in a simple sequence, but instead form the vertices of a mathematical graph. Relating the events are various kinds of arrow. Program order arrows capture the order in which the corresponding instructions are written in each thread of the source code; reads-from arrows show store events passing values to load events; and coherence order arrows depict the order in which stored values reach main memory.

Flur et al.’s paper uses conventional diagrams to depict these executions. For instance, the following diagram from their paper captures an execution of the program described above. Store events are labelled with a ‘W’, load events are labelled with an ‘R’, the black arrows are program order, brown arrows are coherence order (co), and red arrows are reads-from (rf).

mixed-size1

Note that some arrows are annotated with the set of memory locations that they are concerned with; for instance, the left-hand rf arrow is annotated with {\tt x}/2 and {\tt x}+3/1 to convey, respectively, that two bytes are being read starting from location {\tt x}, and that one byte is being read starting from the third byte after {\tt x}.

As such, these arrows are naturally associated with some sort of ‘thickness’. With this in mind, I propose the following alternative diagram for this execution.

mixed-size2.png

In my diagram, the reads-from relation and the coherence order are shown using red and blue ‘ribbons’ (respectively); program order is still depicted using black arrows. The position and thickness of a ribbon as it meets the rectangular outline of an event conveys which bytes the ribbon is concerned with. For instance, the load event can be seen to obtain its first two bytes from the {\tt W x/4} event, since that ribbon attaches to the first half of the load event’s top edge. The third byte comes from the {\tt W x+2/1} event, and the fourth comes from the {\tt W x/4} event. In this way, my diagram conveys the same information as Flur et al.’s original diagram, but requires much less text.

Unlike arrows, ribbons have no inherent directionality, but my diagram remains unambiguous because coherence ribbons always start at the bottom edge of a store event and end at the top edge of another store event, and because reads-from ribbons always start at the right-hand edge of a store event and end at the top edge of a load event.

Here are two further examples. The first is an execution that resembles the classic message-passing test: if one thread performs a double-word store, can another thread use two single-word loads to observe that the two words are not written simultaneously?

mixed-size3.png

In my diagrammatic system, this execution would be drawn as follows.

mixed-size4.png

The last example is another execution from the message-passing family: if one thread writes to two consecutive words using two consecutive store instructions, can another thread use one double-word load to observe the two stores taking effect out-of-order?

mixed-size5.png

In my diagrammatic system, this execution would be drawn as follows.

mixed-size6.png

In conclusion, I think this is quite a neat way to draw execution graphs for mixed-size concurrency. The main limitation is, of course, scalability: even for very small examples these diagrams need a lot of space, and a lot of attention must be paid to finding a good layout; this problem will surely make this diagrammatic system unmanageable for even slightly bigger examples. Still, a fun experiment!

All the LaTeX source code for making the ribbon diagrams in this post is available in my GitHub repository.

 

What follows is my attempt to summarise, for a lay audience, the main contributions of a paper I wrote with Nadesh Ramanathan, Shane Fleming, and George Constantinides for the FPGA 2017 conference.

Languages like C and C++ allow programmers to write concurrent programs. These are programs whose instructions are partitioned into multiple threads, all of which can be run at the same time. To avoid problematic situations where instructions in two different threads want to access the same piece of memory at the same time, conventional concurrent programs use locks. These work in a straightforward way: shared pieces of memory can only be accessed by a thread that holds the appropriate lock, and only one thread can hold a lock at a time. Yet acquiring and releasing locks imposes significant overhead. This has led to the emergence of the lock-free concurrent programming style, which manages to avoid locks by very carefully orchestrating each thread’s sequence of memory accesses.

It is well-known how to translate lock-free C++ programs into machine code that can be understood by conventional processors. But C++ programs can also be translated directly into hardware circuitry, and (until now) little attention has been paid to the question of how to create hardware for the class of lock-free C++ programs. This is a shame because, in the realm of conventional processors, lock-free programs have repeatedly been shown to outperform their lock-based counterparts.

In this work, we start with a tool, called LegUp, for translating C++ into hardware, and extend it so that it can cope with lock-free programs. We evaluated our work by using LegUp to create a piece of hardware from a widely-used lock-free program called a single-producer single-consumer circular queue, both before and after adding our extension. We found that the hardware generated by our extended LegUp is roughly twice as fast as that generated by the original tool (which relies on locks).

Going further, we investigate how to generate hardware for lock-free C++ programs that exploit relaxed concurrency. In relaxed concurrency, the programmer writes their code in such a way that the program will still work if certain instructions are performed out-of-order. Although programming in this way is very challenging, it can give the machine that runs the code the flexibility to rearrange the instructions into whichever order is the most efficient. Like many modern lock-free programs, the circular queue mentioned above is written using the relaxed concurrency style, but the original LegUp tool does not exploit the reordering opportunities that this affords (and nor do any other hardware synthesis tools of which we are aware).
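
To give a flavour of what the relaxed concurrency style looks like, here is a minimal C11 sketch of a single-producer single-consumer circular queue (my own simplified version; the names and the fixed size are illustrative, and it is not the exact benchmark used in the paper). The two threads coordinate through acquire/release operations on the head and tail indices rather than through a lock, which leaves the remaining accesses free to be reordered.

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative sketch of a lock-free single-producer single-consumer
   circular queue in the relaxed concurrency style (not the paper's exact
   benchmark). The producer owns 'tail', the consumer owns 'head'. */
#define QUEUE_SIZE 16

static int buf[QUEUE_SIZE];
static atomic_size_t head, tail;

bool enqueue(int v) {                 /* called only by the producer thread */
  size_t t = atomic_load_explicit(&tail, memory_order_relaxed);
  size_t h = atomic_load_explicit(&head, memory_order_acquire);
  if ((t + 1) % QUEUE_SIZE == h) return false;            /* queue is full  */
  buf[t] = v;
  atomic_store_explicit(&tail, (t + 1) % QUEUE_SIZE, memory_order_release);
  return true;
}

bool dequeue(int *v) {                /* called only by the consumer thread */
  size_t h = atomic_load_explicit(&head, memory_order_relaxed);
  size_t t = atomic_load_explicit(&tail, memory_order_acquire);
  if (h == t) return false;                               /* queue is empty */
  *v = buf[h];
  atomic_store_explicit(&head, (h + 1) % QUEUE_SIZE, memory_order_release);
  return true;
}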

In our work, we further extended LegUp so that it can detect when it is given a program that uses relaxed concurrency, and can generate more efficient hardware accordingly. On our circular queue case study, we found that making LegUp sensitive to relaxed concurrency enabled it to generate hardware that is roughly half as fast again (i.e., three times faster than the original lock-based hardware).

Software-to-hardware translators, such as LegUp, are becoming increasingly important as “programmable hardware” devices (called FPGAs) become increasingly powerful and popular. Our work has shown that it is possible to use these translators not just on run-of-the-mill sequential programs, but on high-performance programs that exploit adventurous features such as lock-freedom and relaxed concurrency.

My POPL 2017 paper with Mark Batty, Tyler Sorensen, and George Constantinides is all about memory consistency models, and how we can use an automatic constraint solver to compare them. In this blog post, I will discuss:

  • What are memory consistency models, and why do we want to compare them?
  • Why is comparing memory consistency models difficult, and how do we get around those difficulties in our work?
  • What can we achieve, now that we can compare memory consistency models automatically?

What is a memory consistency model?

Modern computers achieve blistering performance by having their computational tasks divided between many ‘mini-computers’ that run simultaneously. These mini-computers all have access to a collection of shared memory locations, through which they can communicate with one another.

It is tempting to assume that when one of the mini-computers stores some data into one of these shared memory locations, that data will immediately become visible to any other mini-computer that subsequently loads from that location. This assumption, together with the assumption that each mini-computer performs its stores and loads in the same order as they were written by the programmer, encodes the simplest memory consistency model. The model is called sequential consistency, and was first formalised by Leslie Lamport in 1979.

However, modern computers break several of the assumptions that sequential consistency makes. On an x86 computer, for instance, one mini-computer’s store to memory is not immediately visible to all the other mini-computers, because the data to be stored is held in a ‘write buffer’ before reaching the main memory. Other computers, such as the Power multiprocessors made by IBM, or the graphics processors made by NVIDIA or AMD, incorporate further architectural optimisations that confound sequential consistency. Also responsible for breaking the assumptions of sequential consistency are compilers. Compilers take the programmer’s code, which may be written in a language like C, C++, Java, or OpenCL, and translate it into code that computers can run. They routinely reorder instructions when they judge it efficient (and safe) to do so, and this can result in loads and stores being performed in a different order from that specified by the programmer.

Accordingly, modern computers (like x86 and Power multiprocessors, and NVIDIA and AMD graphics processors) and modern programming languages (like C and OpenCL) define memory consistency models that are more permissive (i.e. allow more behaviours) than sequential consistency. As an example, consider the following program, which involves two threads separated by a double vertical line (each thread is carried out by a different mini-computer) and two shared memory locations {\tt x} and {\tt y}.


sb_code

The left-hand thread stores 1 in location {\tt x} and then loads the value of {\tt y} into the variable {\tt r0}. Symmetrically, the right-hand thread stores to {\tt y} and then loads from {\tt x}. If this program were executed on a computer that conformed to sequential consistency, only three outcomes would be possible: either the left-hand thread finishes before the right-hand thread starts (which results in {\tt r0}=0 and {\tt r1}=1), or the right-hand thread finishes before the left-hand thread starts (which results in {\tt r0}=1 and {\tt r1}=0), or both stores finish before either load starts (which results in {\tt r0}=1 and {\tt r1}=1). The x86 memory consistency model, however, permits a fourth possibility (which is experimentally observable on actual x86 machines): {\tt r0}=0 and {\tt r1}=0. This surprising outcome can be explained by both stores being performed simultaneously but their effects on main memory being delayed by the aforementioned write buffer, leaving the loads able to observe the initial values for both locations. An alternative explanation for this behaviour is in terms of compilation: if an optimising compiler judges that the two instructions in each thread are independent of each other and can be reordered, then both {\tt r0} and {\tt r1} can be assigned 0.
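
For readers who prefer code to pictures, here is one way of writing the program above down as a runnable C11 program (my own rendering, not a figure from the paper). It uses relaxed atomic operations, which, like the x86 memory consistency model, permit the fourth outcome; the C11 threads library (<threads.h>) is assumed to be available.

#include <stdatomic.h>
#include <stdio.h>
#include <threads.h>

/* My C11 rendering of the store-buffering program above. Relaxed atomics,
   like the x86 memory consistency model, permit the surprising outcome
   r0 == 0 and r1 == 0. */
atomic_int x, y;
int r0, r1;

int left(void *arg)  {                /* x = 1; r0 = y; */
  (void)arg;
  atomic_store_explicit(&x, 1, memory_order_relaxed);
  r0 = atomic_load_explicit(&y, memory_order_relaxed);
  return 0;
}

int right(void *arg) {                /* y = 1; r1 = x; */
  (void)arg;
  atomic_store_explicit(&y, 1, memory_order_relaxed);
  r1 = atomic_load_explicit(&x, memory_order_relaxed);
  return 0;
}

int main(void) {
  thrd_t t1, t2;
  thrd_create(&t1, left, NULL);
  thrd_create(&t2, right, NULL);
  thrd_join(t1, NULL);
  thrd_join(t2, NULL);
  printf("r0=%d r1=%d\n", r0, r1);    /* r0=0 r1=0 is one of the allowed outcomes */
  return 0;
}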

How are memory consistency models defined?

Beyond specific examples such as the one above, how does one determine in general which behaviours are allowed or disallowed when running a program under a given memory consistency model? One of the main approaches is the axiomatic approach. The axiomatic approach to determining allowed program behaviours has two phases. In the first phase, the program is translated into its set of candidate executions, each of which represents one way the program might behave when it is run. For instance, the program drawn above has the following four candidate executions.

sb_candidates

Each execution is composed of a set of events joined with arrows. Each event represents a memory-related action being performed by the computer. Store instructions give rise to write (W ) events, and load instructions give rise to read (R) events. Each event is tagged with the location it accesses (e.g. {\tt x} or {\tt y}) and the value it reads or writes (e.g. 0 or 1). The dotted rectangles group events into different threads. The black ‘sb’ arrows (which stands for sequenced before) capture the order of the instructions in the original program. For instance, the “{\tt x=1}” instruction is ordered before the “{\tt r0=y}” instruction in the original program, so each of the candidate executions above places an ‘sb’ arrow from the write of {\tt x} to the read of {\tt y}. The red ‘rf’ arrows (which stands for reads from) capture how data moves between the events of the execution: each read event either is pointed to by an ‘rf’ arrow that originates at the write event from which the read event obtains its value, or has no ‘rf’ arrow pointing to it, which means that the read event observes the default value of 0. The candidate executions above cover all the possible ways to place the ‘rf’ arrows. Finally, the purple ‘fr’ arrows (which stands for from reads) link each read event to all the write events that overwrite the data that the read event observed. For instance, in the bottom-left execution above, the right-hand thread’s read of {\tt x} observes the initial value, 0. This value is overwritten when the left-hand thread writes {\tt x} to 1, so we draw an ‘fr’ edge from the read to the write.

In the second phase, the set of candidate executions is pruned, to leave just the executions that the memory consistency model allows. The pruning process is done on an execution-by-execution basis. Each candidate execution is checked for certain patterns, and as long as it does not exhibit any patterns that the memory consistency model deems inconsistent, it is allowed. For instance, sequential consistency states that it is forbidden for ‘fr’ and ‘sb’ arrows to form loops, as they do in the bottom-right execution above. It is for this reason that sequential consistency forbids the outcome {\tt r0}=0 and {\tt r1}=0. The x86 memory consistency model, on the other hand, does not impose this particular rule, which results in that outcome being allowed.

Actually, the situation is a little more complicated than this. In programming languages like C or OpenCL, if any of a program’s consistent candidate executions is found to contain a data race (which roughly means that one mini-computer can be accessing a memory location at the same time as another mini-computer is storing a value to that location) then all executions become allowed. The idea is that if programmers write code that is liable to cause a data race, then no guarantees can be made about what will happen when that code is run. For an example of this, consider the C program below.

mp_code

The left-hand thread stores 1 to location {\tt x}, and then performs a special releasing store to location {\tt y}. Meanwhile, the right-hand thread performs a special acquiring load from {\tt y} and then an ordinary load from {\tt x}. (NB: Releasing and acquiring accesses by different threads are allowed to proceed simultaneously, without causing a data race.) The four candidate executions of this program are given below.

mp_candidates

According to C’s memory consistency model, all of these executions are consistent except for the bottom-left one. The reason for this has to do with a mechanism called release/acquire synchronisation. In C, if an acquiring load observes the value written by a releasing store, then all the events that happened prior to the releasing store are guaranteed to be visible to all the events that happen after the acquiring load. The bottom-left execution above is deemed inconsistent because even though the release/acquire synchronisation on location {\tt y} has been successful (as can be seen by the presence of the ‘rf’ arrow), the right-hand thread does not see that {\tt x} has been set to 1.

However, a more worrying aspect of this program is that the top-left consistent execution contains a data race. In this execution, the release/acquire synchronisation fails because the acquiring load does not see the data from the releasing store. Nonetheless, the right-hand thread goes ahead and loads from {\tt x}. In the absence of the release/acquire synchronisation, there is nothing to stop the load and the store on {\tt x} coinciding, and since these accesses are not special acquiring or releasing instructions, they cause a data race. This means that the program above can exhibit any behaviour at all – even the behaviour that we just deemed inconsistent!
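
For reference, here is roughly how the program above might be written in C11 (my own reconstruction of the program in the figure, with the thread-creation boilerplate omitted). Note that the right-hand thread loads {\tt x} unconditionally, which is exactly what makes the data race possible.

#include <stdatomic.h>

/* My reconstruction of the message-passing program above: x is an ordinary
   (non-atomic) location, y is used for release/acquire synchronisation. */
int x;
atomic_int y;

void left_hand_thread(void) {
  x = 1;                                                /* ordinary store  */
  atomic_store_explicit(&y, 1, memory_order_release);   /* releasing store */
}

void right_hand_thread(void) {
  int r0 = atomic_load_explicit(&y, memory_order_acquire);  /* acquiring load */
  int r1 = x;     /* ordinary load, performed unconditionally, so it can      */
  (void)r0;       /* race with the left-hand thread's store to x              */
  (void)r1;
}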

Why do we want to compare memory consistency models?

There are three main reasons for wanting to compare memory consistency models.

The first reason has to do with the fact that memory consistency models are continuing to evolve. For instance, several researchers have proposed amendments to the C memory consistency model, either to fix quirks in the model, or to give it nicer mathematical properties, or just to make it a bit simpler for programmers to understand. Each time a modification is proposed, it is important to compare it to the original model, to understand exactly which program behaviours become newly-allowed or newly-disallowed.

The second reason goes beyond just comparing two variants of the same memory consistency model, and instead looks at models that exist at different levels of abstraction. For instance, a comparison between the C memory consistency model and the x86 memory consistency model sheds light on how the code written by a C programmer should be compiled down to code that can be run by an x86 computer. A fundamental tenet of compilation is that a correct compiler never introduces new behaviours. In other words, if the C memory consistency model says that a certain behaviour of a program is not allowed, but the x86 memory consistency model says that the compiled program can exhibit that behaviour, then the compiler is wrong. In general, by comparing what a language-level memory consistency model says about programs with what the computer-level memory consistency model says about compiled programs, we have the opportunity to discover errors in compilers. These errors are difficult to find by experimental testing because they may only show up intermittently, and they are difficult to find by manually inspecting the compiler because the memory consistency models involved are so complex and counterintuitive. Nonetheless, such errors are prevalent in real-world compilers; for instance, the standard scheme for compiling C to Power multiprocessors has recently been shown to be flawed, and a paper of mine (with Mark Batty, Brad Beckmann, and Ally Donaldson) identified two flaws in a compiler from OpenCL to AMD’s graphics processors.

The third reason involves comparing a memory consistency model to itself. As discussed above, we can check that a compiler is correctly mapping high-level language instructions to low-level computer instructions by comparing what the language-level memory consistency model says about programs with what the computer-level memory consistency model says about compiled programs. But before compilers perform the final mapping, they tweak the input program in an attempt to make it more efficient. Just like the final mapping, these tweaks (which are commonly called compiler optimisations) should never introduce new behaviours. In other words, if a memory consistency model says that a certain behaviour of a program is not allowed, but that this behaviour becomes allowed once the program is optimised, then the optimisation is wrong. Thus, by comparing which behaviours a memory consistency model allows before and after optimising, we have the opportunity to identify flaws in compiler optimisations. As with the compiler mappings discussed above, flaws in compiler optimisations are tricky to find, but are certainly prevalent. For instance, Viktor Vafeiadis and colleagues demonstrated last year that a whole host of optimisations routinely used by C compilers are actually not permitted by the C memory consistency model, because they can introduce new behaviours when performed on some (extremely contrived!) programs.

Why is comparing memory consistency models difficult?

Comparing two memory consistency models boils down to finding a litmus test that can tell them apart. A litmus test is just a program (typically a rather small program) together with an outcome. We say that the test passes if, when the program is run, the specified outcome is obtained. Litmus tests typically do not behave in the same way each time they are run, so we talk of tests that can never pass, or sometimes pass, or always pass. For instance, the first program we showed above, together with the outcome “{\tt r0}=0 and {\tt r1}=0”, forms a litmus test – a litmus test that never passes under sequential consistency, and sometimes passes under the x86 memory consistency model.

To compare two variants of a memory consistency model, say M and N, we search for a litmus test that never passes under M but sometimes passes under N. If we are successful in this search, we have shown that N permits a behaviour that M forbids, and hence that the two models genuinely differ. To check a compiler optimisation, we search for a litmus test that never passes, but which can pass after being optimised. And to check a compiler mapping, we search for a litmus test that never passes (under the language-level memory consistency model), but which can pass (under the computer-level memory consistency model) after the compiler mapping is applied.

In all three cases, the problem has the same ‘shape’: we are always looking for programs P and Q, and a final outcome \sigma, such that P and Q are related in some way (either they are the same, or P optimises to Q, or P is compiler-mapped to Q), the litmus test (P,\sigma) never passes, and the litmus test (Q,\sigma) sometimes passes. In the diagram below, we use a black triangle to represent the appropriate relationship between P and Q.

keyidea2_1

We can plug these requirements into an automatic constraint solver and ask it to come up with suitable litmus tests. However, it turns out that these requirements are extremely hard for automatic constraint solvers to handle. Showing that P and Q are related is easy, and showing that the litmus test (Q,\sigma) sometimes passes is also easy: one just needs to demonstrate one execution of Q that ends up with the desired outcome \sigma. But showing that (P,\sigma) never passes is difficult, because one needs to demonstrate that every execution of P  – of which there may be a great number – does not end up with the outcome \sigma. In our initial experiments, we found it infeasible to compare even the simplest of memory consistency models using this approach.

How do we get around this difficulty in our work?

One of the key ideas in our work is to flip the diagram above on its head. Rather than asking the constraint solver to find P and Q directly, we start by asking it to find executions X and Y. We require that X is inconsistent under memory consistency model M, that Y is consistent under N, and that X and Y bear a certain relation to each other that we will explain in a moment. If the solver can find such a pair of executions, then we reverse-engineer those executions to form a pair of litmus tests. That is, from X we construct a litmus test (P,\sigma) that has X as one of its candidate executions, and from Y we construct a litmus test (Q,\sigma) that has Y as one of its candidate executions. This construction process is fairly straightforward: read events in the execution become load instructions in the litmus test, write events become stores, sequencing arrows in the execution determine the order of instructions in the litmus test, and reads-from arrows are reflected in the litmus test’s final outcome. We mentioned above that X and Y must bear a certain relation to each other; this relation, which we draw as a white triangle in the diagram below, serves to ensure that the constructed litmus tests have the same final outcome \sigma, and that the constructed litmus tests are related to each other in the required way (e.g. they are the same, or one optimises to the other, or one is compiler-mapped to the other).

keyidea2_2

At this point, we have obtained, as desired, a litmus test (Q,\sigma) that sometimes passes  – we know this because the test has Y as one of its candidate executions and Y is consistent. We have also obtained a litmus test (P,\sigma) that sometimes fails – we know this because the test has X as one of its candidate executions and X is inconsistent. But we required that (P,\sigma) always fails! As things stand, it may be that (P,\sigma) can pass by following a different execution, say X', that is allowed.

keyidea2_3

We now give an example of exactly this situation. Suppose our constraint solver finds the execution below, which is inconsistent according to the C memory consistency model, and which we shall call X.

inconsistent_live

This execution we have seen already: it is one of the candidate executions of our message-passing program. We go on to convert X into a litmus test in the manner described above, and this results in the following.

mp_litmus

This litmus test we have also seen already: it is the message-passing program from earlier. And as we remarked earlier, the program exhibits a data race, and therefore all outcomes are allowed. For this reason, we have certainly not found a litmus test that always fails.

We are able to resolve this situation simply by imposing extra constraints when we generate X. One of the constraints we impose is that if an execution averts a data race via the release/acquire synchronisation mechanism, then one of the events in the potential data race must be conditional on the acquiring load successfully observing the value written by the releasing store. An example of an execution that meets this additional constraint is given below.

inconsistent_dead

The extra ‘cd’ arrow represents a control dependency. It expresses that the read of {\tt x} only happened because the read of {\tt y} obtained the value that it did. When we transform executions into litmus tests, these arrows give rise to if-statements, as shown below.

mp_litmus_fixed

This litmus test does not give rise to any data races, because the right-hand thread’s load of {\tt x}, which could potentially race with the left-hand thread’s store to {\tt x}, only happens if the release/acquire synchronisation on {\tt y} is successful, and when this release/acquire synchronisation is successful it enforces a race-preventing ‘barrier’ between the accesses to {\tt x}.
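
In C11 terms, the repair amounts to guarding the load of {\tt x} with an if-statement, roughly as in the sketch below (again my own rendering rather than the paper's exact test; it reuses the declarations of {\tt x} and {\tt y} from the earlier sketch).

#include <stdatomic.h>

extern int x;          /* ordinary location, written by the left-hand thread */
extern atomic_int y;   /* release/acquire synchronisation location           */

/* The 'cd' arrow becomes an if-statement: x is only loaded when the
   acquiring load of y has seen the releasing store's value. */
void right_hand_thread_repaired(void) {
  int r0 = atomic_load_explicit(&y, memory_order_acquire);
  int r1 = 0;
  if (r0 == 1)       /* control dependency: only touch x if the        */
    r1 = x;          /* release/acquire synchronisation succeeded      */
  (void)r1;
}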

Besides this constraint that governs release/acquire synchronisation, we impose a few more to deal with similar problematic situations (about which more details are available in the full paper). Collectively, we say that these additional constraints enforce that the execution X is dead. We use this slightly overdramatic term to indicate that not only is the execution X forbidden, but so is every other execution of the litmus test obtained from X.

It is possible that by adding the deadness constraint, we miss some litmus tests that would distinguish two memory consistency models. We have made every effort to build the deadness constraint so that it removes only those executions that are genuinely problematic, but we cannot be sure that it is not a little overenthusiastic in rejecting executions. This caveat notwithstanding, the approach we have described – which involves finding individual executions (one of which is dead and inconsistent, the other of which is consistent) and then constructing litmus tests from them – yields a fast and useful way to compare memory consistency models automatically.

What can we achieve, now that we can compare memory consistency models automatically?

Comparing variants of a model

The first thing we can do is automatically compare two variants of the same memory consistency model. Several researchers have proposed variants of the C memory consistency model, and each proposal is illustrated with a litmus test that can pass or can fail as a result of the change. The shortest litmus test that is able to distinguish the variants is often the most informative. We have used our approach to compare the original C memory consistency model against three recently-proposed variants, and found that in two of these cases, our automatic approach was able to distinguish them with shorter litmus tests than the authors of the proposals were able to come up with manually.

As an example, here is an execution (and accompanying litmus test) that I came up with, by dint of manual effort, for distinguishing the original C memory consistency model from a variant proposed by Batty, Donaldson and myself in a paper presented earlier this year. For various technical reasons that I will not go into here, the litmus test can pass under the original model, but cannot under the revised model.

c11simp_manual

And here is the execution (and accompanying litmus test) that our automatic approach came up with for distinguishing the same memory consistency models. Note that the automatically-found litmus test requires just five instructions where the manually-found one needs seven.

c11simp_auto

Checking compiler optimisations

The second thing we can do is check compiler optimisations. As mentioned above, Viktor Vafeiadis and colleagues demonstrated that several optimisations routinely employed in C compilers are invalid according to the C memory consistency model. We have subjected these optimisations to an automatic check using our approach, and were able to generate many of the same results that Vafeiadis and colleagues discovered through manual effort. And as before, our automatically-found litmus tests are often simpler than the manually-found originals.

To illustrate this, here is an example given by Vafeiadis and colleagues that shows that it is not valid for a C compiler to replace a ‘relaxed’ store (a relaxed store is another special kind of store instruction that we have not mentioned before) with a releasing store. The litmus test given below always fails under the C memory consistency model, but if the highlighted relaxed store is replaced with a releasing store, then it sometimes passes.

strengthening_manual

And below is our automatically-generated version. This time, the relaxed-to-releasing replacement is done on a special fence instruction rather than a store, but the spirit of the optimisation is the same. Note that it needs just seven instructions (rather than eight), three memory locations (rather than four), and two threads (rather than three).

strengthening_auto

Checking compiler mappings

The third thing we can do is check compiler mappings. Here, we go beyond recreating (and simplifying) known results, and use our approach to guide the development of a new memory consistency model. We used our approach to check a compiler mapping for translating instructions in the OpenCL programming language into instructions that can be understood by NVIDIA’s graphics processors. My work with Mark Batty and Ally Donaldson provides a precise OpenCL memory consistency model, and my earlier work (with Jade Alglave, Mark Batty, Ally Donaldson, Ganesh Gopalakrishnan, Jeroen Ketema, Daniel Poetzl and Tyler Sorensen) provides a memory consistency model for NVIDIA graphics processors. Through an automatic comparison of these two memory consistency models, we were able to discover an error in the mapping.

We blamed the error not on the mapping itself, but on the memory consistency model for NVIDIA graphics processors being unnecessarily permissive. That is: the model is experimentally sound with respect to the processors (because all the behaviours that were actually observed were indeed allowed by the model), but it also allows some behaviours that could never be observed. So, we devised a slightly more restrictive memory consistency model for NVIDIA graphics processors, and re-checked the compiler mapping. Happily, we were unable to discover any errors this time.

It remained to be shown that the more restrictive model is still experimentally sound with respect to the actual graphics processors. To this end, we used our approach to compare the original NVIDIA memory consistency model with our revised one; this time asking not for a single litmus test that can distinguish them, but for all such tests (up to a reasonable size). We ran these tests on actual NVIDIA graphics processors (in the same way as we did when devising the original model), and were unable to find any examples of unsoundness in our revised model. We thus argue that our revised memory consistency model, developed with the aid of our automatic comparison approach, is both permissive enough to capture all the observable behaviours of NVIDIA graphics processors and restrictive enough to support an efficient compiler mapping from the OpenCL programming language.

Next steps

Our full paper contains more examples of comparing variants of memory consistency models, more examples of checking compiler optimisations, and more examples of checking compiler mappings. It also provides a full definition of the ‘dead’ constraint, and the exact process we use for transforming executions into litmus tests.

The automatic constraint solver that we use is called Alloy. All of the memory consistency models, compiler optimisations, and compiler mappings that we have mentioned above have been written out in the Alloy format, and are all available to download from a GitHub repository.

 

It is quite well-known that the memory operations and fences provided by the C/C++11 languages are pretty complicated and confusing. For instance, my OOPSLA 2015 paper (joint with Batty, Beckmann, and Donaldson) explains that a proposed implementation of these instructions for next-generation AMD graphics cards is faulty because the designers misunderstood how they worked. And back in 2011, Batty et al. showed that (an early draft of) the C++11 language failed to enforce a desirable property called ‘sequential consistency for data-race-free programs’, as a result of a subtlety in the way some of these memory operations were being defined.

Today I found another example of the confusion arising from memory operations and fences in C/C++11. The cppreference.com website is a long-established, well-regarded, and widely-used reference for C and C++ programmers, but its description of how the “acquire/release” memory operations and fences work is quite wrong.

The website claims that acquire loads cannot be reordered with subsequent reads:

Screen Shot 2016-08-11 at 11.40.35.png

and that release stores cannot be reordered with preceding writes. It also claims that a store that follows a release fence cannot be reordered with any write that precedes the fence:

Screen Shot 2016-08-11 at 12.44.05.png

This is incorrect (or more precisely, it is not the whole truth). The truth, according to the official C/C++11 language specifications, is that

  • acquire loads cannot be reordered with any subsequent memory accesses (not just reads), and
  • release stores cannot be reordered with any preceding memory accesses (not just writes).

Similarly,

  • release fences prevent stores that follow the fence being reordered with any memory access that precedes the fence (not just writes), and
  • acquire fences prevent loads that precede the fence being reordered with any memory access that follows the fence.

Here is an example that demonstrates the different interpretations of acquire loads. Suppose we have two threads, written below in the input syntax of the CppMem tool (the {{{ … ||| … }}} brackets denote parallel composition):

int main() {
  atomic_int x=0; atomic_int y=0;
  {{{ {
    // Thread 1:
    int r0 = atomic_load_explicit(&x, memory_order_relaxed); //a
    atomic_store_explicit(&y, 1, memory_order_release);      //b
  } ||| {
    // Thread 2:
    int r1 = atomic_load_explicit(&y, memory_order_acquire); //c
    atomic_store_explicit(&x, 1, memory_order_relaxed);      //d
  } }}}
}

Can we obtain r0 = 1 and r1 = 1 at the end? According to the official standard, the answer is no. This can be confirmed by copying the code above into the CppMem tool (which enumerates all the behaviours of a small C/C++ program that are allowed by the standard).

But if we follow the guidelines from cppreference.com and allow the acquire load (c) to be reordered with the subsequent relaxed store (d), then it is straightforward to obtain r0 = r1 = 1, simply by executing the instructions in the following order: d, a, b, c.

According to cppreference.com, acquire/release instructions are less powerful than they are in the official standard. This is problematic for programmers, because they might end up using stronger instructions than they really need, which may make their programs less efficient. It’s even more problematic for compiler-writers, because if they just follow the interpretation on cppreference.com, they might end up providing acquire/release instructions that are weaker than programmers expect, which means that correctly-written programs could give the wrong results when they are executed.

I believe that normalised data should be plotted on a logarithmic scale.

(By “normalised data”, I refer to data that is the ratio between two measurements that have the same dimension. Examples of such data include

  • the ratio between the execution time of a program before a proposed compiler optimisation has been applied and the execution time of that program afterwards, and
  • the power consumption of a proposed computer divided by the power consumption of a reference computer.)

In the vast majority of research papers published in computer architecture in the last year, when this data is plotted on a graph, a linear scale is used. I believe that a logarithmic scale would be more appropriate most of the time.

Why a logarithmic scale is appropriate for normalised data

I will illustrate my reasons with reference to the two graphs below, which show some sort of “speedup” that has been obtained on four benchmark programs, A, B, C, and D. The left graph uses a linear y-axis, while the right one plots the same data on a logarithmic y-axis.

Screen Shot 2016-08-06 at 20.09.15.png

  1. The natural origin for a speedup ratio is 1, not 0. That is, we are primarily interested in seeing whether a data point lies above or below 1 (i.e., whether a speedup or a slowdown was observed) rather than whether it lies above or below 0. In the right-hand graph above, it is immediately obvious that A and B experience a slowdown; this is slightly less obvious in the left-hand graph.
  2. Going from a 1x speedup to a 2x speedup is surely more impressive than going from a 3x speedup to a 4x speedup. But on the linear y-axis in the left-hand graph above, the distance between 1 and 2 is the same as the distance between 3 and 4, so these feats appear equally impressive.
  3. Obtaining a 2x speedup is likely to be considered just as good as obtaining a 2x slowdown (i.e., a 0.5x speedup) is bad. But on the linear y-axis in the left-hand graph above, the distance from 1 to 0.5 is much smaller than the distance from 1 to 2, so the speedup experienced by benchmark C is emphasised over the slowdown experienced by benchmark B, even though both have the same magnitude.
  4. On a linear scale, the “centre of gravity” to which the eye is drawn lies at the arithmetic mean, while on a logarithmic scale, the centre of gravity is at the geometric mean. When averaging dimensionless ratios, most authors tend to use the geometric mean.

Caveat. One should be careful when the ratio being plotted could be zero, for then the geometric mean will degenerate to zero (since it involves the product of the individual ratios). The log scale will also fail to handle such cases because the logarithm of zero is negative infinity (which tends not to fit on the page).

Remark. There is considerable debate on the relative merits of the arithmetic mean, geometric mean, and harmonic mean when summarising speedup figures. (The debate is well summarised by Citron et al. and by Eeckhout.) I argue that whenever the geometric mean is deemed a suitable average – and it appears to be the most prevalent among the three means when summarising speedup figures – then a logarithmic scale should be used for the same reasons.
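
To make the difference between the two means concrete, here is a small sketch with made-up speedup figures (purely illustrative, roughly in the spirit of the graphs above): the arithmetic mean suggests an overall speedup, whereas the geometric mean, which is the exponential of the mean of the logarithms and therefore the natural companion of a logarithmic axis, reports that the gains and losses cancel out.

#include <stdio.h>
#include <math.h>

/* Illustrative only: four made-up speedup ratios for benchmarks A-D. */
int main(void) {
  double speedup[] = { 0.8, 0.5, 2.0, 1.25 };
  int n = (int)(sizeof speedup / sizeof speedup[0]);

  double sum = 0.0, logsum = 0.0;
  for (int i = 0; i < n; i++) {
    sum += speedup[i];          /* accumulate for the arithmetic mean */
    logsum += log(speedup[i]);  /* accumulate logs for the geometric mean */
  }

  printf("arithmetic mean = %.2f\n", sum / n);          /* approx. 1.14 */
  printf("geometric mean  = %.2f\n", exp(logsum / n));  /* 1.00         */
  return 0;
}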

Examples from recent research papers

In what follows, I consider sixteen papers that have been published this year in three of the top conferences in programming languages and computer architecture: PLDI 2016, POPL 2016, and ASPLOS 2016. Each paper includes a graph of some normalised data (typically some sort of “speedup” metric) that uses a linear scale. I argue that in each case, a logarithmic scale would be more appropriate.

  1. The following graph is from Yizhou Zhang et al. Accepting Blame for Safe Tunneled Exceptions in PLDI 2016. The authors report that the graph demonstrates an average speedup of 2.4%, and by examination, I deduce that this refers to the geometric mean. The y=1 line has been added by the authors to help the reader distinguish speedups from slowdowns; such an addition would be unnecessary if a logarithmic y-axis were used instead.
    Screen Shot 2016-08-02 at 12.58.31
  2. The following graph is from Michael D. Adams et al. On the Complexity and Performance of Parsing with Derivatives in PLDI 2016. The average speedup is given as 2.04, but it’s not clear whether this is the arithmetic or geometric mean.
    Screen Shot 2016-08-02 at 22.20.54
  3. The following graph is from James Bornholt et al. Optimizing Synthesis with Metasketches in POPL 2016.
    Screen Shot 2016-08-03 at 08.53.49.png
  4. The following graph is from Sergi Abadal et al. WiSync: An Architecture for Fast Synchronization through On-Chip Wireless Communication in ASPLOS 2016. This graph reports both the arithmetic and geometric mean. Note that if a logarithmic y-axis had been used, it would (probably) not be necessary to have several bars extending beyond the top of the graph.
    Screen Shot 2016-08-03 at 09.04.45.png
  5. The following graph is from Xiaodong Wang and José F. Martínez ReBudget: Trading Off Efficiency vs. Fairness in Market-Based Multicore Resource Allocation via Runtime Budget Reassignment in ASPLOS 2016. It reports the geometric mean.
    Screen Shot 2016-08-03 at 09.09.14.png
  6. The following graph is from Haishan Zhu and Mattan Erez Dirigent: Enforcing QoS for Latency-Critical Tasks on Shared Multicore Systems [Paywall] in ASPLOS 2016. Their “FG” figures represent time ratios, and are averaged using the arithmetic mean, while their “BG” figures represent rate ratios, and are averaged using the harmonic mean.
    Screen Shot 2016-08-03 at 09.13.24.png
  7. The following graph is from Anurag Mukkara et al. Whirlpool: Improving Dynamic Cache Management with Static Data Classification in ASPLOS 2016. No mean speedup is calculated for these measurements in the paper; only the ranges are reported.
    Screen Shot 2016-08-03 at 09.20.12.png
    The following graphs are from the same paper. For these graphs, the authors quote the geometric mean speedups.
    Screen Shot 2016-08-03 at 09.20.56.png
  8. The following graph is from Myeongjae Jeon et al. TPC: Target-Driven Parallelism Combining Prediction and Correction to Reduce Tail Latency in Interactive Services [Paywall] in ASPLOS 2016.
    Screen Shot 2016-08-03 at 09.23.11.png
  9. The following graph is from Tong Zhang et al. TxRace: Efficient Data Race Detection Using Commodity Hardware Transactional Memory [Paywall] in ASPLOS 2016. It gives the geometric mean.
    Screen Shot 2016-08-03 at 09.26.33.png
  10. The following graph is from Nils Asmussen et al. M3: A Hardware/Operating-System Co-Design to Tame Heterogeneous Manycores in ASPLOS 2016.
    Screen Shot 2016-08-03 at 09.29.12.png
  11. The following graph is from Daniyal Liaqat et al. Sidewinder: An Energy Efficient and Developer Friendly Heterogeneous Architecture for Continuous Mobile Sensing in ASPLOS 2016. No averages are given, only ranges.
    Screen Shot 2016-08-03 at 09.31.29.png
  12. The following graph is from Jonathan Balkind et al. OpenPiton: An Open Source Manycore Research Framework in ASPLOS 2016. No averages are given.
    Screen Shot 2016-08-03 at 09.34.48.png
  13. The following graphs are from Alex Markuze et al. True IOMMU Protection from DMA Attacks: When Copy is Faster than Zero Copy in ASPLOS 2016.
    Screen Shot 2016-08-03 at 09.37.40.png
  14. The following graph is from Amro Awad et al. Silent Shredder: Zero-Cost Shredding for Secure Non-Volatile Main Memory Controllers in ASPLOS 2016. This graph is quite interesting because the quoted 3.3x speedup is actually the arithmetic mean of all the per-benchmark speedups. The geometric mean would give an average speedup of 2.4x, and is probably the more appropriate measure in this case.
    Screen Shot 2016-08-03 at 09.40.59.png
    The same paper also reports a reduction in the number of writes to main memory (see the graph below). This graph could also use a logarithmic axis, but there is a problem with the H264 benchmark, which requires 0x as many writes to main memory when “Silent Shredder” is used – i.e., none at all. On a log scale, this data point would need to stretch to negative infinity. It also means that although calculating the arithmetic mean of these ratios feels suspect to me, the geometric mean does not work at all in this case.
    Screen Shot 2016-08-06 at 16.26.15.png
  15. The following graph is from Youngjin Kwon et al. Sego: Pervasive Trusted Metadata for Efficiently Verified Untrusted System Services in ASPLOS 2016. No overall averages are given.
    Screen Shot 2016-08-03 at 10.39.45.png
  16. The following graphs are from Saurav Muralidharan et al. Architecture-Adaptive Code Variant Tuning in ASPLOS 2016.
    Screen Shot 2016-08-03 at 10.46.38.png

Further reading

  • Beth C. Gladen and Walter J. Rogan On graphing rate ratios in the American Journal of Epidemiology, 1983. This article argues that relative rates should be plotted on logarithmic rather than linear scales.
  • James R. Hebert and Donald R. Miller Plotting and discussion of rate ratios and relative risk estimates in the Journal of Clinical Epidemiology, 1989. This article argues that relative rates should actually be plotted not on logarithmic scales but on reciprocal scales!

It’s pretty common for database-driven apps to display a “last updated” field. For instance, here is a screenshot of my Mail app on my iPhone, showing that my emails were “Updated Just Now”.

photo

And here is a screenshot from the RealTimeTrains app, showing that the information was “Updated a few seconds ago”.

photo 2

Why do apps insist on using relative timestamps like this? It’s all very well having a piece of text to say that the information was “just updated”, but how do I know that the piece of text itself is up-to-date?

(It’s a bit like those “Lost Cat” posters that say “missing since last Friday” and are then left up for months.)

If instead the text gave an absolute timestamp like “Updated at 07:57”, then I would know that the information is up to date, and that the app hasn’t simply gotten stuck.

I transcribed Barry Gray’s theme song from Stingray, Aqua Marina, into piano sheet music. There’s a vocal part, replete with lyrics, plus a keyboard part comprising a right-hand line and chord symbols. It roughly corresponds to what you might find in a musician’s “fake book”.

Available as a pdf file, as an editable Noteworthy Composer file, and as an mp3 audio file (recorded on my iPhone).