My POPL 2017 paper with Mark Batty, Tyler Sorensen, and George Constantinides is all about memory consistency models, and how we can use an automatic constraint solver to compare them. In this blog post, I will discuss:

  • What are memory consistency models, and why do we want to compare them?
  • Why is comparing memory consistency models difficult, and how do we get around those difficulties in our work?
  • What can we achieve, now that we can compare memory consistency models automatically?

What is a memory consistency model?

Modern computers achieve blistering performance by having their computational tasks divided between many ‘mini-computers’ that run simultaneously. These mini-computers all have access to a collection of shared memory locations, through which they can communicate with one another.

It is tempting to assume that when one of the mini-computers stores some data into one of these shared memory locations, that data will immediately become visible to any other mini-computer that subsequently loads from that location. This assumption, together with the assumption that each mini-computer performs its stores and loads in the same order as they were written by the programmer, encodes the simplest memory consistency model. The model is called sequential consistency, and was first formalised by Leslie Lamport in 1979.

However, modern computers break several of the assumptions that sequential consistency makes. On an x86 computer, for instance, one mini-computer’s store to memory is not immediately visible to all the other mini-computers, because the data to be stored is held in a ‘write buffer’ before reaching the main memory. Other computers, such as the Power multiprocessors made by IBM, or the graphics processors made by NVIDIA or AMD, incorporate further architectural optimisations that confound sequential consistency. Also responsible for breaking the assumptions of sequential consistency are compilers. Compilers take the programmer’s code, which may be written in a language like C, C++, Java, or OpenCL, and translate it into code that computers can run. They routinely reorder instructions when they judge it efficient (and safe) to do so, and this can result in loads and stores being performed in a different order from that specified by the programmer.

Accordingly, modern computers (like x86 and Power multiprocessors, and NVIDIA and AMD graphics processors) and modern programming languages (like C and OpenCL) define memory consistency models that are more permissive (i.e. allow more behaviours) than sequential consistency. As an example, consider the following program, which involves two threads separated by a double vertical line (each thread is carried out by a different mini-computer) and two shared memory locations {\tt x} and {\tt y}.


[Figure: the store-buffering program, with {\tt x=1; r0=y} in the left-hand thread and {\tt y=1; r1=x} in the right-hand thread]

The left-hand thread stores 1 in location {\tt x} and then loads the value of {\tt y} into the variable {\tt r0}. Symmetrically, the right-hand thread stores to {\tt y} and then loads from {\tt x}. If this program were executed on a computer that conformed to sequential consistency, only three outcomes would be possible: either the left-hand thread finishes before the right-hand thread starts (which results in {\tt r0}=0 and {\tt r1}=1), or the right-hand thread finishes before the left-hand thread starts (which results in {\tt r0}=1 and {\tt r1}=0), or both stores finish before either load starts (which results in {\tt r0}=1 and {\tt r1}=1). The x86 memory consistency model, however, permits a fourth possibility (which is experimentally observable on actual x86 machines): {\tt r0}=0 and {\tt r1}=0. This surprising outcome can be explained by both stores being performed simultaneously but their effects on main memory being delayed by the aforementioned write buffer, leaving the loads able to observe the initial values for both locations. An alternative explanation for this behaviour is in terms of compilation: if an optimising compiler judges that the two instructions in each thread are independent of each other and can be reordered, then both {\tt r0} and {\tt r1} can be assigned 0.
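For concreteness, here is one way this program might be written as a standalone C11 program. This is my own sketch, not how the paper presents it: the use of relaxed atomics and of the C11 threads API is simply a way of making the example compile and run.

#include <stdatomic.h>
#include <stdio.h>
#include <threads.h>

atomic_int x, y;   // the two shared locations, initially 0
int r0, r1;        // the results read by each thread

int left(void *arg) {
  (void)arg;
  atomic_store_explicit(&x, 1, memory_order_relaxed);   // x = 1
  r0 = atomic_load_explicit(&y, memory_order_relaxed);  // r0 = y
  return 0;
}

int right(void *arg) {
  (void)arg;
  atomic_store_explicit(&y, 1, memory_order_relaxed);   // y = 1
  r1 = atomic_load_explicit(&x, memory_order_relaxed);  // r1 = x
  return 0;
}

int main(void) {
  thrd_t t1, t2;
  thrd_create(&t1, left, NULL);
  thrd_create(&t2, right, NULL);
  thrd_join(t1, NULL);
  thrd_join(t2, NULL);
  // Under sequential consistency only (r0,r1) = (0,1), (1,0) or (1,1) can be
  // printed; the x86 model (and C's relaxed atomics) also permits (0,0).
  printf("r0=%d r1=%d\n", r0, r1);
  return 0;
}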

How are memory consistency models defined?

Beyond specific examples such as the one above, how does one determine in general which behaviours are allowed or disallowed when running a program under a given memory consistency model? One of the main approaches is the axiomatic approach, which has two phases. In the first phase, the program is translated into its set of candidate executions, each of which represents one way the program might behave when it is run. For instance, the program shown above has the following four candidate executions.

[Figure: the four candidate executions of the store-buffering program]

Each execution is composed of a set of events joined with arrows. Each event represents a memory-related action being performed by the computer. Store instructions give rise to write (W) events, and load instructions give rise to read (R) events. Each event is tagged with the location it accesses (e.g. {\tt x} or {\tt y}) and the value it reads or writes (e.g. 0 or 1). The dotted rectangles group events into different threads. The black ‘sb’ arrows (short for sequenced before) capture the order of the instructions in the original program. For instance, the “{\tt x=1}” instruction is ordered before the “{\tt r0=y}” instruction in the original program, so each of the candidate executions above places an ‘sb’ arrow from the write of {\tt x} to the read of {\tt y}. The red ‘rf’ arrows (short for reads from) capture how data moves between the events of the execution: each read event either is pointed to by an ‘rf’ arrow that originates at the write event from which the read event obtains its value, or has no ‘rf’ arrow pointing to it, which means that the read event observes the default value of 0. The candidate executions above cover all the possible ways to place the ‘rf’ arrows. Finally, the purple ‘fr’ arrows (short for from reads) link each read event to all the write events that overwrite the data that the read event observed. For instance, in the bottom-left execution above, the right-hand thread’s read of {\tt x} observes the initial value, 0. This value is overwritten when the left-hand thread writes 1 to {\tt x}, so we draw an ‘fr’ arrow from the read to the write.

In the second phase, the set of candidate executions is pruned, to leave just the executions that the memory consistency model allows. The pruning process is done on an execution-by-execution basis. Each candidate execution is checked for certain patterns, and as long as it does not exhibit any patterns that the memory consistency model deems inconsistent, it is allowed. For instance, sequential consistency states that it is forbidden for ‘fr’ and ‘sb’ arrows to form loops, as they do in the bottom-right execution above. It is for this reason that sequential consistency forbids the outcome {\tt r0}=0 and {\tt r1}=0. The x86 memory consistency model, on the other hand, does not impose this particular rule, which results in that outcome being allowed.
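To give a flavour of how this pruning phase can be mechanised, here is a small hand-rolled sketch. It is not the encoding used in our paper (which expresses the models in Alloy); it merely hard-codes the bottom-right execution above as four events plus its ‘sb’ and ‘fr’ arrows, and checks the rule just described: that those arrows must not form a loop.

#include <stdbool.h>
#include <stdio.h>

// Events of the bottom-right candidate execution of the store-buffering program:
//   0: W x 1    1: R y 0    (left-hand thread)
//   2: W y 1    3: R x 0    (right-hand thread)
enum { NEVENTS = 4 };

// edge[i][j] holds when there is an 'sb' or 'fr' arrow from event i to event j.
static bool edge[NEVENTS][NEVENTS];

// Depth-first search for a cycle in the union of 'sb' and 'fr'.
static bool dfs(int v, bool visited[], bool on_stack[]) {
  visited[v] = on_stack[v] = true;
  for (int w = 0; w < NEVENTS; w++) {
    if (!edge[v][w]) continue;
    if (on_stack[w]) return true;  // back edge found: the arrows form a loop
    if (!visited[w] && dfs(w, visited, on_stack)) return true;
  }
  on_stack[v] = false;
  return false;
}

static bool has_cycle(void) {
  bool visited[NEVENTS] = { false }, on_stack[NEVENTS] = { false };
  for (int v = 0; v < NEVENTS; v++)
    if (!visited[v] && dfs(v, visited, on_stack)) return true;
  return false;
}

int main(void) {
  edge[0][1] = edge[2][3] = true;  // 'sb': program order within each thread
  edge[1][2] = edge[3][0] = true;  // 'fr': each read of 0 is overwritten by the other thread's write
  // Sequential consistency rejects executions in which 'sb' and 'fr' form a loop.
  puts(has_cycle() ? "inconsistent under sequential consistency"
                   : "consistent under sequential consistency");
  return 0;
}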

Actually, the situation is a little more complicated than this. In programming languages like C or OpenCL, if any of a program’s consistent candidate executions is found to contain a data race (which roughly means that one mini-computer can be accessing a memory location at the same time as another mini-computer is storing a value to that location) then all executions become allowed. The idea is that if programmers write code that is liable to cause a data race, then no guarantees can be made about what will happen when that code is run. For an example of this, consider the C program below.

[Figure: the message-passing program in C]

The left-hand thread stores 1 to location {\tt x}, and then performs a special releasing store to location {\tt y}. Meanwhile, the right-hand thread performs a special acquiring load from {\tt y} and then an ordinary load from {\tt x}. (NB: Releasing and acquiring accesses by different threads are allowed to proceed simultaneously, without causing a data race.) The four candidate executions of this program are given below.

[Figure: the four candidate executions of the message-passing program]

According to C’s memory consistency model, all of these executions are consistent except for the bottom-left one. The reason for this has to do with a mechanism called release/acquire synchronisation. In C, if an acquiring load observes the value written by a releasing store, then all the events that happened prior to the releasing store are guaranteed to be visible to all the events that happen after the acquiring load. The bottom-left execution above is deemed inconsistent because even though the release/acquire synchronisation on location {\tt y} has been successful (as can be seen by the presence of the ‘rf’ arrow), the right-hand thread does not see that {\tt x} has been set to 1.

However, a more worrying aspect of this program is that the top-left consistent execution contains a data race. In this execution, the release/acquire synchronisation fails because the acquiring load does not see the data from the releasing store. Nonetheless, the right-hand thread goes ahead and loads from {\tt x}. In the absence of the release/acquire synchronisation, there is nothing to stop the load and the store on {\tt x} coinciding, and since these accesses are not special acquiring or releasing instructions, they cause a data race. This means that the program above can exhibit any behaviour at all – even the behaviour that we just deemed inconsistent!
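To make this concrete, here is roughly how the message-passing program might be written as a standalone C11 program. Again this is my own sketch: the variable names, the value stored to {\tt y}, and the decision to model the ‘ordinary’ accesses to {\tt x} as non-atomic are all assumptions.

#include <stdatomic.h>
#include <threads.h>

int x;           // ordinary (non-atomic) location, initially 0
atomic_int y;    // atomic location used for release/acquire synchronisation
int r0, r1;

int sender(void *arg) {
  (void)arg;
  x = 1;                                                // ordinary store
  atomic_store_explicit(&y, 1, memory_order_release);   // releasing store
  return 0;
}

int receiver(void *arg) {
  (void)arg;
  r0 = atomic_load_explicit(&y, memory_order_acquire);  // acquiring load
  r1 = x;  // ordinary load: in executions where the acquire above does not
           // observe 1, this load can race with the "x = 1" in the other thread
  return 0;
}

int main(void) {
  thrd_t t1, t2;
  thrd_create(&t1, sender, NULL);
  thrd_create(&t2, receiver, NULL);
  thrd_join(t1, NULL);
  thrd_join(t2, NULL);
  return 0;
}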

Why do we want to compare memory consistency models?

There are three main reasons for wanting to compare memory consistency models.

The first reason has to do with the fact that memory consistency models are continuing to evolve. For instance, several researchers have proposed amendments to the C memory consistency model, either to fix quirks in the model, or to give it nicer mathematical properties, or just to make it a bit simpler for programmers to understand. Each time a modification is proposed, it is important to compare it to the original model, to understand exactly which program behaviours become newly-allowed or newly-disallowed.

The second reason goes beyond just comparing two variants of the same memory consistency model, and instead looks at models that exist at different levels of abstraction. For instance, a comparison between the C memory consistency model and the x86 memory consistency model sheds light on how the code written by a C programmer should be compiled down to code that can be run by an x86 computer. A fundamental tenet of compilation is that a correct compiler never introduces new behaviours. In other words, if the C memory consistency model says that a certain behaviour of a program is not allowed, but the x86 memory consistency model says that the compiled program can exhibit that behaviour, then the compiler is wrong. In general, by comparing what a language-level memory consistency model says about programs with what the computer-level memory consistency model says about compiled programs, we have the opportunity to discover errors in compilers. These errors are difficult to find by experimental testing because they may only show up intermittently, and they are difficult to find by manually inspecting the compiler because the memory consistency models involved are so complex and counterintuitive. Nonetheless, such errors are prevalent in real-world compilers; for instance, the standard scheme for compiling C to Power multiprocessors has recently been shown to be flawed, and a paper of mine (with Mark Batty, Brad Beckmann, and Ally Donaldson) identified two flaws in a compiler from OpenCL to AMD’s graphics processors.

The third reason involves comparing a memory consistency model to itself. As discussed above, we can check that a compiler is correctly mapping high-level language instructions to low-level computer instructions by comparing what the language-level memory consistency model says about programs with what the computer-level memory consistency model says about compiled programs. But before compilers perform the final mapping, they tweak the input program in an attempt to make it more efficient. Just like the final mapping, these tweaks (which are commonly called compiler optimisations) should never introduce new behaviours. In other words, if a memory consistency model says that a certain behaviour of a program is not allowed, but that this behaviour becomes allowed once the program is optimised, then the optimisation is wrong. Thus, by comparing which behaviours a memory consistency model allows before and after optimising, we have the opportunity to identify flaws in compiler optimisations. As with the compiler mappings discussed above, flaws in compiler optimisations are tricky to find, but are certainly prevalent. For instance, Viktor Vafeiadis and colleagues demonstrated last year that a whole host of optimisations routinely used by C compilers are actually not permitted by the C memory consistency model, because they can introduce new behaviours when performed on some (extremely contrived!) programs.

Why is comparing memory consistency models difficult?

Comparing two memory consistency models boils down to finding a litmus test that can tell them apart. A litmus test is just a program (typically a rather small program) together with an outcome. We say that the test passes if, when the program is run, the specified outcome is obtained. Litmus tests typically do not behave in the same way each time they are run, so we talk of tests that can never pass, or sometimes pass, or always pass. For instance, the first program we showed above, together with the outcome “{\tt r0}=0 and {\tt r1}=0”, forms a litmus test – a litmus test that never passes under sequential consistency, and sometimes passes under the x86 memory consistency model.

To compare two variants of a memory consistency model, say M and N, we search for a litmus test that never passes under M but sometimes passes under N. If we are successful in this search, we have shown that N allows a behaviour that M forbids, and hence that the two models genuinely differ. To check a compiler optimisation, we search for a litmus test that never passes, but which can pass after being optimised. And to check a compiler mapping, we search for a litmus test that never passes (under the language-level memory consistency model), but which can pass (under the computer-level memory consistency model) after the compiler mapping is applied.

In all three cases, the problem has the same ‘shape’: we are always looking for programs P and Q, and a final outcome \sigma, such that P and Q are related in some way (either they are the same, or P optimises to Q, or P is compiler-mapped to Q), the litmus test (P,\sigma) never passes, and the litmus test (Q,\sigma) sometimes passes. In the diagram below, we use a black triangle to represent the appropriate relationship between P and Q.

[Figure: the search for litmus tests (P,\sigma) and (Q,\sigma), with a black triangle relating P and Q]

We can plug these requirements into an automatic constraint solver and ask it to come up with suitable litmus tests. However, it turns out that these requirements are extremely hard for automatic constraint solvers to handle. Showing that P and Q are related is easy, and showing that the litmus test (Q,\sigma) sometimes passes is also easy: one just needs to demonstrate one execution of Q that ends up with the desired outcome \sigma. But showing that (P,\sigma) never passes is difficult, because one needs to demonstrate that every execution of P  – of which there may be a great number – does not end up with the outcome \sigma. In our initial experiments, we found it infeasible to compare even the simplest of memory consistency models using this approach.

How do we get around this difficulty in our work?

One of the key ideas in our work is to flip the diagram above on its head. Rather than asking the constraint solver to find P and Q directly, we start by asking it to find executions X and Y. We require that X is inconsistent under memory consistency model M, that Y is consistent under N, and that X and Y bear a certain relation to each other that we will explain in a moment. If the solver can find such a pair of executions, then we reverse-engineer those executions to form a pair of litmus tests. That is, from X we construct a litmus test (P,\sigma) that has X as one of its candidate executions, and from Y we construct a litmus test (Q,\sigma) that has Y as one of its candidate executions. This construction process is fairly straightforward: read events in the execution become load instructions in the litmus test, write events become stores, sequencing arrows in the execution determine the order of instructions in the litmus test, and reads-from arrows are reflected in the litmus test’s final outcome. We mentioned above that X and Y must bear a certain relation to each other; this relation, which we draw as a white triangle in the diagram below, serves to ensure that the constructed litmus tests have the same final outcome \sigma, and that the constructed litmus tests are related to each other in the required way (e.g. they are the same, or one optimises to the other, or one is compiler-mapped to the other).

[Figure: the search recast over executions X and Y, related by a white triangle, from which the litmus tests (P,\sigma) and (Q,\sigma) are constructed]

At this point, we have obtained, as desired, a litmus test (Q,\sigma) that sometimes passes  – we know this because the test has Y as one of its candidate executions and Y is consistent. We have also obtained a litmus test (P,\sigma) that sometimes fails – we know this because the test has X as one of its candidate executions and X is inconsistent. But we required that (P,\sigma) always fails! As things stand, it may be that (P,\sigma) can pass by following a different execution, say X', that is allowed.

[Figure: the litmus test (P,\sigma) may still pass via a different consistent execution X']

We now give an example of exactly this situation. Suppose our constraint solver finds the execution below, which is inconsistent according to the C memory consistency model, and which we shall call X.

[Figure: the execution X, inconsistent under the C memory consistency model]

This execution we have seen already: it is one of the candidate executions of our message-passing program. We go on to convert X into a litmus test in the manner described above, and this results in the following.

[Figure: the litmus test constructed from X, i.e. the message-passing program]

This litmus test we have also seen already: it is the message-passing program from earlier. And as we remarked earlier, the program exhibits a data race, and therefore all outcomes are allowed. For this reason, we have certainly not found a litmus test that always fails.

We are able to resolve this situation simply by imposing extra constraints when we generate X. One of the constraints we impose is that if an execution averts a data race via the release/acquire synchronisation mechanism, then one of the events in the potential data race must be conditional on the acquiring load successfully observing the value written by the releasing store. An example of an execution that meets this additional constraint is given below.

[Figure: an execution like X, but with an additional ‘cd’ arrow from the read of {\tt y} to the read of {\tt x}]

The extra ‘cd’ arrow represents a control dependency. It expresses that the read of {\tt x} only happened because the read of {\tt y} obtained the value that it did. When we transform executions into litmus tests, these arrows give rise to if-statements, as shown below.

[Figure: the litmus test constructed from the execution above, with the load of {\tt x} guarded by an if-statement]

This litmus test does not give rise to any data races, because the right-hand thread’s load of {\tt x}, which could potentially race with the left-hand thread’s store to {\tt x}, only happens if the release/acquire synchronisation on {\tt y} is successful, and when this release/acquire synchronisation is successful it enforces a race-preventing ‘barrier’ between the accesses to {\tt x}.
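In the C11 rendering used earlier, the repair amounts to guarding the ordinary load with an if-statement; again, this is my own sketch of the litmus test in the figure, reusing the assumed variable names from before.

#include <stdatomic.h>
#include <threads.h>

int x;           // ordinary (non-atomic) location, initially 0
atomic_int y;    // atomic location used for release/acquire synchronisation
int r0, r1;

int sender(void *arg) {
  (void)arg;
  x = 1;                                                // ordinary store
  atomic_store_explicit(&y, 1, memory_order_release);   // releasing store
  return 0;
}

int receiver(void *arg) {
  (void)arg;
  r0 = atomic_load_explicit(&y, memory_order_acquire);  // acquiring load
  if (r0 == 1) {
    // Only reached once the release/acquire synchronisation has succeeded,
    // so this ordinary load can no longer race with "x = 1".
    r1 = x;
  }
  return 0;
}

int main(void) {
  thrd_t t1, t2;
  thrd_create(&t1, sender, NULL);
  thrd_create(&t2, receiver, NULL);
  thrd_join(t1, NULL);
  thrd_join(t2, NULL);
  return 0;
}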

Besides this constraint that governs release/acquire synchronisation, we impose a few more to deal with similar problematic situations (about which more details are available in the full paper). Collectively, we say that these additional constraints enforce that the execution X is dead. We use this slightly overdramatic term to indicate that not only is the execution X forbidden, but so is every other execution of the litmus test obtained from X.

It is possible that by adding the deadness constraint, we miss some litmus tests that would distinguish two memory consistency models. We have made every effort to build the deadness constraint so that it removes only those executions that are genuinely problematic, but we cannot be sure that it is not a little overenthusiastic in rejecting executions. This caveat notwithstanding, the approach we have described – which involves finding individual executions (one of which is dead and inconsistent, the other of which is consistent) and then constructing litmus tests from them – yields a fast and useful way to compare memory consistency models automatically.

What can we achieve, now that we can compare memory consistency models automatically?

Comparing variants of a model

The first thing we can do is automatically compare two variants of the same memory consistency model. Several researchers have proposed variants of the C memory consistency model, and each proposal is illustrated with a litmus test that can pass or can fail as a result of the change. The shortest litmus test that is able to distinguish the variants is often the most informative. We have used our approach to compare the original C memory consistency model against three recently-proposed variants, and found that in two of these cases, our automatic approach was able to distinguish them with shorter litmus tests than the authors of the proposals were able to come up with manually.

As an example, here is an execution (and accompanying litmus test) that I came up with, by dint of manual effort, for distinguishing the original C memory consistency model from a variant proposed by Batty, Donaldson and myself in a paper presented earlier this year. For various technical reasons that I will not go into here, the litmus test can pass under the original model, but cannot under the revised model.

[Figure: the manually-found execution and litmus test (seven instructions)]

And here is the execution (and accompanying litmus test) that our automatic approach came up with for distinguishing the same memory consistency models. Note that the automatically-found litmus test requires just five instructions where the manually-found one needs seven.

[Figure: the automatically-found execution and litmus test (five instructions)]

Checking compiler optimisations

The second thing we can do is check compiler optimisations. As mentioned above, Viktor Vafeiadis and colleagues demonstrated that several optimisations routinely employed in C compilers are invalid according to the C memory consistency model. We have subjected these optimisations to an automatic check using our approach, and were able to generate many of the same results that Vafeiadis and colleagues discovered through manual effort. And as before, our automatically-found litmus tests are often simpler than the manually-found originals.

To illustrate this, here is an example given by Vafeiadis and colleagues that shows that it is not valid for a C compiler to replace a ‘relaxed’ store (a relaxed store is another special kind of store instruction that we have not mentioned before) with a releasing store. The litmus test given below always fails under the C memory consistency model, but if the highlighted relaxed store is replaced with a releasing store, then it sometimes passes.

[Figure: the manually-found litmus test, with the relaxed store highlighted]

And below is our automatically-generated version. This time, the relaxed-to-releasing replacement is done on a special fence instruction rather than a store, but the spirit of the optimisation is the same. Note that it needs just seven instructions (rather than eight), three memory locations (rather than four), and two threads (rather than three).

[Figure: the automatically-generated litmus test, with the relaxed fence highlighted]

Checking compiler mappings

The third thing we can do is check compiler mappings. Here, we go beyond recreating (and simplifying) known results, and use our approach to guide the development of a new memory consistency model. We used our approach to check a compiler mapping for translating instructions in the OpenCL programming language into instructions that can be understood by NVIDIA’s graphics processors. My work with Mark Batty and Ally Donaldson provides a precise OpenCL memory consistency model, and my earlier work (with Jade Alglave, Mark Batty, Ally Donaldson, Ganesh Gopalakrishnan, Jeroen Ketema, Daniel Poetzl and Tyler Sorensen) provides a memory consistency model for NVIDIA graphics processors. Through an automatic comparison of these two memory consistency models, we were able to discover an error in the mapping.

We blamed the error not on the mapping itself, but on the memory consistency model for NVIDIA graphics processors being unnecessarily permissive. That is: the model is experimentally sound with respect to the processors (because all the behaviours that were actually observed were indeed allowed by the model), but it also allows some behaviours that could never be observed. So, we devised a slightly more restrictive memory consistency model for NVIDIA graphics processors, and re-checked the compiler mapping. Happily, we were unable to discover any errors this time.

It remained to be shown that the more restrictive model is still experimentally sound with respect to the actual graphics processors. To this end, we used our approach to compare the original NVIDIA memory consistency model with our revised one; this time asking not for a single litmus test that can distinguish them, but for all such tests (up to a reasonable size). We ran these tests on actual NVIDIA graphics processors (in the same way as we did when devising the original model), and were unable to find any examples of unsoundness in our revised model. We thus argue that our revised memory consistency model, developed with the aid of our automatic comparison approach, is both permissive enough to capture all the observable behaviours of NVIDIA graphics processors and restrictive enough to support an efficient compiler mapping from the OpenCL programming language.

Next steps

Our full paper contains more examples of comparing variants of memory consistency models, more examples of checking compiler optimisations, and more examples of checking compiler mappings. It also provides a full definition of the ‘dead’ constraint, and the exact process we use for transforming executions into litmus tests.

The automatic constraint solver that we use is called Alloy. All of the memory consistency models, compiler optimisations, and compiler mappings that we have mentioned above have been written out in the Alloy format, and are all available to download from a GitHub repository.

 

It is quite well-known that the memory operations and fences provided by the C/C++11 languages are pretty complicated and confusing. For instance, my OOPSLA 2015 paper (joint with Batty, Beckmann, and Donaldson) explains that a proposed implementation of these instructions for next-generation AMD graphics cards is faulty because the designers misunderstood how they worked. And back in 2011, Batty et al. showed that (an early draft of) the C++11 language failed to enforce a desirable property called ‘sequential consistency for data-race-free programs’, as a result of a subtlety in the way some of these memory operations were being defined.

Today I found another example of the confusion arising from memory operations and fences in C/C++11. The cppreference.com website is a long-established, well-regarded, and widely-used reference for C and C++ programmers, but its description of how the “acquire/release” memory operations and fences work is quite wrong.

The website claims that acquire loads cannot be reordered with subsequent reads:

[Screenshot of cppreference.com: acquire loads cannot be reordered with subsequent reads]

and that release stores cannot be reordered with preceding writes. It also claims that a store that follows a release fence cannot be reordered with any write that precedes the fence:

[Screenshot of cppreference.com: a store that follows a release fence cannot be reordered with writes that precede the fence]

This is incorrect (or more precisely, it is not the whole truth). The truth, according to the official C/C++11 language specifications, is that

  • acquire loads cannot be reordered with any subsequent memory accesses (not just reads), and
  • release stores cannot be reordered with any preceding memory accesses (not just writes).

Similarly,

  • release fences prevent stores that follow the fence being reordered with any memory access that precedes the fence (not just writes), and
  • acquire fences prevent loads that precede the fence being reordered with any memory access that follows the fence.

Here is an example that demonstrates the different interpretations of acquire loads. Suppose we have two threads:

int main() {
  atomic_int x=0; atomic_int y=0;
  {{{ {
    // Thread 1:
    int r0 = atomic_load_explicit(&x, memory_order_relaxed); //a
    atomic_store_explicit(&y, 1, memory_order_release);      //b
  } ||| {
    // Thread 2:
    int r1 = atomic_load_explicit(&y, memory_order_acquire); //c
    atomic_store_explicit(&x, 1, memory_order_relaxed);      //d
  } }}}
}

Can we obtain r0 = 1 and r1 = 1 at the end? According to the official standard, the answer is no. This can be confirmed by copying the code above into the CppMem tool (which enumerates all the behaviours of a small C/C++ program that are allowed by the standard).

But if we follow the guidelines from cppreference.com and allow the acquire load (c) to be reordered with the subsequent relaxed store (d), then it is straightforward to obtain r0 = r1 = 1, simply by executing the instructions in the following order: d, a, b, c.

According to cppreference.com, acquire/release instructions are less powerful than they are in the official standard. This is problematic for programmers, because they might end up using stronger instructions than they really need, which may make their programs less efficient. It’s even more problematic for compiler-writers, because if they just follow the interpretation on cppreference.com, they might end up providing acquire/release instructions that are weaker than programmers expect, which means that correctly-written programs could give the wrong results when they are executed.

In this post, I propose that scientists and engineers should consider plotting data that is expressed as a dimensionless ratio on a logarithmic scale.

(By a “dimensionless ratio”, I refer to data that is the ratio between two measurements that have the same dimension. Examples of such data include

  • the ratio between the execution time of a program before a proposed compiler optimisation has been applied and the execution time of that program afterwards, and
  • the power consumption of a proposed computer divided by the power consumption of a reference computer.)

In the vast majority of research papers published in computer architecture in the last year, when this data is plotted on a graph, a linear scale is used. I believe that a logarithmic scale would be more appropriate most of the time.

I will illustrate my reasons with reference to the two graphs below, which show some sort of “speedup” that has been obtained on four benchmark programs, A, B, C, and D. The left graph uses a linear y-axis, while the right one plots the same data on a logarithmic y-axis.

[Figure: speedups for benchmarks A, B, C and D, plotted with a linear y-axis (left) and a logarithmic y-axis (right)]

Four reasons for using a logarithmic axis

  1. The natural origin for a speedup ratio is 1, not 0. That is, we are primarily interested in seeing whether a data point lies above or below 1 (i.e., whether a speedup or a slowdown was observed) rather than whether it lies above or below 0. In the right-hand graph above, it is immediately obvious that A and B experience a slowdown; this is slightly less obvious in the left-hand graph.
  2. Going from a 1x speedup to a 2x speedup is surely more impressive than going from a 3x speedup to a 4x speedup. But on the linear y-axis in the left-hand graph above, the distance between 1 and 2 is the same as the distance between 3 and 4, so these feats appear equally impressive.
  3. Obtaining a 2x speedup is likely to be considered just as good as obtaining a 2x slowdown (i.e., a 0.5x speedup) is bad. But on the linear y-axis in the left-hand graph above, the distance from 1 to 0.5 is much smaller than the distance from 1 to 2, so the speedup experienced by benchmark C is emphasised over the slowdown experienced by benchmark B, even though both have the same magnitude.
  4. On a linear scale, the “centre of gravity” to which the eye is drawn lies at the arithmetic mean, while on a logarithmic scale, the centre of gravity is at the geometric mean. When averaging dimensionless ratios, most authors tend to use the geometric mean.

Caveat. One should be careful when the ratio being plotted could be zero, for then the geometric mean will degenerate to zero (it being the nth root of the product of the individual ratios, and that product is zero). The log scale will also fail to handle such cases because the logarithm of zero is negative infinity (which tends not to fit on the page).

Remark. There is considerable debate on the relative merits of the arithmetic mean, geometric mean, and harmonic mean when summarising speedup figures. (The debate is well summarised by Citron et al. and by Eeckhout.) I argue that whenever the geometric mean is deemed a suitable average – and it appears to be the most prevalent among the three means when summarising speedup figures – then a logarithmic scale should be used for the same reasons.
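To make the difference between the two averages concrete, here is a small C sketch with made-up speedup figures; the arithmetic mean suggests a healthy overall speedup, while the geometric mean (the “centre of gravity” on a logarithmic axis) is much closer to “no change”.

#include <math.h>
#include <stdio.h>

int main(void) {
  // Hypothetical speedup ratios for four benchmarks (made-up numbers).
  double speedup[] = { 0.8, 0.5, 2.0, 1.6 };
  int n = sizeof speedup / sizeof speedup[0];

  double sum = 0.0, log_sum = 0.0;
  for (int i = 0; i < n; i++) {
    sum += speedup[i];
    log_sum += log(speedup[i]);  // geometric mean = exp(mean of the logarithms)
  }
  printf("arithmetic mean = %.3f\n", sum / n);           // 1.225
  printf("geometric mean  = %.3f\n", exp(log_sum / n));  // about 1.064
  return 0;
}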

Examples from research papers

The following graph is from Yizhou Zhang et al. Accepting Blame for Safe Tunneled Exceptions in PLDI 2016. The authors report that the graph demonstrates an average speedup of 2.4%, and by examination, I deduce that this refers to the geometric mean. The y=1 line has been added by the authors to help the reader distinguish speedups from slowdowns; such an addition would be unnecessary if a logarithmic y-axis were used instead.

[Graph from Zhang et al., PLDI 2016]

The following graph is from Michael D. Adams et al. On the Complexity and Performance of Parsing with Derivatives in PLDI 2016. The average speedup is given as 2.04, but it’s not clear whether this is the arithmetic or geometric mean.

[Graph from Adams et al., PLDI 2016]

The following graph is from James Bornholt et al. Optimizing Synthesis with Metasketches in POPL 2016.

[Graph from Bornholt et al., POPL 2016]

The following graph is from Sergi Abadal et al. WiSync: An Architecture for Fast Synchronization through On-Chip Wireless Communication in ASPLOS 2016. This graph reports both the arithmetic and geometric mean. Note that if a logarithmic y-axis had been used, it would (probably) not be necessary to have several bars extending beyond the top of the graph.

[Graph from Abadal et al., ASPLOS 2016]

The following graph is from Xiaodong Wang and José F. Martínez ReBudget: Trading Off Efficiency vs. Fairness in Market-Based Multicore Resource Allocation via Runtime Budget Reassignment in ASPLOS 2016. It reports the geometric mean.

[Graph from Wang and Martínez, ASPLOS 2016]

The following graph is from Haishan Zhu and Mattan Erez Dirigent: Enforcing QoS for Latency-Critical Tasks on Shared Multicore Systems [Paywall] in ASPLOS 2016. Their “FG” figures represent time ratios, and are averaged using the arithmetic mean, while their “BG” figures represent rate ratios, and are averaged using the harmonic mean.

[Graph from Zhu and Erez, ASPLOS 2016]

The following graph is from Anurag Mukkara et al. Whirlpool: Improving Dynamic Cache Management with Static Data Classification in ASPLOS 2016. No mean speedup is calculated for these measurements in the paper; only the ranges are reported.

[Graph from Mukkara et al., ASPLOS 2016]

The following graphs are from the same paper. For these graphs, the authors quote the geometric mean speedups.

[Further graphs from Mukkara et al., ASPLOS 2016]

The following graph is from Myeongjae Jeon et al. TPC: Target-Driven Parallelism Combining Prediction and Correction to Reduce Tail Latency in Interactive Services [Paywall] in ASPLOS 2016.

[Graph from Jeon et al., ASPLOS 2016]

The following graph is from Tong Zhang et al. TxRace: Efficient Data Race Detection Using Commodity Hardware Transactional Memory [Paywall] in ASPLOS 2016. It gives the geometric mean.

[Graph from Zhang et al., ASPLOS 2016]

The following graph is from Nils Asmussen et al. M3: A Hardware/Operating-System Co-Design to Tame Heterogeneous Manycores in ASPLOS 2016.

[Graph from Asmussen et al., ASPLOS 2016]

The following graph is from Daniyal Liaqat et al. Sidewinder: An Energy Efficient and Developer Friendly Heterogeneous Architecture for Continuous Mobile Sensing in ASPLOS 2016. No averages are given, only ranges.

[Graph from Liaqat et al., ASPLOS 2016]

The following graph is from Jonathan Balkind et al. OpenPiton: An Open Source Manycore Research Framework in ASPLOS 2016. No averages are given.

[Graph from Balkind et al., ASPLOS 2016]

The following graphs are from Alex Markuze et al. True IOMMU Protection from DMA Attacks: When Copy is Faster than Zero Copy in ASPLOS 2016.

[Graphs from Markuze et al., ASPLOS 2016]

The following graph is from Amro Awad et al. Silent Shredder: Zero-Cost Shredding for Secure Non-Volatile Main Memory Controllers in ASPLOS 2016. This graph is quite interesting because the quoted 3.3x speedup is actually the arithmetic mean of all the per-benchmark speedups. The geometric mean would give an average speedup of 2.4x, and is probably the more appropriate measure in this case.

[Graph from Awad et al., ASPLOS 2016]

The same paper also reports a reduction in the number of writes to main memory (see the graph below). This graph could also use a logarithmic axis, but there is a problem with the H264 benchmark, which requires 0x as many writes to main memory when “Silent Shredder” is used – i.e., none at all. On a log scale, this data point would need to stretch to negative infinity. It also means that although calculating the arithmetic mean of these ratios feels suspect to me, the geometric mean does not work at all in this case.

[Graph from Awad et al., ASPLOS 2016: reduction in writes to main memory]

The following graph is from Youngjin Kwon et al. Sego: Pervasive Trusted Metadata for Efficiently Verified Untrusted System Services in ASPLOS 2016. No overall averages are given.

[Graph from Kwon et al., ASPLOS 2016]

The following graphs are from Saurav Muralidharan et al. Architecture-Adaptive Code Variant Tuning in ASPLOS 2016.

[Graphs from Muralidharan et al., ASPLOS 2016]

Further reading

  • Beth C. Gladen and Walter J. Rogan On graphing rate ratios in the American Journal of Epidemiology, 1983. This article argues that relative rates should be plotted on logarithmic rather than linear scales.
  • James R. Hebert and Donald R. Miller Plotting and discussion of rate ratios and relative risk estimates in the Journal of Clinical Epidemiology, 1989. This article argues that relative rates should actually be plotted not on logarithmic scales but on reciprocal scales!

 

It’s pretty common for database-driven apps to display a “last updated” field. For instance, here is a screenshot of my Mail app on my iPhone, showing that my emails were “Updated Just Now”.

[Screenshot: the iPhone Mail app showing “Updated Just Now”]

And here is a screenshot from the RealTimeTrains app, showing that the information was “Updated a few seconds ago”.

[Screenshot: the RealTimeTrains app showing “Updated a few seconds ago”]

Why do apps insist on using relative timestamps like this? It’s all very well having a piece of text to say that the information was “just updated”, but how do I know that the piece of text itself is up-to-date?

(It’s a bit like those “Lost Cat” posters that say “missing since last Friday” and are then left up for months.)

If instead the text gave an absolute timestamp like “Updated at 07:57”, then I would know that the information is up to date, and that the app hasn’t simply gotten stuck.

I transcribed Barry Gray’s theme song from Stingray, Aqua Marina, into piano sheet music. There’s a vocal part, replete with lyrics, plus a keyboard part comprising a right-hand part and chord symbols. It roughly corresponds to what you might find in a musician’s “fake book”.

Available as a pdf file, as an editable Noteworthy Composer file, and as an mp3 audio file (recorded on my iPhone).

 

Here are some tips for travelling from Cambridge to London. Prices are fairly accurate as of April 2015.

By coach

National Express operates a service between Cambridge and London. It’s not a well-used service, chiefly because it takes a good two and a half hours. However, note that if you don’t go all the way to Victoria but instead jump onto the Underground network at, say, Mile End, you can slice off the last hour or so. It’s also worth noting that 10-trip tickets are available (only from the driver). These work out as £5.65 for each one-way journey, so they are cheaper (and quicker) than buying advance tickets online, but they do not guarantee a seat (not that getting a seat has ever posed a problem for me).

By train

Consider taking the train to Liverpool Street (LST) rather than the default King’s Cross (KGX). There are “slow” and “fast” trains to both destinations, with the fast LST trains taking about 70 mins compared to the fast KGX trains taking about 50 mins. Although the LST route is a little slower, it has a number of advantages. First, it is much less crowded. Second, it is noticeably cheaper: if you travel with a Network Railcard after 10am, you can get a day return (which includes a Zone 1-6 London Travelcard, and has no evening peak restrictions) for £16.75; the same journey to KGX costs £20.80 (with Travelcard) and you cannot return between 1630 and 1900. Both the LST and KGX routes use trains of varying quality; the LST route has both the best trains and the worst. The trains providing the 1004 and 1021 services to LST, and the 1707, 1737 and 1807 services back to Cambridge, are the best, replete with large tables, comfortable seats, and free (if rather intermittent) wifi.

When designing the English character set, it was quite a bad idea to make the letter O (“oh”) look so similar to the number 0 (zero).

But what turned a bad situation into an absolute disaster was the convention of pronouncing the number 0 as “oh” – as in “double-oh seven”, or the freephone number “oh eight-hundred”. At least if the convention had been to pronounce the number 4 as “dee”, one could potentially resolve the ambiguity by looking at the shape of the character. As things are, a vertically elongated circle called “oh” can equally refer to the 15th letter of the alphabet or to the number zero.

Sentient beings can often infer whether a given ovoid is intended as a number or a letter from its context. But computers cannot, and in an increasingly computerised world, this is an increasing problem. A phone number entered as O800 123 456 can be successfully interpreted by a human, but an unthinking machine will reject it as invalid. My friend, who thought for many years that his postcode was “DE3 OAB” until I pointed out to him that the third character from the end of any postcode has to be a number, had no trouble receiving mail handled by human operators, but would have struggled to resolve his delivery address on Amazon.

Here’s another tricky area: car registration numbers. Cars built between 1983 and 2001 are identified by a 6- or 7-character sequence comprising a letter, followed by two or three numbers, followed by three further letters. Those built more recently are identified by a 7-character sequence of two letters, two numbers, and three letters. Bearing in mind that the font in which these characters are printed makes no distinction between zeroes and ohs, it is impossible to judge whether the numberplate FO12ABC belongs to a 27-year-old car or a 3-year-old car! (That is, unless you happen to know that in the older system, the three-number sequence never actually began with a zero.)
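For what it’s worth, the ambiguity can be checked mechanically. The following sketch (my own, using POSIX regular expressions that encode the two formats described above) tries both readings of FO12ABC and finds that each one matches a valid format.

#include <regex.h>
#include <stdio.h>

// Returns 1 if the string s matches the extended regular expression pattern.
static int matches(const char *pattern, const char *s) {
  regex_t re;
  if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0) return 0;
  int ok = (regexec(&re, s, 0, NULL, 0) == 0);
  regfree(&re);
  return ok;
}

int main(void) {
  const char *old_style = "^[A-Z][0-9]{2,3}[A-Z]{3}$";   // 1983-2001: letter, 2-3 digits, 3 letters
  const char *new_style = "^[A-Z]{2}[0-9]{2}[A-Z]{3}$";  // post-2001: 2 letters, 2 digits, 3 letters

  const char *as_letter = "FO12ABC";  // reading the ambiguous character as the letter O
  const char *as_zero   = "F012ABC";  // reading it as the digit 0

  printf("FO12ABC: old=%d new=%d\n", matches(old_style, as_letter), matches(new_style, as_letter));
  printf("F012ABC: old=%d new=%d\n", matches(old_style, as_zero),  matches(new_style, as_zero));
  // One reading matches the new format and the other matches the old format,
  // so the plate alone cannot tell us the age of the car.
  return 0;
}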

And it’s not just zero and oh that can cause alphanumerical confusion. The uppercase letter I (“aye”) looks suspiciously similar to the number 1 (one), and to the lowercase l (“ell”). In fact, legend has it that the Queen utters sentences like “One would like to watch telly now” simply because she misread the “I” for a “1” a long time ago and has continued the habit ever since.