It is quite well-known that the memory operations and fences provided by the C/C++11 languages are pretty complicated and confusing. For instance, my OOPSLA 2015 paper (joint with Batty, Beckmann, and Donaldson) explains that a proposed implementation of these instructions for next-generation AMD graphics cards is faulty because the designers misunderstood how they worked. And back in 2011, Batty et al. showed that (an early draft of) the C++11 language failed to enforce a desirable property called ‘sequential consistency for data-race-free programs’, as a result of a subtlety in the way some of these memory operations were being defined.

Today I found another example of the confusion arising from memory operations and fences in C/C++11. The cppreference.com website is a long-established, well-regarded, and widely-used reference for C and C++ programmers, but its description of how the “acquire/release” memory operations and fences work is quite wrong.

The website claims that acquire loads cannot be reordered with subsequent reads:

[Screenshot from cppreference.com describing acquire loads]

and that release stores cannot be reordered with preceding writes. It also claims that a store that follows a release fence cannot be reordered with any write that precedes the fence:

[Screenshot from cppreference.com describing release stores and release fences]

This is incorrect (or more precisely, it is not the whole truth). The truth, according to the official C/C++11 language specifications, is that

  • acquire loads cannot be reordered with any subsequent memory accesses (not just reads), and
  • release stores cannot be reordered with any preceding memory accesses (not just writes).

Similarly,

  • release fences prevent stores that follow the fence being reordered with any memory access that precedes the fence (not just writes), and
  • acquire fences prevent loads that precede the fence being reordered with any memory access that follows the fence.

Here is an example that demonstrates the different interpretations of acquire loads. Suppose we have two threads:

int main() {
  atomic_int x=0; atomic_int y=0;
  {{{ {
    // Thread 1:
    int r0 = atomic_load_explicit(&x, memory_order_relaxed); //a
    atomic_store_explicit(&y, 1, memory_order_release);      //b
  } ||| {
    // Thread 2:
    int r1 = atomic_load_explicit(&y, memory_order_acquire); //c
    atomic_store_explicit(&x, 1, memory_order_relaxed);      //d
  } }}}
}

Can we obtain r0 = 1 and r1 = 1 at the end? According to the official standard, the answer is no: if r1 = 1 then the acquire load (c) must have read from the release store (b), so (b) synchronises with (c); this means that (a) happens before (d), so (a) cannot observe the value written by (d), and hence r0 = 0. This can be confirmed by copying the code above into the CppMem tool (which enumerates all the behaviours of a small C/C++ program that are allowed by the standard).

But if we follow the guidelines from cppreference.com and allow the acquire load (c) to be reordered with the subsequent relaxed store (d), then it is straightforward to obtain r0 = r1 = 1, simply by executing the instructions in the following order: d, a, b, c.
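
The fence rules can be demonstrated in the same way. Below is a variant of the example, which I have written myself in the same style (I haven’t checked whether CppMem accepts this fence-based version verbatim); the acquire and release annotations on the atomic accesses are replaced by atomic_thread_fence calls, and all the accesses are relaxed.

int main() {
  atomic_int x=0; atomic_int y=0;
  {{{ {
    // Thread 1:
    int r0 = atomic_load_explicit(&x, memory_order_relaxed); //a
    atomic_thread_fence(memory_order_release);                // release fence
    atomic_store_explicit(&y, 1, memory_order_relaxed);       //b
  } ||| {
    // Thread 2:
    int r1 = atomic_load_explicit(&y, memory_order_relaxed); //c
    atomic_thread_fence(memory_order_acquire);                // acquire fence
    atomic_store_explicit(&x, 1, memory_order_relaxed);       //d
  } }}}
}

According to the standard, r0 = r1 = 1 is again forbidden: if (c) reads the value written by (b), then the release fence in Thread 1 synchronises with the acquire fence in Thread 2, so (a) happens before (d) and cannot observe its write. But under the quoted cppreference.com reading – where the store (b) is only prevented from being reordered with writes that precede the fence, not with the read (a) – the store (b) may be moved before (a), and then r0 = r1 = 1 is obtained by executing the instructions in the order b, c, d, a.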

According to cppreference.com, acquire/release instructions are less powerful than they are in the official standard. This is problematic for programmers, because they might end up using stronger instructions than they really need, which may make their programs less efficient. It’s even more problematic for compiler-writers, because if they just follow the interpretation on cppreference.com, they might end up providing acquire/release instructions that are weaker than programmers expect, which means that correctly-written programs could give the wrong results when they are executed.

In this post, I propose that scientists and engineers should consider plotting data that is expressed as a dimensionless ratio on a logarithmic scale.

(By a “dimensionless ratio”, I refer to data that is the ratio between two measurements that have the same dimension. Examples of such data include

  • the ratio between the execution time of a program before a proposed compiler optimisation has been applied and the execution time of that program afterwards, and
  • the power consumption of a proposed computer divided by the power consumption of a reference computer.)

In the vast majority of computer architecture research papers published in the last year, such data is plotted on a linear scale. I believe that a logarithmic scale would be more appropriate most of the time.

I will illustrate my reasons with reference to the two graphs below, which show some sort of “speedup” that has been obtained on four benchmark programs, A, B, C, and D. The left graph uses a linear y-axis, while the right one plots the same data on a logarithmic y-axis.

[Graphs: speedups for benchmarks A–D plotted with a linear y-axis (left) and a logarithmic y-axis (right)]

Four reasons for using a logarithmic axis

  1. The natural origin for a speedup ratio is 1, not 0. That is, we are primarily interested in seeing whether a data point lies above or below 1 (i.e., whether a speedup or a slowdown was observed) rather than whether it lies above or below 0. In the right-hand graph above, it is immediately obvious that A and B experience a slowdown; this is slightly less obvious in the left-hand graph.
  2. Going from a 1x speedup to a 2x speedup is surely more impressive than going from a 3x speedup to a 4x speedup. But on the linear y-axis in the left-hand graph above, the distance between 1 and 2 is the same as the distance between 3 and 4, so these feats appear equally impressive.
  3. Obtaining a 2x speedup is likely to be considered just as good as obtaining a 2x slowdown (i.e., a 0.5x speedup) is bad. But on the linear y-axis in the left-hand graph above, the distance from 1 to 0.5 is much smaller than the distance from 1 to 2, so the speedup experienced by benchmark C is emphasised over the slowdown experienced by benchmark B, even though both have the same magnitude.
  4. On a linear scale, the “centre of gravity” to which the eye is drawn lies at the arithmetic mean, while on a logarithmic scale, the centre of gravity is at the geometric mean. When averaging dimensionless ratios, most authors tend to use the geometric mean; a small worked example follows this list.
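
To make reasons 3 and 4 concrete, with made-up numbers (not taken from any of the graphs in this post): suppose one benchmark shows a 2x speedup and another a 2x slowdown (i.e., a 0.5x speedup). The arithmetic mean is (2 + 0.5) / 2 = 1.25, which suggests an overall improvement, whereas the geometric mean is √(2 × 0.5) = 1, which correctly reports that the two effects cancel out. On a logarithmic axis, the two points sit symmetrically either side of the y=1 line, which is exactly the symmetry that the geometric mean respects.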

Caveat. One should be careful when the ratio being plotted could be zero, for then the geometric mean degenerates to zero (being the nth root of the product of the individual ratios, it is zero whenever any single ratio is zero). The log scale will also fail to handle such cases because the logarithm of zero is negative infinity (which tends not to fit on the page).

Remark. There is considerable debate on the relative merits of the arithmetic mean, geometric mean, and harmonic mean when summarising speedup figures. (The debate is well summarised by Citron et al. and by Eeckhout.) I argue that whenever the geometric mean is deemed a suitable average – and it appears to be the most prevalent of the three when summarising speedup figures – a logarithmic scale should be used, for the same reasons.

Examples from research papers

The following graph is from Yizhou Zhang et al. Accepting Blame for Safe Tunneled Exceptions in PLDI 2016. The authors report that the graph demonstrates an average speedup of 2.4%, and by examination, I deduce that this refers to the geometric mean. The y=1 line has been added by the authors to help the reader distinguish speedups from slowdowns; such an addition would be unnecessary if a logarithmic y-axis were used instead.

[Graph from Zhang et al., PLDI 2016]

The following graph is from Michael D. Adams et al. On the Complexity and Performance of Parsing with Derivatives in PLDI 2016. The average speedup is given as 2.04, but it’s not clear whether this is the arithmetic or geometric mean.

[Graph from Adams et al., PLDI 2016]

The following graph is from James Bornholt et al. Optimizing Synthesis with Metasketches in POPL 2016.

[Graph from Bornholt et al., POPL 2016]

The following graph is from Sergi Abadal et al. WiSync: An Architecture for Fast Synchronization through On-Chip Wireless Communication in ASPLOS 2016. This graph reports both the arithmetic and geometric mean. Note that if a logarithmic y-axis had been used, it would (probably) not be necessary to have several bars extending beyond the top of the graph.

[Graph from Abadal et al., ASPLOS 2016]

The following graph is from Xiaodong Wang and José F. Martínez ReBudget: Trading Off Efficiency vs. Fairness in Market-Based Multicore Resource Allocation via Runtime Budget Reassignment in ASPLOS 2016. It reports the geometric mean.

[Graph from Wang and Martínez, ASPLOS 2016]

The following graph is from Haishan Zhu and Mattan Erez Dirigent: Enforcing QoS for Latency-Critical Tasks on Shared Multicore Systems [Paywall] in ASPLOS 2016. Their “FG” figures represent time ratios, and are averaged using the arithmetic mean, while their “BG” figures represent rate ratios, and are averaged using the harmonic mean.

[Graph from Zhu and Erez, ASPLOS 2016]

The following graph is from Anurag Mukkara et al. Whirlpool: Improving Dynamic Cache Management with Static Data Classification in ASPLOS 2016. No mean speedup is calculated for these measurements in the paper; only the ranges are reported.

[Graph from Mukkara et al., ASPLOS 2016]

The following graphs are from the same paper. For these graphs, the authors quote the geometric mean speedups.

[Graphs from Mukkara et al., ASPLOS 2016]

The following graph is from Myeongjae Jeon et al. TPC: Target-Driven Parallelism Combining Prediction and Correction to Reduce Tail Latency in Interactive Services [Paywall] in ASPLOS 2016.

[Graph from Jeon et al., ASPLOS 2016]

The following graph is from Tong Zhang et al. TxRace: Efficient Data Race Detection Using Commodity Hardware Transactional Memory [Paywall] in ASPLOS 2016. It gives the geometric mean.

[Graph from Zhang et al., ASPLOS 2016]

The following graph is from Nils Asmussen et al. M3: A Hardware/Operating-System Co-Design to Tame Heterogeneous Manycores in ASPLOS 2016.

[Graph from Asmussen et al., ASPLOS 2016]

The following graph is from Daniyal Liaqat et al. Sidewinder: An Energy Efficient and Developer Friendly Heterogeneous Architecture for Continuous Mobile Sensing in ASPLOS 2016. No averages are given, only ranges.

[Graph from Liaqat et al., ASPLOS 2016]

The following graph is from Jonathan Balkind et al. OpenPiton: An Open Source Manycore Research Framework in ASPLOS 2016. No averages are given.

[Graph from Balkind et al., ASPLOS 2016]

The following graphs are from Alex Markuze et al. True IOMMU Protection from DMA Attacks: When Copy is Faster than Zero Copy in ASPLOS 2016.

[Graphs from Markuze et al., ASPLOS 2016]

The following graph is from Amro Awad et al. Silent Shredder: Zero-Cost Shredding for Secure Non-Volatile Main Memory Controllers in ASPLOS 2016. This graph is quite interesting because the quoted 3.3x speedup is actually the arithmetic mean of all the per-benchmark speedups. The geometric mean would give an average speedup of 2.4x, and is probably the more appropriate measure in this case.

[Graph from Awad et al., ASPLOS 2016]

The same paper also reports a reduction in the number of writes to main memory (see the graph below). This graph could also use a logarithmic axis, but there is a problem with the H264 benchmark, which requires 0x as many writes to main memory when “Silent Shredder” is used – i.e., none at all. On a log scale, this data point would need to stretch to negative infinity. It also means that although calculating the arithmetic mean of these ratios feels suspect to me, the geometric mean does not work at all in this case.

[Graph from Awad et al., ASPLOS 2016: writes to main memory]

The following graph is from Youngjin Kwon et al. Sego: Pervasive Trusted Metadata for Efficiently Verified Untrusted System Services in ASPLOS 2016. No overall averages are given.

[Graph from Kwon et al., ASPLOS 2016]

The following graphs are from Saurav Muralidharan et al. Architecture-Adaptive Code Variant Tuning in ASPLOS 2016.

[Graphs from Muralidharan et al., ASPLOS 2016]

Further reading

  • Beth C. Gladen and Walter J. Rogan On graphing rate ratios in the American Journal of Epidemiology, 1983. This article argues that relative rates should be plotted on logarithmic rather than linear scales.
  • James R. Hebert and Donald R. Miller Plotting and discussion of rate ratios and relative risk estimates in the Journal of Clinical Epidemiology, 1989. This article argues that relative rates should actually be plotted not on logarithmic scales but on reciprocal scales!

 

It’s pretty common for database-driven apps to display a “last updated” field. For instance, here is a screenshot of my Mail app on my iPhone, showing that my emails were “Updated Just Now”.

[Photo: the Mail app on my iPhone showing “Updated Just Now”]

And here is a screenshot from the RealTimeTrains app, showing that the information was “Updated a few seconds ago”.

[Photo: the RealTimeTrains app showing “Updated a few seconds ago”]

Why do apps insist on using relative timestamps like this? It’s all very well having a piece of text to say that the information was “just updated”, but how do I know that the piece of text itself is up-to-date?

(It’s a bit like those “Lost Cat” posters that say “missing since last Friday” and are then left up for months.)

If instead the text gave an absolute timestamp like “Updated at 07:57”, then I would know that the information is up to date, and that the app hasn’t simply gotten stuck.

I transcribed Barry Gray’s theme song from Stingray, Aqua Marina, into piano sheet music. There’s a vocal part, replete with lyrics, plus a keyboard part comprising a right-hand part and chord symbols. It roughly corresponds to what you might find in a musician’s “fake book”.

Available as a pdf file, as an editable Noteworthy Composer file, and as an mp3 audio file (recorded on my iPhone).

 

Here are some tips for travelling from Cambridge to London. Prices fairly accurate in April 2015.

By coach

National Express operates a service between Cambridge and London. It’s not a well-used service, chiefly because it takes a good two and a half hours. However, note that if you don’t go all the way to Victoria but instead jump onto the Underground network at, say, Mile End, you can slice off the last hour or so. It’s also worth noting that 10-trip tickets are available (only from the driver). These work out as £5.65 for each one-way journey, so they are cheaper (and quicker) than buying advance tickets online, but they do not guarantee a seat (not that getting a seat has ever posed a problem for me).

By train

Consider taking the train to Liverpool Street (LST) rather than the default King’s Cross (KGX). There are “slow” and “fast” trains to both destinations, with the fast LST trains taking about 70 mins compared to the fast KGX trains taking about 50 mins. Although the LST route is a little slower, it has a number of advantages. First, it is much less crowded. Second, it is noticeably cheaper: if you travel with a Network Railcard after 10am, you can get a day return (which includes a Zone 1-6 London Travelcard, and has no evening peak restrictions) for £16.75; the same journey to KGX costs £20.80 (with Travelcard) and you cannot return between 1630 and 1900. Both the LST and KGX routes use trains of varying quality; the LST route has both the best trains and the worst. The trains providing the 1004 and 1021 services to LST, and the 1707, 1737 and 1807 services back to Cambridge, are the best, replete with large tables, comfortable seats, and free (if rather intermittent) wifi.

Oh no

It was quite a bad idea, when the English character set was being designed, to make the letter O (“oh”) look so similar to the number 0 (zero).

But what turned a bad situation into an absolute disaster was the introduction of the convention of pronouncing the number 0 as “oh” – as in “double-oh seven”, or the freephone number “oh eight-hundred”. At least if the convention had been to pronounce the number 4 as “dee”, one could potentially resolve the ambiguity by looking at the shape of the character. As things are, a vertically-stretched circle called “oh” can equally refer to the 15th letter of the alphabet or to the number zero.

Sentient beings can often infer whether a given ovoid is intended as a number or a letter from its context. But computers cannot, and in an increasingly computerised world, this is an increasing problem. A phone number entered as O800 123 456 can be successfully interpreted by a human, but an unthinking machine will reject it as invalid. A friend of mine thought for many years that his postcode was “DE3 OAB”, until I pointed out to him that the third character from the end had to be a number; he had no trouble receiving mail handled by human operators, but he would have struggled to resolve his delivery address on Amazon.

Here’s another tricky area: car registration numbers. Cars built between 1983 and 2001 are identified by a 6- or 7-character sequence comprising a letter, followed by two or three numbers, followed by three further letters. Those built more recently are identified by a 7-character sequence of two letters, two numbers, and three letters. Bearing in mind that the font in which these characters are printed does not clearly distinguish zeroes from ohs, it is quite impossible to judge whether car FO12ABC is 27 years old or just 3 years old. Unless the car is in front of you, that is.

And it’s not just zero and oh that can cause alphanumerical confusion. The uppercase letter I (“aye”) looks suspiciously similar to the number 1 (one), and to the lowercase l (“ell”). In fact, legend has it that the Queen utters sentences like “One would like to watch telly now” simply because she misread the “I” for a “1” a long time ago and has continued the habit ever since.

Here are several really useful free Mac apps that I’ve come across recently:

  • Macfusion (free). Lets you mount remote filesystems over SSH or FTP, and access them through Finder. Far nicer than using the ssh command in Terminal, or a third-party FTP client. Depends on OSXFUSE. Notes:
    • Finder does already provide FTP access (via “Connect to server…”), but it’s read-only.
    • In order to have write access over SSH, you’ll need to do the trick described here to map your local username to the username of your SSH account.
    • I find it convenient to place a link to Macfusion.app in Finder’s toolbar.
  • BetterTouchTool (free). Brings to Mac the excellent ‘aero-snap’ feature of Windows 7 (plus a few other nice features).
  • Spark (free). Lets you assign custom keyboard shortcuts. Really stable and powerful. Doesn’t look to have been updated for a couple of years, but still works fine on Snow Leopard.
  • VirtualBox (free). Run Windows or Linux on your Mac – and no need to dual boot. I’m staggered by how well this works.
  • BibDesk (free). I use this to manage my PDF library. Very powerful and customisable.
  • PDFNut (free). A PDF viewer with Chrome-style tabs. I only started using it yesterday, and I love it already.