If you’ve ever attended a seminar about weak memory models, chances are good that you’ve been shown a small concurrent program and asked to ponder what is allowed to happen if its threads are executed on two or three different cores of a multicore CPU. For instance, you might be shown this program:
```
// Thread 1:   || // Thread 2:
x := 1;        || y := 1;
r0 := y;       || r1 := x;
```
and asked what values can end up in the registers r0 and r1 if the program is executed on a multicore x86 machine, assuming that the shared locations x and y are both initially set to 0. The speaker might enjoy informing audience members unfamiliar with the intricacies of x86 memory semantics that the seemingly impossible outcome r0=r1=0 is actually allowed.
In a paper that Dan Iorga, Tyler Sorensen, Ally Donaldson and I will shortly present at OOPSLA 2021, we ask a slightly different question: what is allowed to happen if one of those threads is executed by a CPU core, and the other is simultaneously executed by a custom hardware design implemented on an FPGA?
That’s not quite as contrived a situation as you might think. In our era of heterogeneity, computing is increasingly being done by systems-on-chip that are made up of several different computing components (conventional CPUs, powerful GPUs, and programmable FPGAs), all with the potential to work together on a problem, each tackling the part that they are best suited to.
So what might happen if we run that program on a system-on-chip such as Intel’s recently-released Xeon+FPGA processor, which packs a multicore Xeon CPU and an Arria FPGA onto a single chip?
```
// Thread 1 (CPU):  || // Thread 2 (FPGA):
x := 1;             || y := 1;
r0 := y;            || r1 := x;
```
There are several reasons why weird behaviours, like the r0=r1=0 one we saw earlier, might still be allowed to happen. For one, the write buffers on the x86 CPU might cause the write of x to be delayed until after the read of y. For another, the FPGA might allow similar reordering; indeed, on the particular system-on-chip that we studied, memory access requests by the FPGA can be batched up and reordered before being dispatched. For a third, there are multiple channels that link the FPGA to the memory that it shares with the CPU, and memory requests in one channel might overtake those in another channel.
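On the CPU side, the standard remedy for the write-buffer reordering is a full fence between the write and the read. In C11 terms it might look like this (a generic sketch of the fencing idiom, not the paper's treatment, whose FPGA-side synchronisation is different machinery):

```c
#include <stdatomic.h>

/* Shared locations, as in the litmus test (self-contained here). */
static atomic_int x, y;

/* Thread 1 with a fence: the seq_cst fence (an MFENCE on x86)
   forces the store to x to drain from the write buffer before the
   load of y executes.  With both threads fenced like this, the
   r0=r1=0 outcome is forbidden. */
int thread1_fenced(void) {
    atomic_store_explicit(&x, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
    return atomic_load_explicit(&y, memory_order_relaxed); /* r0 */
}

/* Thread 2, fenced symmetrically. */
int thread2_fenced(void) {
    atomic_store_explicit(&y, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
    return atomic_load_explicit(&x, memory_order_relaxed); /* r1 */
}
```

A fence only constrains the core it runs on, of course; it says nothing about batching inside the FPGA or overtaking between channels, which is precisely why a model covering both sides of the chip is needed.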
And there are other examples of concurrent programs where the CPU/FPGA combo can yield behaviours even more weird than those allowed on the CPU alone.
In our work, we’ve designed a mathematical model that explains which memory accesses by the CPU and by the FPGA can get reordered, and which ones can’t. We’ve validated our model by checking that it lines up with the available documentation from Intel, by talking it through with some Intel engineers who designed parts of the chip, and by generating and running hundreds of little concurrent programs as test-cases on a real machine (access to which was kindly provided to us by the Intel Labs Academic Compute Environment).
Our work shows how we can begin to reason about the correctness of small concurrent programs for these systems-on-chip. For instance, we use an encoding of our model in the CBMC model checker to confirm that a queue data structure shared between the CPU and the FPGA is implemented correctly. On the flip side, we've also conducted some experiments that show what can happen if this data structure is implemented incorrectly: that is, we've shown that if some synchronisation operations are omitted, then the queue can indeed lose or duplicate some elements. (Which might not be the end of the world if you're into approximate computing.)
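To give a flavour of the kind of synchronisation at stake, here is a single-producer/single-consumer ring buffer in C11 (my own illustrative sketch, not the paper's actual CPU/FPGA queue). The release/acquire pairs are exactly the sort of operations whose omission lets elements be lost or duplicated:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define QSIZE 16  /* capacity; a power of two (hypothetical choice) */

typedef struct {
    int buf[QSIZE];
    atomic_size_t head;  /* next slot to read  (advanced by consumer) */
    atomic_size_t tail;  /* next slot to write (advanced by producer) */
} spsc_queue;

bool spsc_enqueue(spsc_queue *q, int v) {
    size_t t = atomic_load_explicit(&q->tail, memory_order_relaxed);
    size_t h = atomic_load_explicit(&q->head, memory_order_acquire);
    if (t - h == QSIZE) return false;            /* full */
    q->buf[t % QSIZE] = v;
    /* Release: the element must be visible before the new tail is.
       Weaken this to relaxed and the consumer may read a slot
       before the data has actually landed. */
    atomic_store_explicit(&q->tail, t + 1, memory_order_release);
    return true;
}

bool spsc_dequeue(spsc_queue *q, int *out) {
    size_t h = atomic_load_explicit(&q->head, memory_order_relaxed);
    size_t t = atomic_load_explicit(&q->tail, memory_order_acquire);
    if (t == h) return false;                    /* empty */
    *out = q->buf[h % QSIZE];
    /* Release: the slot must be fully read before it is handed
       back to the producer for reuse. */
    atomic_store_explicit(&q->head, h + 1, memory_order_release);
    return true;
}
```

A bounded-model-checking tool like CBMC can explore the interleavings of a small producer/consumer harness over code like this and report whether a FIFO property can be violated; checking it against a CPU/FPGA memory model rather than the C11 one is where our encoding comes in.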
Ultimately, we see this work as an important step towards providing a firm foundation for reasoning about programs that run on, and compilers that target, the hybrid computing devices that are fast becoming a central part of our computing landscape.