Introducing C4: the C Compiler Concurrency Checker

This post is about a paper by Matt Windsor, Ally Donaldson, and myself that will shortly be presented in the tool demonstrations track of the ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA).

Compilers are a central pillar of our computing infrastructure, so it’s really important that they work correctly. There’s been a lot of work in the last decade or so on testing compilers by feeding them randomly generated programs and checking that they are compiled correctly. But most of this work has been limited to single-threaded (sequential) programs, which leaves open the question of whether compilers compile multi-threaded (concurrent) programs correctly.

There’s a good reason why existing work concentrates on the single-threaded case. The typical way to check that a randomly generated program has been compiled correctly is to compile it twice – whether that’s with two different compilers, or with two different sets of compiler flags, or before and after applying some semantics-preserving transformations – and see whether there’s any discrepancy between the two outputs. If there is, then one of the compilers must be doing something wrong! But this reasoning only works if every randomly generated program is deterministic – which multi-threaded programs seldom are.

So how can we expand compiler testing to the multi-threaded case? Our idea is to run the following sequence of steps in a loop:

  1. First, we generate a small multi-threaded program that is potentially tricky. By ‘small’ we mean around five or six instructions spread across two or three threads. And by ‘potentially tricky’ we mean that the program exercises a corner-case of the notoriously complicated C concurrency model. Lustig et al. showed in ASPLOS 2017 how to use the Alloy Analyzer to automatically generate suites of such programs directly from an axiomatisation of the C concurrency model, so we adopt their technique here. Importantly, each program in this suite comes with a postcondition that characterises all the final states that the program is allowed to reach.
  2. Second, in the hope of making that small multi-threaded program more of a challenge for the compiler-under-test, we apply a series of semantics-preserving transformations to it, in the spirit of Lidbury et al.’s CLSmith tool, ending up with a large multi-threaded program. For instance, we might replace
      x := 42
     with
      if (getenv("john")) {
        x := 42
      } else {
        // lots of randomly
        // generated instructions
      }
    and then always run the generated code in an environment where the variable john has been set to 1. Actually we don’t need our transformations to be semantics-preserving in both directions; we just need to make sure that our transformations do not introduce new allowed outcomes (otherwise our tool might label correct behaviours as bugs). This means that we can also employ transformations like ‘randomly insert a fence’ or ‘randomly strengthen the consistency mode of an atomic operation from relaxed to release/acquire or from release/acquire to sequentially consistent’. Importantly, because the small randomly generated program came with a postcondition, and because our transformations preserve that postcondition, we still know how our program should behave, even though it has now become rather large and unwieldy.
  3. Third, we compile our large multi-threaded program using the compiler-under-test, with randomly chosen command-line options. If the compiler crashes, then we have found a bug.
  4. Fourth, we execute the compiled program on a real machine. We execute the program thousands of times, in the presence of several ‘stressing’ threads that hammer on nearby memory locations in the hope of provoking weird behaviours. This hostile environment is created using the Litmus tool of Alglave et al. If the postcondition is violated on any of these runs, then we have found a bug.

This technique has been implemented in a tool called C4, which is available on GitHub. We applied C4 to several versions of four compilers (GCC, LLVM, the Intel C Compiler, and IBM XL) targeting three architectures (x86, IBM POWER9, and Armv8). C4 revealed two concurrency-related bugs in old versions of GCC, as well as two previously-unknown bugs (which turned out, after test-case reduction, not to be related to concurrency) in IBM XL and GCC.

Of course, this is not as many bugs as we might have liked to find! On the one hand, we suspect that if C4 employed more imaginative transformations, or if its testing throughput were optimised further, then it might be able to uncover more bugs, and this is a direction we’re pursuing. But on the other hand, we also have a growing suspicion that there may not be that many concurrency-related bugs in mainstream compilers in the first place. Perhaps compiler engineers make conservative choices when it comes to implementing concurrency, concerned by the inherent difficulties of testing multi-threaded code. Perhaps they do not shoot for super-aggressive optimisations like they do when compiling sequential code, and thus their code is less likely to contain the kind of faults that other compiler-testing tools have been so successful at exposing. Indeed, we experimented with injecting artificial bugs into LLVM, capturing the kinds of concurrency-related mistakes we’ve seen in other compilers, and found that C4 was effective at catching those. So, perhaps C4 will end up being most valuable not for exposing existing bugs in compilers, but for giving compiler engineers the confidence to go ahead and implement more ambitious optimisations, safe in the knowledge that if they do introduce a bug, C4 will likely catch it.
