Languages like C and C++ allow programmers to write concurrent programs. These are programs whose instructions are partitioned into multiple threads, all of which can be run at the same time. To avoid problematic situations where instructions in two different threads want to access the same piece of memory at the same time, conventional concurrent programs use locks. These work in a straightforward way: shared pieces of memory can only be accessed by a thread that holds the appropriate lock, and only one thread can hold a lock at a time. Yet acquiring and releasing locks imposes significant overhead. This has led to the emergence of the lock-free concurrent programming style, which manages to avoid locks by very carefully orchestrating each thread’s sequence of memory accesses.
It is well-known how to translate lock-free C++ programs into machine code that can be understood by conventional processors. But C++ programs can also be translated directly into hardware circuitry, and (until now) little attention has been paid to the question of how to create hardware for the class of lock-free C++ programs. This is a shame because, in the realm of conventional processors, lock-free programs have been conclusively demonstrated to be the fastest class of programs in the world.
In this work, we start with a tool, called LegUp, for translating C++ into hardware, and extend it so that it can cope with lock-free programs. We evaluated our work by using LegUp to create a piece of hardware from a widely-used lock-free program called a single-producer single-consumer circular queue, both before and after adding our extension. We found that the hardware generated by our extended LegUp is roughly twice as fast as that generated by the original tool (which relies on locks).
Going further, we investigate how to generate hardware for lock-free C++ programs that exploit relaxed concurrency. In relaxed concurrency, the programmer writes their code in such a way that the program will still work if certain instructions are performed out-of-order. Although programming in this way is very challenging, it can give the machine that runs the code the flexibility to rearrange the instructions into whichever order is the most efficient. Like many modern lock-free programs, the circular queue mentioned above is written using the relaxed concurrency style, but the original LegUp tool does not exploit the reordering opportunities that this affords (and nor do any other hardware synthesis tools of which we are aware).
In our work, we further extended LegUp so that it can detect when it is given a program that uses relaxed concurrency, and can generate more efficient hardware accordingly. On our circular queue case study, we found that making LegUp sensitive to relaxed concurrency enabled it to generate hardware that is roughly half as fast again (i.e., three times faster than the original lock-based hardware).
Software-to-hardware translators, such as LegUp, are becoming increasingly important as “programmable hardware” devices (called FPGAs) become increasingly powerful and popular. Our work has shown that it is possible to use these translators not just on run-of-the-mill sequential programs, but on high-performance programs that exploit adventurous features such as lock-freedom and relaxed concurrency.