If you want to compute something, you have two broad options: do it in software, or do it in hardware. A custom piece of hardware can give you the ultimate performance, but can be extremely hard to design. On the other hand, software can be quite easy to write and debug, but running software on a general-purpose processor is relatively inefficient.
An attempt to obtain the best of both worlds is made by high-level synthesis (HLS). An HLS tool takes a piece of software and automatically translates it into hardware. The programmer hence only has to go to the effort of writing software, but obtains the performance benefits of having custom hardware.
The key challenge in HLS is not simply generating hardware, but generating efficient hardware, and also ensuring that the generated hardware is correct — i.e., that it always gives the same outcome as the software provided by the programmer.
This piece of work is one contribution towards tackling that challenge. The context of our work is the HLS of multi-threaded software — that is, software that has been designed so that multiple sequences of instructions (or ‘threads’) can be carried out simultaneously.
There have been a few pieces of work that look at how HLS tools can handle multi-threaded software, but one shortcoming that is common to all of them is that they all do the software-to-hardware translation by working on each thread in isolation. This is a natural starting point, because conventional compilers also work by compiling different pieces of a program separately. It is rare for a compiler to have access to a whole program in one go, because most programs are chunked up into libraries that are written and compiled by many different programmers. However, HLS tools differ from conventional compilers in one important respect: they do have the entire input program available to them. Hence, HLS tools do not need to work on each thread in isolation. Our work demonstrates that by maintaining an awareness of the other threads in the program, each thread can be translated into more efficient hardware.
A little example
Suppose I have a piece of software that consists of three threads. I have four variables:
data2, which are initially zero, and
ready2, which are initially FALSE.
- A1. Set
- A2. Set
- A3. Set
- A4. Set
- B1. Load
- B2. If it is TRUE, then load
- C1. Load
- C2. If it is TRUE, then load
Note that the
ready variables are only set by Thread A once the respective
data variables have been set. Hence, if Thread B sees that
ready1 has been set, it must read
data1 as 42 (not 0). And if Thread C sees that
ready2 has been set, it must read
data2 as 17 (not 0).
Now, as part of the hardware generation process, HLS tools have to schedule software instructions to hardware clock ticks. In order to generate efficient hardware, the tools aim to pack as the software instructions into as few clock ticks as possible. One way to schedule Thread A would be to map its four instructions to four consecutive clock ticks – but can we do better? If we maintain an awareness of what Threads B and C might be doing, then we can. We must preserve the order of instructions A1 and A2 (for otherwise Thread B might read
data1 as 0), and we must preserve the order of instructions A3 and A4 (for otherwise Thread C might read
data2 as 0), but otherwise the four instructions in Thread A can be arbitrarily reordered or parallelised. In fact, we can pack them into just two clock ticks: one clock tick for A1 and A3 together, followed by one clock tick for A2 and A4 together.
Importantly, it is only possible to schedule Thread A efficiently like this because we checked what the other threads are doing, and we determined that they could not possibly notice if we reordered instructions A2 and A3. If we only had Thread A available, then we would have to preserve the strict order of all of its instructions, and hence generate less efficient hardware.
This example is rather small, but we have implemented our analysis as part of an optimisation to the LegUp HLS tool and evaluated it on several real-world multi-threaded programs. Our results indicate that, on average, our optimisation leads to hardware that is about 60% faster.
I mentioned above that there were two parts to the HLS challenge: generating efficient hardware, and generating correct hardware. So far, I have discussed how we improve the efficiency of the generated hardware. But our work also tackles the correctness issue, because we encoded our analysis in our Memalloy modelling tool, and checked that, for all programs of a bounded size, the schedule that our analysis generates is guaranteed to be correct. This kind of check is worth doing because our analysis handles multi-threaded programs that use quite advanced instructions called weakly consistent atomic operations, and these are notoriously tricky to implement correctly.