Highlights from FPGA 2020

Here are a few personal highlights from the FPGA 2020 conference, which took place this week in Seaside, California. (Main photo credit: George Constantinides.)

Jakub Szefer’s invited talk on “Thermal and Voltage Side Channels and Attacks in Cloud FPGAs” described a rather nifty side-channel through which secrets could be leaked in the context of cloud-based FPGAs. The scenario is that one user of an FPGA would like to transmit information secretly to the next user of the same FPGA. Jakub suggests transmitting this information via the physical temperature of the FPGA. His proposal is that if the first user would like to transmit a ‘1’, they heat up the FPGA by running a circuit that draws a lot of current, such as a ring oscillator; if they would like to transmit a ‘0’, they don’t. The second user, once they arrive, measures the temperature of the FPGA. Jakub showed that even if there is a reasonable gap between the first user finishing and the second user starting, the measurement can still be taken fairly reliably because the heat dissipates quite slowly. Follow-on work asks whether it is possible to transmit more information by heating up (and then measuring the temperature of) individual parts of the FPGA.
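
To make the covert channel concrete, here is a minimal runnable sketch of the encode/decode logic. The thermal model (`ToyFpga`) and all of its constants are invented stand-ins for the real FPGA, chosen only so the example runs:

```cpp
#include <iostream>

// Toy thermal model standing in for the real FPGA; all constants are
// invented for illustration.
struct ToyFpga {
    double temp = 25.0;                                 // ambient, degrees C
    void heat(double secs) { temp += 0.05 * secs; }     // ring oscillators running
    void idle(double secs) { temp += (25.0 - temp) * 0.002 * secs; }  // slow cooling
};

int main() {
    for (bool bit : {true, false}) {
        ToyFpga fpga;
        // First tenant: heat the die to send '1', stay idle to send '0'.
        if (bit) fpga.heat(60.0); else fpga.idle(60.0);
        // Gap between tenancies: the die cools, but only slowly.
        fpga.idle(30.0);
        // Second tenant: threshold against a calibrated baseline.
        bool received = fpga.temp > 26.0;
        std::cout << "sent " << bit << ", received " << received << '\n';
    }
}
```

The slow cooling term is what makes the channel viable: the transmitted bit survives the gap between the two tenancies.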

Stefan Nikolic’s paper on “Straight to the Point: Intra- and Inter-Cluster LUT Connections to Mitigate the Delay of Programmable Routing” (with Grace Zgheib and Paolo Ienne) struck me as a nice example of ‘out of the box’ thinking. FPGAs are designed to allow any two LUTs to be connected to one another, but because of their rigid structure, the route between them may be a little slow and circuitous. Stefan’s paper asks what would happen if we could just stick a direct wire between certain pairs of LUTs, like you can when designing fully-custom hardware. It’s a hypothetical question, of course, because FPGAs don’t actually allow this. But Stefan’s experiments using a simulator suggest that some significant performance improvements could be obtained if they did, so perhaps the FPGA manufacturers should bear this in mind for the future!
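
As a back-of-the-envelope illustration (all delays invented, not taken from the paper), here is the arithmetic behind why a direct wire helps: a programmable route pays a switch delay at every hop, whereas a hard-wired connection pays only for the metal:

```cpp
#include <iostream>

// Toy delay model: a programmable route passes through several switch
// blocks, each adding mux/buffer delay, while a hypothetical direct
// LUT-to-LUT wire pays only the wire delay itself.
int main() {
    const double switch_delay_ps = 120.0;  // per programmable switch (assumed)
    const double wire_delay_ps   = 40.0;   // metal-only segment delay (assumed)
    const int hops = 4;                    // a multi-hop LUT-to-LUT route

    double programmable = hops * (switch_delay_ps + wire_delay_ps);  // 640 ps
    double direct       = wire_delay_ps;                             //  40 ps
    std::cout << "programmable route: " << programmable << " ps, "
              << "direct wire: " << direct << " ps\n";
}
```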

Philip Leong’s presentation on “LUXOR: An FPGA Logic Cell Architecture for Efficient Compressor Tree Implementations” (on joint work with Seyedramin Rasoulinezhad, Siddhartha, Hao Zhou, Lingli Wang, and David Boland) was motivated by the observation that in several important applications, a lot of the LUTs on an FPGA are employed to implement XOR operations. His proposal is to modify the design of the FPGA itself to add dedicated XOR gates for handling these operations. The idea is that dedicated XOR gates can implement XOR operations rather more efficiently than general-purpose LUTs can. Moreover, by moving XOR operations out of the LUTs, those LUTs can be more usefully employed for other things. I found it rather interesting to note the connection between the work Philip described and some previous work from Imperial on LUTNet. Both take as their starting point the observation that a lot of LUTs are being used to implement XOR operations. But where Philip’s paper suggests moving these XORs to their own hardware, LUTNet suggests replacing the XOR operations with different ones that can still be implemented on a LUT but may be more suitable for that particular design, thus exploiting the flexibility of LUTs. Perhaps future work will see both solutions combined together.
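
To see where all those XORs come from, consider the 3:2 compressor (a full adder applied bitwise), which is the building block of the compressor trees used in multipliers and popcounts; its ‘sum’ output is pure XOR. A small self-contained sketch:

```cpp
#include <cstdint>
#include <iostream>

// A 3:2 compressor reduces three bits to a sum and a carry. The sum is
// pure XOR -- which is why compressor trees keep so many LUTs busy doing XOR.
struct SumCarry { uint32_t sum, carry; };

SumCarry compress32(uint32_t a, uint32_t b, uint32_t c) {
    return {
        a ^ b ^ c,                       // bitwise XOR: the 'sum' outputs
        (a & b) | (a & c) | (b & c)      // bitwise majority: the 'carry' outputs
    };
}

int main() {
    // Reduce three words to two whose weighted total is unchanged:
    auto [s, c] = compress32(0b1011, 0b0110, 0b1101);
    std::cout << "sum=" << s << " carry=" << c
              << " total=" << (s + 2 * c) << '\n';  // equals 11 + 6 + 13 = 30
}
```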

Vinod Kathail from Xilinx gave a keynote about the new Vitis high-level synthesis (HLS) framework. Vitis HLS seems quite appealing to me, for two reasons. One is that it seems to expose rather more of the internal stages of the HLS process than Xilinx’s previous tools, which should make it easier for us to prototype research ideas for HLS. The other is that Vitis HLS promises not to need the plethora of ‘pragma’ annotations that current HLS tools expect programmers to provide. If it is really the case that Vitis HLS can manage to produce performant hardware without relying on these hints, then we may finally be heading towards a world where software programmers really can produce hardware designs without needing a degree in electrical engineering!
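
For a flavour of the annotation burden, here is a sketch of a simple loop decorated in the style of Vivado HLS pragmas. It is illustrative only: the exact pragma spellings, defaults, and effects vary between tools and versions.

```cpp
// A dot product, annotated the way existing HLS tools typically expect
// before they will produce performant hardware (Vivado-HLS-style pragmas).
int dot_product(const int a[64], const int b[64]) {
#pragma HLS ARRAY_PARTITION variable=a cyclic factor=4 dim=1
#pragma HLS ARRAY_PARTITION variable=b cyclic factor=4 dim=1
    int acc = 0;
    for (int i = 0; i < 64; ++i) {
#pragma HLS PIPELINE II=1
        acc += a[i] * b[i];  // without the pragmas, the tool is likely to
                             // schedule this loop far more conservatively
    }
    return acc;
}
```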

Sameh Attia’s paper on “StateMover” (with Vaughn Betz) described a fascinating system for debugging hardware that sits halfway between simulating it and executing it. The advantage of simulation is its flexibility: the designer can inspect (or even change) the contents of every component, and they can run the clock forwards or backwards until they pinpoint the problem in their design. But the disadvantage of simulation is its slowness: simulating a design can take about 100,000 times longer than executing it properly. Sameh’s work describes a way to move the entire state of a design back and forth between an FPGA and a software simulator. His proposal is that a design starts running on an FPGA; then, at some predetermined trigger point, the clock is stopped and the current state of the FPGA is copied over to the simulator. The designer can then investigate what is going on at this point, perhaps stepping the clock forwards or backwards a little, and then copy the new state back to the FPGA and carry on running it. The best of both worlds is thus achieved: the fiddly parts of the design can be run in the simulator, and the long, uninteresting parts can be left to the hardware. Myself, I wondered whether this system could form the basis of an exhaustive hardware simulator: one that runs a design in hardware until it comes to some sort of ‘branch’ or ‘decision point’, at which point it saves its current state and executes one of the paths. Later it restores this saved state and executes the other path. Thus, it might be feasible to explore all the possible behaviours of a system.
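
Here is a speculative software sketch of that musing (mine, not StateMover’s): treat ‘copy the state off the FPGA’ as pushing a checkpoint, and explore both sides of every decision point by restoring checkpoints from a stack.

```cpp
#include <iostream>
#include <stack>
#include <vector>

// Each checkpoint records where we are and which branches we took to get here.
struct State { int pc = 0; std::vector<bool> choices; };

// Toy stand-in for detecting a decision point: three binary choices in a row.
bool at_decision_point(const State& s) { return s.pc < 3; }

// Stand-in for "let the hardware run the rest at full speed".
void run_to_completion(const State& s) {
    std::cout << "trace:";
    for (bool c : s.choices) std::cout << ' ' << c;
    std::cout << '\n';
}

int main() {
    std::stack<State> pending;           // saved checkpoints awaiting exploration
    pending.push({});
    while (!pending.empty()) {
        State s = pending.top(); pending.pop();
        if (at_decision_point(s)) {
            for (bool choice : {false, true}) {   // fork: checkpoint, take each path
                State next = s;
                next.choices.push_back(choice);
                next.pc++;
                pending.push(next);
            }
        } else {
            run_to_completion(s);        // no more branches: hand back to hardware
        }
    }                                    // prints all 2^3 = 8 possible traces
}
```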

Lana Josipovic’s Best-Paper-award-winning paper on “Buffer Placement and Sizing for High-Performance Dataflow Circuits” (with Shabnam Sheikhha, Andrea Guerrieri, Paolo Ienne, and Jordi Cortadella) was concerned with improving the performance of dynamically scheduled circuits, which behave a little like Petri nets. Inserting buffers into these circuits cannot change their functional behaviour, because every component executes when (and only when) all of its inputs are ready; adding buffers may mean that these inputs become ready a little later, but it cannot change their values. Nonetheless, there are a few reasons for inserting buffers into a circuit: to break up the critical path so as to increase the operating frequency, to break a combinational cycle, or to balance the latency of the tines of a control-flow fork. Lana uses linear programming to work out when to place buffers in a circuit, where to place them, and how big to make each one; the result is an HLS tool that can generate considerably faster circuits.
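
A toy calculation (with invented delays, nothing to do with the paper’s benchmarks) captures the frequency argument: the clock period is dictated by the longest combinational stretch between buffers, so splitting that stretch raises the achievable frequency without changing what the circuit computes.

```cpp
#include <algorithm>
#include <iostream>
#include <vector>

// The achievable clock frequency is set by the slowest pipeline stage.
double fmax_mhz(const std::vector<double>& stage_delays_ns) {
    double worst = *std::max_element(stage_delays_ns.begin(),
                                     stage_delays_ns.end());
    return 1000.0 / worst;
}

int main() {
    // One long combinational path of 8 ns between dataflow components...
    std::cout << "unbuffered: " << fmax_mhz({8.0}) << " MHz\n";       // 125 MHz
    // ...split by a buffer into 5 ns + 3 ns stages. Functionally identical,
    // since every component still waits for all of its inputs to be ready.
    std::cout << "buffered:   " << fmax_mhz({5.0, 3.0}) << " MHz\n";  // 200 MHz
}
```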

Han Chen’s paper on “FPGA-Accelerated Samplesort For Large Data Sets” (with Sergey Madaminov, Michael Ferdman, and Peter Milder) addresses one of the longest-studied problems in computer science: how can we sort data quickly? They describe a hardware-accelerated implementation of samplesort, which is a pleasingly simple sorting algorithm that I had not come across before. Samplesort is a little like mergesort, in that it involves partitioning the dataset into lots of parts that are all sorted independently. But where mergesort splits the dataset blindly, sorts each part, and finally merges the sorted subsets back together, samplesort first divides the dataset into several buckets (with each bucket covering a specific range of values), then sorts each bucket, and finally concatenates the buckets together. So both algorithms have an embarrassingly-parallel middle step, but mergesort has its complexity in the third step and samplesort has its complexity in the first step. Of course, the tricky part of samplesort is working out what ranges to use for each bucket so that each one ends up containing roughly the same number of data items. This is addressed by having an extra preprocessing stage, in which the data is sampled (hence the name of the algorithm) to estimate the appropriate range for each bucket. There are all sorts of interesting trade-offs in Han’s design. How many buckets should there be? (Presumably this depends on how many buckets it is possible to sort in parallel.) How long should we spend sampling? (Time spent sampling is time not spent sorting, but if the sampling is inaccurate then the subsequent sorting will not be efficient.) How much contingency space should there be in each bucket? (If the buckets are too small, they may spill over, but if they are unnecessarily big, then hardware resources are being wasted.)
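
For the curious, here is a plain software sketch of samplesort (my own illustration with invented parameters, not Han’s hardware design), showing the sample/scatter/sort/concatenate pipeline:

```cpp
#include <algorithm>
#include <iostream>
#include <iterator>
#include <random>
#include <vector>

std::vector<int> samplesort(std::vector<int> data, int num_buckets) {
    // 1. Sampling: estimate bucket boundaries ("splitters") from a small
    //    random sample of the data (oversampling factor of 8 is arbitrary).
    std::mt19937 rng(42);
    std::vector<int> sample;
    std::sample(data.begin(), data.end(), std::back_inserter(sample),
                8 * num_buckets, rng);
    std::sort(sample.begin(), sample.end());
    std::vector<int> splitters;
    for (int i = 1; i < num_buckets; ++i)
        splitters.push_back(sample[i * sample.size() / num_buckets]);

    // 2. Scatter: each element goes to the bucket covering its value range.
    std::vector<std::vector<int>> buckets(num_buckets);
    for (int x : data) {
        int b = std::upper_bound(splitters.begin(), splitters.end(), x)
              - splitters.begin();
        buckets[b].push_back(x);
    }

    // 3. Sort each bucket independently (the embarrassingly parallel step),
    //    then simply concatenate -- no merge needed, unlike mergesort.
    std::vector<int> out;
    for (auto& bucket : buckets) {
        std::sort(bucket.begin(), bucket.end());
        out.insert(out.end(), bucket.begin(), bucket.end());
    }
    return out;
}

int main() {
    std::vector<int> v{5, 3, 9, 1, 7, 2, 8, 6, 4, 0};
    for (int x : samplesort(v, 3)) std::cout << x << ' ';
    std::cout << '\n';
}
```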

Finally, there were also two papers co-authored by me. Yann Herklotz’s paper on “Finding and Understanding Bugs in FPGA Synthesis Tools” described his efforts to find bugs in tools like Quartus, Vivado, and Yosys; his presentation was greeted with an amusing mixture of alarm and pride from the developers of these tools, depending on the respective number of bugs he found in each! And Jianyi Cheng’s paper on “Combining Dynamic and Static Scheduling in High-level Synthesis” (also with Lana Josipovic, George Constantinides, and Paolo Ienne) described his quest for a scheduling approach that combines the advantages of static (i.e. compile-time) scheduling and dynamic (i.e. run-time) scheduling. Both Yann and Jianyi did a great job with their presentations, making me a very proud supervisor!
