I had a great time in Boulder, Colorado at the FCCM 2018 conference last week. Here are a few of the papers that I found particularly interesting:
- Siddhartha and Kapre’s paper, Hoplite-Q: Priority-Aware Routing in FPGA Overlay NoCs, was about how to build a “network-on-chip”; that is, a miniature network that allows the various components of an FPGA circuit to communicate with each other without having the overhead of connecting every pair of components with a direct wire. Networks-on-chip have been proposed before, but what is new in this paper is the ability for each packet of data travelling through the network to have a “priority” attached to it. Each router in the network then decides how to direct each packet based on its priority. The point of adding priorities like this is to support multiple applications running on the same FPGA, with some applications deemed more important than others. Priorities also provide a means to reduce the chance that an unlucky packet will get stuck going round and round the network for a very long time, by arranging that whenever a router sees the same packet again, it increments that packet’s priority. What I liked most about this paper was that the design of the network-on-chip is very simple, and the method for adding priorities is also very understandable. I could see that there were many design choices to be made about how to add priorities, and I enjoyed thinking about the trade-offs associated with each choice.
- Ragheb and Anderson’s paper, High-Level Synthesis of FPGA Circuits with Multiple Clock Domains, showed how to extend the LegUp tool, which translates a C program into an FPGA circuit, so that it generates hardware with more than one clock domain. A clock domain is a part of a circuit that is controlled by the same clock signal, and the problem with having one clock domain for your whole circuit is that the clock rate is limited by the slowest part of the circuit. Ragheb and Anderson’s technique identifies the functions where the C program tends to spend the most time, and maps those functions to a separate clock domain from the rest of the circuit, in the hope that these functions can be clocked at a faster rate. The difficulty comes when you want components from two different clock domains to communicate with each other: to do this you need a fairly complicated ‘handshaking’ protocol. Nonetheless, even with the overhead of this handshaking faff, the fact that the subcircuit is operating at a faster clock rate means that Ragheb and Anderson are still able to obtain decent overall speedups.
- Islam and Kapre’s paper, LegUp-NoC: High-Level Synthesis of Loops with Indirect Addressing, could be seen as a fusion of the first two papers I have mentioned. Islam and Kapre observed that when translating certain C loops into hardware, if there is uncertainty about which memory location each iteration is writing to (e.g., because the location depends on data that is only available at runtime) then high-level synthesis tools like LegUp tend to generate a crossbar of wires, in which every iteration is directly connected to every memory location, so as to keep all options open. The problem is that crossbars take up a lot of space on the FPGA. Islam and Kapre’s solution is to replace the crossbar with a network-on-chip. Although there is bound to be some delay as a result of routing data packets through this network, the authors make the case that this is a price worth paying in order to avoid the expensive crossbar. I enjoyed this paper a lot, though as my student Nadesh Ramanathan pointed out during the Q&A session, there are rather a lot of restrictions on the kind of loop that the proposed solution can be applied to.
- A high-level synthesis tool generates a hardware description from a C program. To actually obtain an executable piece of hardware, this description has to be synthesised into an FPGA configuration via a process that can take several hours or even days. To ease the development process, most high-level synthesis tools provide a report that estimates the properties of the final hardware — such as how large the circuit will be, and what clock rate should be achievable. Dai et al.’s short paper, Fast and Accurate Estimation of Quality of Results in High-Level Synthesis with Machine Learning, explains that these reports are often wildly inaccurate. To address this problem, Dai et al. have built a tool that uses machine learning techniques to generate much more accurate reports. This was a really nice paper, and the worthy winner of the Best Short Paper award at the end of the conference. I think it would be interesting to investigate how well Dai et al.’s tool can cope with hardware descriptions that are handcrafted rather than generated from high-level synthesis.
- Jain et al.’s paper, Microscope on Memory: MPSoC-enabled Computer Memory System Assessments, focused on a class of devices called systems-on-chip, in which a conventional processor and an FPGA are packed into a single chip. They propose to use the system-on-chip to evaluate memory systems. Specifically, the conventional processor is used to run a benchmarking application, and the surrounding FPGA is used to simulate the newfangled memory system of interest. Simulating the memory system in hardware is much faster than doing so in software, which means that the simulation only needs to be about 20x slower than real life. I wondered whether this setup might be applicable to testing other peripherals besides memory systems, such as Ethernet adapters.
- Cheuk-Lun et al.’s paper, FPGA-based Real-time Super-Resolution System for Ultra High Definition Videos, showed how to use an FPGA to convert a low-resolution image (around 2000 pixels across) into a high-resolution image (around 4000 pixels across) in a manner that avoids introducing too many visual artefacts like pixellation and blur. The technique certainly seemed to work very well, though I did notice a possible weakness in the way the authors evaluated it. They started with high-resolution images, downgraded them to low-resolution images, and then checked how closely their technique could recreate the original high-resolution images — what if their technique, which uses machine learning at its core, is overly dependent on the specific method they use to obtain the low-resolution images? Would it still work as well if given low-resolution images that have not been extracted from a high-resolution image in the same way, I wonder?
- Saitoh et al.’s paper, A High-Performance and Cost-Effective Hardware Merge Sorter without Feedback Datapath, proposed a new hardware implementation of mergesort, a popular algorithm for sorting data, and one that is particularly suited to implementation in hardware because of its amenability to parallelisation. The implementation was actually rather similar to one presented at FCCM last year, but with one crucial difference. Last year’s implementation had a feedback path: a wire that carried data from the output of one of the stages back to its input. The presence of this feedback path meant that this stage of the implementation could not be pipelined. Saitoh et al.’s new implementation uses a neat trick to rewrite this stage without the feedback path, so that it can now be pipelined, and hence be run much faster (about 60% faster, if I remember rightly).
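
To make the priority idea from the Hoplite-Q summary a little more concrete, here is a toy software model of a single contested router output. It is only a sketch under my own assumptions, not the paper's design: the names `Packet`, `arbitrate` and `cycles_until_delivered` are made up for illustration, the losing packet is assumed to loop straight back to the same router, and a fresh rival packet is assumed to arrive every cycle.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    name: str
    priority: int
    deflections: int = 0


def arbitrate(contenders, aging=True):
    """Pick the winner for one contested output port.

    The highest-priority packet wins; ties go to the earliest in the list.
    Every loser is deflected and, if aging is on, has its priority raised
    by one so that it fares better the next time it is seen.
    """
    winner = max(contenders, key=lambda p: p.priority)
    losers = [p for p in contenders if p is not winner]
    for p in losers:
        p.deflections += 1
        if aging:
            p.priority += 1
    return winner, losers


def cycles_until_delivered(packet, rival_priority, aging=True, max_cycles=100):
    """Count cycles until `packet` wins against a fresh rival every cycle.

    Returns None if the packet is still stuck after `max_cycles`.
    """
    for cycle in range(1, max_cycles + 1):
        rival = Packet("rival", rival_priority)
        winner, _ = arbitrate([packet, rival], aging=aging)
        if winner is packet:
            return cycle
    return None


if __name__ == "__main__":
    with_aging = Packet("background", priority=0)
    print("with aging:", cycles_until_delivered(with_aging, rival_priority=3))

    without_aging = Packet("background", priority=0)
    print("without aging:",
          cycles_until_delivered(without_aging, rival_priority=3, aging=False))
```

In this toy setting the low-priority packet eventually catches up with its rivals when aging is switched on, whereas without aging it is deflected until the simulation gives up. The real trade-offs (how many priority bits to use, how ties are broken, and so on) are exactly the design choices I enjoyed thinking about.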
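
My worry about the super-resolution evaluation can also be illustrated with a small sketch. Everything here is an assumption of mine rather than the authors' setup: I use PSNR as the similarity score, nearest-neighbour upscaling as a stand-in for the real super-resolution step, and two hypothetical ways of producing the low-resolution input (block averaging and simple decimation). Images are plain lists of rows of greyscale values with even dimensions.

```python
import math


def downscale_average(img):
    """Halve each dimension by averaging every 2x2 block of pixels."""
    return [
        [(img[y][x] + img[y][x + 1] + img[y + 1][x] + img[y + 1][x + 1]) / 4
         for x in range(0, len(img[0]), 2)]
        for y in range(0, len(img), 2)
    ]


def downscale_decimate(img):
    """Halve each dimension by keeping only every other pixel."""
    return [row[::2] for row in img[::2]]


def upscale_nearest(img):
    """Stand-in for super-resolution: repeat each pixel in a 2x2 block."""
    out = []
    for row in img:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def psnr(reference, candidate, peak=255.0):
    """Peak signal-to-noise ratio in dB (infinite for identical images)."""
    count = len(reference) * len(reference[0])
    mse = sum(
        (r - c) ** 2
        for ref_row, cand_row in zip(reference, candidate)
        for r, c in zip(ref_row, cand_row)
    ) / count
    return math.inf if mse == 0 else 10 * math.log10(peak ** 2 / mse)


def evaluate(original, downscale):
    """Degrade the original, reconstruct it, and score the reconstruction."""
    return psnr(original, upscale_nearest(downscale(original)))


if __name__ == "__main__":
    original = [
        [0, 10, 40, 50],
        [20, 30, 60, 70],
        [80, 90, 120, 130],
        [100, 110, 140, 150],
    ]
    print("averaged input: ", evaluate(original, downscale_average))
    print("decimated input:", evaluate(original, downscale_decimate))
```

Even with this trivial reconstruction step, the score depends on how the low-resolution image was produced, which is why I would be curious to see results on low-resolution inputs obtained some other way.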