Cross-Stack Co-Design for Efficient and Adaptable Hardware Acceleration
Moreau, Thierry Jean
Hardware accelerators are becoming more critical than ever in scaling the capabilities of computer systems in a post-Dennard scaling computing landscape. As abstractions like ISAs and intermediate representations are shifting constantly, building capable software stacks that expose familiar programming interfaces poses a significant engineering challenge. In addition, the push for ever more efficient and cost-effective hardware has brought the need to expose quality-efficiency tradeoffs across the stack, particularly in hardware accelerators, where computation and data movement dominate energy consumption. The goal of this dissertation is to propose hardware and software techniques that work in concert to facilitate the integration of hardware accelerators in today's ever-evolving compute stack. Specifically, we look at co-design methodologies that (1) make it easy to program specialized accelerators, (2) allow for adaptability in the context of evolving workloads, and (3) expose quality-efficiency knobs across the stack to adapt to shifting user requirements. In Chapter 1, I discuss why specialization is critical to push the capabilities of modern systems, and identify the challenges that stand in the way of providing efficient and adaptable specialization moving forward. In Chapter 2, I present SNNAP, a hardware design coupled with a familiar software API that approximately offloads diverse compute-intensive regions of code to a tightly coupled FPGA to deliver significant energy savings. This approach makes it much easier for software programmers to target FPGAs, as long as they can express quality bounds for their target application. In Chapter 3, I present QAPPA, a C/C++ compiler framework that can target quality-programmable accelerators, i.e., accelerator designs that expose quality knobs in their ISA. The key idea of QAPPA is to translate application-level quality bounds into instruction-level quality settings via an auto-tuning process.
In Chapter 4, I present the VTA hardware-software stack, designed for extensible deep learning acceleration as data sets, models, and numerical representations evolve. VTA exposes a layered stack that shifts design complexity away from hardware, which makes updating the stack to support new models and operators a software-centric challenge. Finally, in Chapter 5, I discuss recent efforts outside of the research realm aimed at broadening access to, and improving the reproducibility of, cross-stack Pareto-optimal design.