Optimizing Data Processing Through Verified Lifting

Authors

Ahmad, Maaz Bin Safeer

Abstract

Some of the most exciting and impactful software systems being built today, from data analytics to artificial intelligence, rely on computationally expensive algorithms. A key hurdle in bringing these technologies to end-users is finding highly optimized implementations for these algorithms. Unfortunately, the ever-increasing complexity of hardware architectures coupled with the diverse set of available hardware backends makes writing highly efficient and portable code challenging. Over the past decade, a variety of domain-specific languages (DSLs) have been developed to automatically generate near-optimal device-specific implementations from high-level specifications. However, existing software written in general-purpose programming languages such as C++ or Java does not automatically benefit from these new DSL compilers. In fact, legacy software must first be re-written using the DSL’s application programming interface (API). This presents an enormous engineering burden as the amount of code that needs to be re-written may be large. Any optimizations in legacy software can obfuscate the embedded algorithms, making the code difficult to understand and translate. In addition, re-writing code using an entirely different interface risks the introduction of bugs and unintended changes to the semantics of the application. Finally, as hardware continues to evolve, new DSLs continue to emerge, leaving developers perpetually chasing the state of the art. This thesis addresses the problem of how to automatically translate legacy data-processing code to high-level domain-specific APIs. To do so, we build compilers that use verified lifting to first generate a semantic summary of the legacy code. The program summary, written using a high-level intermediate representation (IR), describes the semantics of the algorithm implemented in the legacy code. The compiler then uses the generated summary to produce new code in the target DSL’s API.
Our verified-lifting-based compilers use program synthesis to infer the program summaries without needing any re-write rules and verify that each summary is an exact semantic match to the input legacy code. To demonstrate the feasibility and efficacy of our approach, we introduce three verified-lifting-based compilers. Casper is a tool that automatically re-writes legacy Java code to MapReduce frameworks such as Hadoop or Apache Spark. Since different implementations of the same algorithm are often possible within MapReduce frameworks, Casper uses a domain-specific cost model to identify highly efficient implementations of the legacy code. Dexter is a tool designed to automatically re-write legacy image-processing functions written in C++ to Halide, a modern DSL for image processing. Dexter introduces a novel algorithm that deconstructs the synthesis of program summaries into three distinct stages, allowing it to scale to complex real-world code. Finally, Rake is a tool that uses our verified lifting methodology to perform instruction selection within the Halide DSL. Modern hardware accelerators often implement complex domain-specific patterns in their instruction set architectures (ISAs), and rule-based instruction selection is not always able to detect the best usage of these higher-level instructions. Rake uses verified lifting to re-write input expressions into an IR of uber-instructions before lowering the IR to the target backend. The tools presented in this thesis have been used to translate tens of thousands of lines of code, including code from real-world applications such as Adobe Photoshop. Together, they demonstrate the value of using program synthesis to implement provably correct lifting transformations to optimize data-processing code.
We believe this thesis could be the prelude to a new generation of compilers, ones that can infer higher-level semantics of the input code and port algorithms across different abstractions to unlock the best optimizations.
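
As an illustration only, the lifting step described in the abstract can be sketched in a few lines of Python. This is a toy under stated assumptions, not the method used by Casper, Dexter, or Rake: the summary grammar (a single map followed by a single reduce), the names `MAP_FNS`, `REDUCE_FNS`, `run_summary`, `bounded_inputs`, and `lift`, and the legacy loop are all hypothetical. In particular, the check here is exhaustive testing on small inputs, which is much weaker than the formal verification the thesis describes.

```python
from functools import reduce
from itertools import product


def legacy_sum_of_squares(xs):
    """Legacy-style imperative loop whose meaning we want to recover."""
    total = 0
    for i in range(len(xs)):
        total = total + xs[i] * xs[i]
    return total


# Hypothetical summary grammar: reduce(combine, map(transform, xs), init).
MAP_FNS = {
    "x": lambda x: x,
    "x * x": lambda x: x * x,
    "x + 1": lambda x: x + 1,
}
REDUCE_FNS = {
    "a + b": (lambda a, b: a + b, 0),
    "a * b": (lambda a, b: a * b, 1),
    "max(a, b)": (lambda a, b: max(a, b), 0),
}


def run_summary(map_name, reduce_name, xs):
    """Evaluate one candidate summary on a concrete input list."""
    combine, init = REDUCE_FNS[reduce_name]
    return reduce(combine, map(MAP_FNS[map_name], xs), init)


def bounded_inputs(max_len=3, values=(-2, -1, 0, 1, 2)):
    """All lists up to max_len over a small value set (assumed bounds)."""
    for n in range(max_len + 1):
        for xs in product(values, repeat=n):
            yield list(xs)


def lift(legacy):
    """Enumerate candidate summaries and return the first one that agrees
    with the legacy function on every bounded input, or None."""
    for map_name, reduce_name in product(MAP_FNS, REDUCE_FNS):
        if all(
            run_summary(map_name, reduce_name, xs) == legacy(xs)
            for xs in bounded_inputs()
        ):
            return map_name, reduce_name
    return None


if __name__ == "__main__":
    print(lift(legacy_sum_of_squares))
```

Once a summary such as "map with `x * x`, then reduce with `a + b`" has been found, a code generator could emit the equivalent call in a target API; that step is omitted here because it depends on the chosen DSL.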

Description

Thesis (Ph.D.)--University of Washington, 2022
