System for Serving Deep Neural Networks Efficiently
Today, Deep Neural Networks (DNNs) can recognize faces, detect objects, and transcribe speech with nearly human-level performance, and DNN-based applications have become an important workload for both edge and cloud computing. However, it is challenging to design a DNN serving system that satisfies constraints such as latency, energy, and cost while still achieving high efficiency. First, although server-class accelerators provide significant computing power, it is hard to achieve high efficiency and utilization because latency constraints limit how much batching can be used. Second, resource management is a challenging problem for mobile-cloud applications because DNN inference strains device battery capacities and cloud cost budgets. Third, model optimization allows systems to trade accuracy for lower computation demand, but it introduces a model selection problem: which optimized model to use, and when to use it.

This dissertation presents scheduling and resource management techniques that significantly improve throughput and reduce cost and energy consumption in serving systems while meeting these constraints. We present the design, implementation, and evaluation of three systems: (a) Nexus, a serving system for a cluster of accelerators in the cloud that includes a batch-aware scheduler and a query analyzer for complex queries; (b) MCDNN, an approximation-based execution framework spanning mobile devices and the cloud that allocates resource budgets in proportion to applications' frequency of use and systematically trades accuracy for lower resource use; and (c) sequential specialization, a technique that generates and exploits low-cost, high-accuracy models at runtime once it detects temporal locality in streaming applications. Our evaluation on realistic workloads shows that these systems achieve significant improvements in cost, accuracy, and utilization.
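To make the batching tension concrete, the sketch below illustrates the general idea behind latency-aware batch selection: larger batches improve accelerator utilization, but the time spent accumulating and executing a batch must stay within the latency budget. This is a minimal illustration under assumed inputs, not the scheduler described in the dissertation; the profile table, SLO, and request rate are hypothetical values chosen for the example.

```python
# Minimal sketch of latency-aware batch selection (illustration only, not the
# Nexus scheduling algorithm). `profile` maps batch size -> measured execution
# latency in milliseconds; `request_rate` is in requests per millisecond.

def pick_batch_size(profile: dict, slo_ms: float, request_rate: float) -> int:
    """Return the largest profiled batch size that fits within the latency SLO."""
    best = 1
    for batch, exec_ms in sorted(profile.items()):
        # Time to accumulate `batch` requests at the observed arrival rate.
        queue_ms = batch / request_rate
        if queue_ms + exec_ms <= slo_ms:
            best = batch  # larger batches give better accelerator utilization
    return best

# Example: with a 100 ms SLO and 0.5 requests/ms, a batch of 8
# (16 ms queueing + 40 ms execution) still fits the budget.
print(pick_batch_size({1: 10.0, 4: 25.0, 8: 40.0, 16: 70.0},
                      slo_ms=100.0, request_rate=0.5))
```

The same trade-off motivates the other two systems: tighter budgets (battery, cloud cost) push toward cheaper approximate models, while temporal locality in the input stream makes it profitable to switch to specialized models at runtime.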