A Joint Model Provisioning and Request Dispatch Solution for Mobile Inference Serving at the Edge


Authors

Prasad, Anish Nagendra

Abstract

With the advancement of machine learning (ML), a growing number of mobile clients rely on ML inference for making time-sensitive and safety-critical decisions. As a result, high-quality, low-latency inference services at the network edge have become essential to modern intelligent applications. This thesis proposes a novel solution that jointly provisions inference models and dispatches inference requests to reduce the latency of mobile inference serving on edge nodes. Unlike existing solutions that either direct inference requests to the nearest edge node or balance the workload among edge nodes, the proposed solution provisions each edge node with the optimal type and number of inference serving instances under a holistic consideration of networking, computing, and memory resources. Mobile clients can thus utilize ML inference services on the edge nodes that offer the lowest inference serving latency. In this work, we implement the proposed solution using TensorFlow Serving and Kubernetes on a cluster of edge nodes comprising Nvidia Jetson Nano and Jetson Xavier devices. We further demonstrate the proposed solution's effectiveness in reducing the overall inference latency under various system parameters and practical system settings through simulation and testbed experiments, respectively.
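To make the joint provisioning-and-dispatch idea concrete, the sketch below illustrates one possible (hypothetical) greedy heuristic: edge nodes are provisioned with serving-instance variants subject to a memory budget, and each request is then dispatched to the hosted instance with the lowest estimated network-plus-compute latency. This is not the thesis' actual algorithm or implementation; the node names, capacities, latency numbers, and the heuristic itself are illustrative assumptions only.

```python
"""Minimal, hypothetical sketch of joint provisioning and request dispatch.
All names, capacities, and latencies are illustrative assumptions."""
from dataclasses import dataclass, field


@dataclass
class Variant:               # an inference-serving instance type
    name: str
    mem_mb: int              # memory footprint per instance
    compute_ms: float        # per-request compute time on a reference device


@dataclass
class Node:                  # an edge node (e.g., a Jetson-class device)
    name: str
    free_mb: int             # remaining memory budget
    speed: float             # compute speed relative to the reference device
    rtt_ms: dict             # network round-trip time to each client zone
    hosted: list = field(default_factory=list)


def serve_latency(node: Node, var: Variant, zone: str) -> float:
    """Estimated end-to-end latency: network RTT plus scaled compute time."""
    return node.rtt_ms[zone] + var.compute_ms / node.speed


def provision(nodes, variants, zones):
    """Greedy provisioning: for each client zone, place the (node, variant)
    pair with the lowest estimated latency that still fits in memory."""
    for zone in zones:
        best = min(
            ((n, v) for n in nodes for v in variants if n.free_mb >= v.mem_mb),
            key=lambda nv: serve_latency(nv[0], nv[1], zone),
            default=None,
        )
        if best:
            node, var = best
            node.free_mb -= var.mem_mb
            node.hosted.append(var)


def dispatch(nodes, zone):
    """Dispatch a request from `zone` to the hosted instance with the
    lowest estimated serving latency."""
    candidates = [(serve_latency(n, v, zone), n, v)
                  for n in nodes for v in n.hosted]
    return min(candidates, key=lambda c: c[0]) if candidates else None


if __name__ == "__main__":
    # Illustrative numbers only -- not measurements from the thesis.
    nodes = [
        Node("jetson-nano", free_mb=2000, speed=1.0,
             rtt_ms={"zoneA": 5, "zoneB": 20}),
        Node("jetson-xavier", free_mb=8000, speed=4.0,
             rtt_ms={"zoneA": 15, "zoneB": 5}),
    ]
    variants = [Variant("mobilenet", 600, 30.0), Variant("resnet50", 1800, 90.0)]
    provision(nodes, variants, zones=["zoneA", "zoneB"])
    for zone in ("zoneA", "zoneB"):
        lat, node, var = dispatch(nodes, zone)
        print(f"{zone}: dispatch to {var.name} on {node.name} (~{lat:.1f} ms)")
```

In the actual system described by the abstract, the serving instances would correspond to TensorFlow Serving deployments orchestrated by Kubernetes, and the placement decision would account for networking, computing, and memory resources jointly rather than through this simple greedy rule.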

Description

Thesis (Master's)--University of Washington, 2021
