A Joint Model Provisioning and Request Dispatch Solution for Mobile Inference Serving at the Edge


Authors

Prasad, Anish Nagendra

Abstract

With the advancement of machine learning (ML), a growing number of mobile clients rely on ML inference for making time-sensitive and safety-critical decisions. As a result, high-quality, low-latency inference services at the network edge have become essential to modern intelligent applications. This thesis proposes a novel solution that jointly provisions inference models and dispatches inference requests to reduce the latency of mobile inference serving on edge nodes. Unlike existing solutions that either direct inference requests to the nearest edge node or balance the workload among edge nodes, the proposed solution provisions each edge node with the optimal type and number of inference serving instances under a holistic consideration of networking, computing, and memory resources. Mobile clients can thus utilize ML inference services on the edge nodes that offer the lowest inference serving latency. In this work, we implement the proposed solution using TensorFlow Serving and Kubernetes on a cluster of edge nodes comprising Nvidia Jetson Nano and Jetson Xavier devices. We further demonstrate the proposed solution's effectiveness in reducing the overall inference latency under various system parameters and practical system settings through simulation and testbed experiments, respectively.
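To make the joint provisioning-and-dispatch idea concrete, the sketch below illustrates one possible (hypothetical) greedy heuristic: edge nodes are provisioned with serving-instance variants subject to a memory budget, and each request is then dispatched to the hosted instance with the lowest estimated network-plus-compute latency. This is not the thesis' actual algorithm or implementation; the node names, capacities, latency numbers, and the heuristic itself are illustrative assumptions only.

```python
"""Minimal, hypothetical sketch of joint provisioning and request dispatch.
All names, capacities, and latencies are illustrative assumptions."""
from dataclasses import dataclass, field


@dataclass
class Variant:               # an inference-serving instance type
    name: str
    mem_mb: int              # memory footprint per instance
    compute_ms: float        # per-request compute time on a reference device


@dataclass
class Node:                  # an edge node (e.g., a Jetson-class device)
    name: str
    free_mb: int             # remaining memory budget
    speed: float             # compute speed relative to the reference device
    rtt_ms: dict             # network round-trip time to each client zone
    hosted: list = field(default_factory=list)


def serve_latency(node: Node, var: Variant, zone: str) -> float:
    """Estimated end-to-end latency: network RTT plus scaled compute time."""
    return node.rtt_ms[zone] + var.compute_ms / node.speed


def provision(nodes, variants, zones):
    """Greedy provisioning: for each client zone, place the (node, variant)
    pair with the lowest estimated latency that still fits in memory."""
    for zone in zones:
        best = min(
            ((n, v) for n in nodes for v in variants if n.free_mb >= v.mem_mb),
            key=lambda nv: serve_latency(nv[0], nv[1], zone),
            default=None,
        )
        if best:
            node, var = best
            node.free_mb -= var.mem_mb
            node.hosted.append(var)


def dispatch(nodes, zone):
    """Dispatch a request from `zone` to the hosted instance with the
    lowest estimated serving latency."""
    candidates = [(serve_latency(n, v, zone), n, v)
                  for n in nodes for v in n.hosted]
    return min(candidates, key=lambda c: c[0]) if candidates else None


if __name__ == "__main__":
    # Illustrative numbers only -- not measurements from the thesis.
    nodes = [
        Node("jetson-nano", free_mb=2000, speed=1.0,
             rtt_ms={"zoneA": 5, "zoneB": 20}),
        Node("jetson-xavier", free_mb=8000, speed=4.0,
             rtt_ms={"zoneA": 15, "zoneB": 5}),
    ]
    variants = [Variant("mobilenet", 600, 30.0), Variant("resnet50", 1800, 90.0)]
    provision(nodes, variants, zones=["zoneA", "zoneB"])
    for zone in ("zoneA", "zoneB"):
        lat, node, var = dispatch(nodes, zone)
        print(f"{zone}: dispatch to {var.name} on {node.name} (~{lat:.1f} ms)")
```

In the actual system described by the abstract, the serving instances would correspond to TensorFlow Serving deployments orchestrated by Kubernetes, and the placement decision would account for networking, computing, and memory resources jointly rather than through this simple greedy rule.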

Description

Thesis (Master's)--University of Washington, 2021
