A Joint Model Provisioning and Request Dispatch Solution for Mobile Inference Serving at the Edge
Authors
Prasad, Anish Nagendra
Abstract
With the advancement of machine learning (ML), a growing number of mobile clients rely on ML inference to make time-sensitive and safety-critical decisions. The demand for high-quality, low-latency inference services at the network edge has therefore become central to the modern intelligent society. This thesis proposes a novel solution that jointly provisions inference models and dispatches inference requests to reduce the latency of mobile inference serving on edge nodes. Unlike existing solutions that either direct inference requests to the nearest edge node or balance the workload across edge nodes, the proposed solution provisions each edge node with the optimal type and number of inference serving instances under a holistic consideration of networking, computing, and memory resources. Mobile clients can thus utilize ML inference services on the edge nodes that offer minimal inference serving latency.
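The dispatch idea described above can be illustrated with a minimal sketch: each candidate edge node is scored by an estimated end-to-end latency (network delay plus per-model serving delay), and a request goes to the node with the smallest total, which is not necessarily the nearest node. All node names, delay figures, and the scoring function here are illustrative assumptions, not the thesis's actual formulation.

```python
# Hypothetical latency-minimizing dispatch sketch. Numbers and names are
# illustrative assumptions for demonstration only.
from dataclasses import dataclass, field

@dataclass
class EdgeNode:
    name: str
    network_delay_ms: float              # client-to-node round-trip estimate
    serving_delay_ms: dict               # model name -> expected serving latency
    provisioned_models: set = field(default_factory=set)

def dispatch(nodes, model):
    """Pick the node hosting `model` with the lowest total latency estimate."""
    candidates = [n for n in nodes if model in n.provisioned_models]
    if not candidates:
        raise LookupError(f"no edge node provisioned with {model!r}")
    return min(candidates,
               key=lambda n: n.network_delay_ms + n.serving_delay_ms[model])

nodes = [
    # The nearer node (5 ms away) serves the model more slowly...
    EdgeNode("jetson-nano", 5.0, {"mobilenet": 40.0}, {"mobilenet"}),
    # ...so the farther but faster node wins: 12 + 15 < 5 + 40.
    EdgeNode("jetson-xavier", 12.0, {"mobilenet": 15.0, "resnet50": 60.0},
             {"mobilenet", "resnet50"}),
]
best = dispatch(nodes, "mobilenet")   # -> jetson-xavier, not the nearest node
```

The example shows why nearest-node dispatch (the baseline the thesis contrasts against) can be suboptimal once serving capacity differs across nodes.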
In this work, we implement the proposed solution using TensorFlow Serving and Kubernetes on a cluster of edge nodes, including an Nvidia Jetson Nano and a Jetson Xavier. We further demonstrate the proposed solution's effectiveness in reducing overall inference latency under various system parameters and practical system settings through simulation and testbed experiments, respectively.
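In a testbed like the one described, a client would typically reach a TensorFlow Serving instance over its REST API at the `/v1/models/<name>:predict` endpoint. The sketch below builds such a request; the host name, port, model name, and input shape are illustrative assumptions, not details taken from the thesis.

```python
# Hedged sketch of invoking a TensorFlow Serving model over REST.
# Host, port, model name, and the placeholder input are assumptions.
import json
from urllib import request

def build_predict_request(host, port, model, instances):
    """Build a POST request for TensorFlow Serving's predict endpoint."""
    url = f"http://{host}:{port}/v1/models/{model}:predict"
    body = json.dumps({"instances": instances}).encode("utf-8")
    return request.Request(url, data=body,
                           headers={"Content-Type": "application/json"})

req = build_predict_request("jetson-xavier.local", 8501, "mobilenet",
                            [[0.0] * 224])  # placeholder input tensor
# resp = request.urlopen(req)  # on a live node, returns {"predictions": [...]}
```

Port 8501 is TensorFlow Serving's default REST port; a dispatcher like the one the thesis proposes would choose the host before the request is issued.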
Description
Thesis (Master's)--University of Washington, 2021
