Efficient Deep Learning for Visual and Textual Data
Authors
Mehta, Sachin
Abstract
Efficient hardware, increased computational power, and smart sensors are powering deep learning and moving intelligence from the cloud to edge devices (e.g., smartphones, smart cameras, and wearables). However, deep neural networks are computationally expensive and difficult to deploy on edge devices, which have limited computational capabilities and tight energy budgets. To enable on-device AI, we focus on developing efficient neural architectures for edge devices. We optimize the basic building blocks of deep neural networks (e.g., convolutional layers and linear transformation functions) and use them to build light-weight, fast, and memory-efficient deep neural architectures. Beyond efficiency, we also study how well our architectures generalize across datasets and tasks.
We start by designing efficient architectures for computer vision tasks. In the first part of the dissertation, we introduce the Efficient Spatial Pyramid (ESP) unit, an efficient alternative to standard convolutional layers in convolutional neural networks (CNNs). Compared to standard convolutions, the ESP unit allows networks to learn representations over a large receptive field with fewer parameters and operations. ESPNet, our efficient architecture built from ESP units, delivers similar or better performance than state-of-the-art efficient neural architectures, such as MobileNets and ShuffleNets, across different tasks (e.g., image classification and object detection) while being two times more power-efficient.
The second part is geared towards improving the performance of efficient CNNs. We introduce a novel and generic convolutional unit, the DiCE unit, built from dimension-wise convolutions and dimension-wise fusion.
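To make the ESP unit's savings concrete, the following back-of-the-envelope sketch (an illustration based on the "reduce, split, transform" description above, not the dissertation's actual code) compares the weight count of a standard 3x3 convolution with an ESP-style decomposition: a 1x1 convolution first reduces M input channels to d = N / K channels, and K parallel dilated 3x3 convolutions, each producing d channels, are concatenated to form the N-channel output. The dilated branches also enlarge the receptive field without adding parameters.

```python
# Hypothetical parameter-count comparison (illustrative, not from the thesis).

def standard_conv_params(in_ch, out_ch, k=3):
    """Weights of a standard k x k convolution (biases ignored)."""
    return in_ch * out_ch * k * k

def esp_unit_params(in_ch, out_ch, branches, k=3):
    """ESP-style unit: a 1x1 reduction to d = out_ch // branches channels,
    followed by `branches` parallel dilated k x k convolutions whose
    d-channel outputs are concatenated."""
    d = out_ch // branches
    reduce_params = in_ch * d                  # 1x1 pointwise reduction
    branch_params = branches * d * d * k * k   # K parallel dilated convs
    return reduce_params + branch_params

if __name__ == "__main__":
    std = standard_conv_params(128, 128)          # 147456 weights
    esp = esp_unit_params(128, 128, branches=4)   # 40960 weights
    print(std, esp, round(std / esp, 1))          # → 147456 40960 3.6
```

With 128 input and output channels and four branches, the ESP-style unit uses roughly 3.6 times fewer weights than the standard convolution it replaces.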
The dimension-wise convolutions apply light-weight convolutional filtering along each dimension of the input tensor, while dimension-wise fusion efficiently combines these dimension-wise representations, allowing the DiCE unit to efficiently encode the spatial and channel-wise information in the input tensor. When DiCE units are stacked to build the DiCENet network, we observe better task-level generalization than state-of-the-art methods, including efficient architectures (e.g., MobileNetv3) constructed with neural architecture search.
Next, we focus on designing efficient architectures for natural language processing tasks. In the third part of this dissertation, we introduce the pyramidal recurrent unit (PRU), a drop-in replacement for widely used LSTMs. PRUs replace the linear transformations in LSTMs with pyramidal and group linear transformations, which enable learning representations in a high-dimensional space with fewer parameters and operations. Beyond efficiency, our quantitative and qualitative analysis shows that PRUs have better gradient coverage than LSTMs, which helps them deliver better performance.
In the fourth part, we introduce a deep and light-weight transformer (DeLighT) that delivers similar or better performance than standard transformer-based models on sequence modeling tasks with significantly fewer parameters and operations. DeLighT allocates parameters more efficiently both (1) within each Transformer block, using the DeLighT transformation, a deep and light-weight transformation, and (2) across blocks, using block-wise scaling, which makes the DeLighT blocks near the input shallower and narrower and those near the output deeper and wider. Our empirical evaluation on machine translation and language modeling tasks shows that DeLighT matches or improves the performance of baseline Transformers with, on average, 2 to 3 times fewer parameters.
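The group linear transformations mentioned above can be sketched as follows. This is a minimal NumPy illustration of the general idea, assuming the simplest variant: the input vector is split into g chunks, each chunk is transformed by its own small weight matrix, and the outputs are concatenated. A dense layer of the same size needs d_in * d_out weights, whereas the grouped version needs g * (d_in/g) * (d_out/g), i.e., g times fewer.

```python
import numpy as np

def group_linear(x, weights):
    """Apply one small weight matrix per input chunk and concatenate
    the per-group outputs (a minimal group linear transformation)."""
    chunks = np.split(x, len(weights))
    return np.concatenate([c @ w for c, w in zip(chunks, weights)])

# Illustrative sizes (not from the dissertation): 8-dim input and output,
# 4 groups, so each group maps 2 dims to 2 dims.
rng = np.random.default_rng(0)
d_in, d_out, g = 8, 8, 4
weights = [rng.standard_normal((d_in // g, d_out // g)) for _ in range(g)]
y = group_linear(rng.standard_normal(d_in), weights)

dense_params = d_in * d_out                 # 64 weights for a dense layer
glt_params = sum(w.size for w in weights)   # 16 weights, i.e. 4x fewer
print(y.shape, dense_params, glt_params)    # → (8,) 64 16
```

Because each group only sees a slice of the input, grouped transforms trade some cross-group mixing for the parameter savings; the pyramidal structure in PRUs and the DeLighT transformation are described as ways to recover that expressiveness while keeping the efficiency.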
Overall, this dissertation introduces neural architectures for efficiently learning generalizable representations from visual and textual data.
Description
Thesis (Ph.D.)--University of Washington, 2021
