Robust Submodular Partitioning and Linear Models of Deep ReLU Networks

Wang, Shengjie

dc.contributor.advisor	Bilmes, Jeffery A
dc.contributor.author	Wang, Shengjie
dc.date.accessioned	2021-08-26T18:08:47Z
dc.date.issued	2021-08-26
dc.date.submitted	2021
dc.description	Thesis (Ph.D.)--University of Washington, 2021
dc.description.abstract	Machine learning models, especially deep neural networks, have achieved great success in numerous real-world tasks. As we achieve better performance with larger models, a major challenge emerges: the cost of training machine learning systems becomes expensive and even prohibitive. Moreover, deep learning models act as black boxes in many applications, with little interpretation of their behavior. In this dissertation, we investigate two problems: 1) partitioning training data into diverse and representative blocks for gradient computation, to improve the efficiency and performance of machine learning models; and 2) decomposing ReLU deep neural networks into a collection of linear models, one per data point, and using these linear models to better understand and improve network performance. For the first part of the thesis, we first investigate the problem of partitioning the training dataset into multiple blocks that are equally diverse. The theoretical abstraction of this problem is called robust submodular partitioning. In robust submodular partitioning, we aim to allocate a set of items into $m$ blocks so that the value of the minimum block under a submodular function is maximized. Robust submodular partitioning promotes the diversity of every block in the partition. It has many applications in training machine learning models, e.g., partitioning data into blocks for distributed training so that the gradients computed for every block are consistent. We study the robust submodular partitioning problem and give an efficient Min-Block Greedy algorithm with a $1/m$ guarantee. We further study an extension of the robust submodular partitioning problem with an additional constraint (e.g., cardinality, multiple matroids, or knapsack) on every block. 
For example, when partitioning data for distributed training, we can add a constraint that the number of samples of each class is the same in each partition block, making the partitioned data balanced. We present two classes of algorithms, Min-Block Greedy based algorithms ($\Omega(1/m)$ bound) and Round-Robin Greedy based algorithms (constant bound), and show that under various constraints, they still have good approximation bounds. We further investigate the robust submodular partitioning problem under a cardinality constraint and apply it to generate high-quality mini-batches for stochastic gradient methods. With computational hardware (e.g., GPUs) getting dramatically faster over time, sampling a mini-batch of data points uniformly at random becomes less practical, as randomly accessing data points from disk can be slow, leading to a bottleneck for modern machine learning systems. In practice, datasets are typically written to disk according to an arbitrarily generated sequence of indices. This makes sequential access in the chosen order possible with low overhead compared to random access. On the other hand, there is a chance that the sequence is poor for training, and since it is fixed over multiple iterations of training, performance can suffer. We prove better bounds for the Min-Block Greedy algorithm in this case and greatly reduce the memory and computation costs by applying hierarchical partitioning. We compare our deterministically generated mini-batch sequences to randomly generated sequences and show that the deterministic sequences significantly beat the mean and worst performance of random sequences, and often outperform the best of the random sequences. For the second part of the thesis, we focus on understanding and improving the ReLU deep network through its decomposition as a linear model for every data point. 
A ReLU deep network (or, more generally, any deep network with piecewise linear activation functions) is essentially a piecewise linear model. Therefore, the model is locally linear around every data point, and the linear model weights are equal to the gradient of the network output with respect to its input data point. Based on this observation, we first introduce the Extended Data Jacobian Matrix (EDJM) as an architecture-independent tool to analyze neural networks at the manifold of interest. For ReLU networks, the EDJM is essentially a collection of linear models for all data points, represented as a matrix. The spectrum of the EDJM is found to be highly correlated with the complexity of the learned functions. After studying the effect of dropout, ensembles, and model distillation using the EDJM, we propose a novel spectral regularization method that improves network performance. However, we note that such a regularization method greatly increases computational costs, limiting its practical usage. Next, we present an efficient regularization method, Jumpout, an improved version of dropout based on linear models of ReLU networks. We discuss three novel insights about dropout for DNNs with ReLUs: 1) dropout encourages each local linear piece of a DNN to be trained on data points from nearby regions; 2) the same dropout rate results in different (effective) deactivation rates for layers with different portions of ReLU-deactivated neurons; and 3) the rescaling factor of dropout causes a normalization inconsistency between training and testing when used together with batch normalization. These insights lead to three simple but nontrivial modifications, resulting in our method “Jumpout.” Jumpout significantly improves the performance of different neural nets on multiple datasets, while introducing negligible additional memory and computation costs. 
Finally, we aim to explain the network behavior based on the linear model for every data point, particularly based on the bias term of the linear model. The gradient of a deep neural network (DNN) w.r.t. the input provides information that can be used to explain the output prediction in terms of the input features and has been widely studied to assist in interpreting DNNs. In a linear model (i.e., $g(x) = wx + b$), the gradient corresponds to the weights $w$. The bias $b$, however, is usually overlooked in attribution methods. We observe that since the bias in a DNN also has a non-negligible contribution to the correctness of predictions, it can also play a significant role in understanding DNN behavior. We propose a backpropagation-type algorithm, “bias back-propagation (BBp),” that starts at the output layer and iteratively attributes the bias of each layer to its input nodes, combining it with the resulting bias term of the previous layer. Together with the backpropagation of the gradient generating $w$, we can fully recover the locally linear model $g(x) = wx + b$. In experiments, we show that BBp can generate complementary and highly interpretable explanations.
dc.embargo.lift	2022-08-26T18:08:47Z
dc.embargo.terms	Restrict to UW for 1 year -- then make Open Access
dc.format.mimetype	application/pdf
dc.identifier.other	Wang_washington_0250E_22762.pdf
dc.identifier.uri	http://hdl.handle.net/1773/47434
dc.language.iso	en_US
dc.rights	CC BY-NC-ND
dc.subject	Deep Learning
dc.subject	Submodular Optimization
dc.subject	Computer science
dc.subject.other	Computer science and engineering
dc.title	Robust Submodular Partitioning and Linear Models of Deep ReLU Networks
dc.type	Thesis
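
The abstract names a Min-Block Greedy algorithm for robust submodular partitioning but does not spell out its steps. The following is a minimal sketch under an assumed rule: repeatedly pick the block whose current value is smallest and give it the unassigned item with the largest marginal gain. The function `min_block_greedy`, the toy `TOPICS` data, and the `coverage` objective are hypothetical names introduced here for illustration; they are not taken from the thesis.

```python
from typing import Callable, Dict, Hashable, List, Set


def min_block_greedy(
    items: List[Hashable],
    m: int,
    f: Callable[[Set[Hashable]], float],
) -> List[Set[Hashable]]:
    """Assign every item to one of m blocks (illustrative sketch only).

    Assumed rule: pick the block with the smallest current value f(block),
    then add the unassigned item with the largest marginal gain for it.
    """
    blocks: List[Set[Hashable]] = [set() for _ in range(m)]
    remaining = list(items)
    while remaining:
        # Index of the currently weakest block (ties go to the lowest index).
        j = min(range(m), key=lambda k: f(blocks[k]))
        base = f(blocks[j])
        # Unassigned item that helps the weakest block the most.
        best = max(remaining, key=lambda v: f(blocks[j] | {v}) - base)
        blocks[j].add(best)
        remaining.remove(best)
    return blocks


# Hypothetical toy data: each item covers a set of "topics".
TOPICS: Dict[str, Set[str]] = {
    "a": {"x", "y"},
    "b": {"x"},
    "c": {"y", "z"},
    "d": {"z"},
}


def coverage(block: Set[Hashable]) -> float:
    """Monotone submodular objective: number of distinct topics covered."""
    covered: Set[str] = set()
    for item in block:
        covered |= TOPICS[item]
    return float(len(covered))


blocks = min_block_greedy(list(TOPICS), m=2, f=coverage)
min_value = min(coverage(b) for b in blocks)  # value of the weakest block
```

On this toy instance both blocks end up covering all three topics, so the weakest block is as diverse as possible. Each step re-evaluates `f` on every block and every remaining item, which is fine for a sketch but would need caching on real datasets.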

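The claim that a ReLU network is locally linear around each data point, with weights equal to the input gradient, can be checked on a tiny example. The sketch below assumes a one-hidden-layer network with scalar output and hypothetical weights (`W1`, `b1`, `w2`, `b2`); it recovers $w$ and $b$ directly from the ReLU activation pattern. It is not the BBp algorithm, which additionally attributes the bias to input features.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(W: Matrix, x: Vector) -> Vector:
    """Matrix-vector product W x."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]


def forward(x: Vector, W1: Matrix, b1: Vector, w2: Vector, b2: float) -> float:
    """Scalar output of a one-hidden-layer ReLU network."""
    hidden = [max(0.0, z + b) for z, b in zip(matvec(W1, x), b1)]
    return sum(w * h for w, h in zip(w2, hidden)) + b2


def local_linear_model(
    x: Vector, W1: Matrix, b1: Vector, w2: Vector, b2: float
) -> Tuple[Vector, float]:
    """Return (w, b) such that g(x') = w . x' + b on the linear region of x."""
    pre = [z + b for z, b in zip(matvec(W1, x), b1)]
    # 1.0 where the ReLU passes its input through, 0.0 where it is deactivated.
    active = [1.0 if p > 0 else 0.0 for p in pre]
    n_hidden = len(W1)
    # Gradient w.r.t. the input: w2^T diag(active) W1.
    w = [
        sum(w2[i] * active[i] * W1[i][j] for i in range(n_hidden))
        for j in range(len(x))
    ]
    # Effective bias: hidden biases of active units, plus the output bias.
    b = sum(w2[i] * active[i] * b1[i] for i in range(n_hidden)) + b2
    return w, b


# Hypothetical weights and input, chosen only for illustration.
W1 = [[1.0, -1.0], [0.5, 2.0]]
b1 = [0.0, -1.0]
w2 = [2.0, -1.0]
b2 = 0.5
x = [1.0, 2.0]

w, b = local_linear_model(x, W1, b1, w2, b2)
linear_output = sum(wi * xi for wi, xi in zip(w, x)) + b
network_output = forward(x, W1, b1, w2, b2)
```

Here `linear_output` and `network_output` agree at `x`, and they keep agreeing at any nearby point that leaves the activation pattern unchanged. Stacking the vectors `w` for all data points gives the kind of matrix the abstract describes for the EDJM.
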
Files

Original bundle

Name: Wang_washington_0250E_22762.pdf
Size: 11.63 MB
Format: Adobe Portable Document Format

Collections

Computer science and engineering