Robust Submodular Partitioning and Linear Models of Deep ReLU Networks

dc.contributor.advisor: Bilmes, Jeffery A
dc.contributor.author: Wang, Shengjie
dc.date.accessioned: 2021-08-26T18:08:47Z
dc.date.issued: 2021-08-26
dc.date.submitted: 2021
dc.description: Thesis (Ph.D.)--University of Washington, 2021
dc.description.abstract: Machine learning models, especially deep neural networks, have achieved great success in numerous real-world tasks. As larger models deliver better performance, a major challenge emerges: the cost of training machine learning systems becomes expensive and even prohibitive. In addition, deep learning models act as black boxes in many applications, with little interpretation of their behavior. In this dissertation, we investigate two problems: 1) partitioning training data into diverse and representative blocks for gradient computation, to improve the efficiency and performance of machine learning models, and 2) decomposing ReLU deep neural networks into a collection of linear models, one per data point, which we use to better understand and improve network performance. For the first part of the thesis, we first investigate the problem of partitioning the training dataset into multiple blocks that are equally diverse. The theoretical abstraction of this problem is called robust submodular partitioning: we aim to allocate a set of items into $m$ blocks so that the value of the minimum block under a submodular function is maximized. Robust submodular partitioning promotes the diversity of every block in the partition. It has many applications in training machine learning models, e.g., partitioning data into blocks for distributed training so that the gradients computed for every block are consistent. We study the robust submodular partitioning problem and give an efficient Min-Block Greedy algorithm with a $1/m$ guarantee. We further study an extension of the robust submodular partitioning problem with an additional constraint (e.g., cardinality, multiple matroids, or knapsack) on every block.
For example, when partitioning data for distributed training, we can add a constraint that the number of samples of each class is the same in each partition block, making the partitioned data balanced. We present two classes of algorithms, Min-Block Greedy based algorithms ($\Omega(1/m)$ bound) and Round-Robin Greedy based algorithms (constant bound), and show that they retain good approximation bounds under various constraints. We further investigate the robust submodular partitioning problem under a cardinality constraint and apply it to generate high-quality mini-batches for stochastic gradient methods. With computational hardware (e.g., GPUs) getting dramatically faster over time, sampling a mini-batch of data points uniformly at random becomes less practical, as randomly accessing data points from disk can be slow, leading to a bottleneck for modern machine learning systems. In practice, datasets are typically written to disk according to an arbitrarily generated sequence of indices. This makes sequential access in the chosen order possible with low overhead compared to random access. On the other hand, the sequence may be poor for training, and since it is fixed over multiple iterations of training, performance can suffer. We prove better bounds for the Min-Block Greedy algorithm in this case and greatly reduce the memory and computation costs by applying hierarchical partitioning. We compare our deterministically generated mini-batch sequences to randomly generated sequences and show that the deterministic sequences significantly beat the mean and worst performance of random sequences, and often outperform the best of the random sequences. For the second part of the thesis, we focus on understanding and improving the ReLU deep network through its decomposition as a linear model for every data point.
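The max-min objective of robust submodular partitioning described above can be illustrated with a minimal sketch. This is not the dissertation's implementation: the greedy rule used here (repeatedly give the currently lowest-valued block the remaining item with the largest marginal gain) is one natural reading of "Min-Block Greedy" and is an assumption, as are the function names `coverage` and `min_block_greedy` and the toy set-coverage objective.

```python
from typing import Callable, Dict, Hashable, Iterable, List, Set


def coverage(item_features: Dict[Hashable, Set[int]]) -> Callable[[Set], int]:
    """Build a set-coverage function f(S) = number of distinct features covered by S.

    Set coverage is a standard example of a monotone submodular function.
    """
    def f(block: Set) -> int:
        covered: Set[int] = set()
        for item in block:
            covered |= item_features[item]
        return len(covered)
    return f


def min_block_greedy(items: Iterable, m: int, f: Callable[[Set], int]) -> List[Set]:
    """Greedy heuristic for max-min partitioning (assumed rule, see lead-in).

    At each step, pick the block with the smallest current value and add the
    remaining item with the largest marginal gain for that block.
    Items must be sortable so that ties are broken deterministically.
    """
    blocks: List[Set] = [set() for _ in range(m)]
    remaining = set(items)
    while remaining:
        # Block with the lowest current value; ties broken by block index.
        j = min(range(m), key=lambda k: (f(blocks[k]), k))
        base = f(blocks[j])
        # Remaining item with the largest marginal gain for block j.
        best = max(sorted(remaining), key=lambda v: f(blocks[j] | {v}) - base)
        blocks[j].add(best)
        remaining.remove(best)
    return blocks


# Toy data: three pairs of redundant items; a diverse partition splits each pair.
features = {"a": {1, 2}, "b": {1, 2}, "c": {3}, "d": {3}, "e": {4}, "f": {4}}
f = coverage(features)
blocks = min_block_greedy(list(features), 2, f)
print(blocks, min(f(b) for b in blocks))
```

On this toy input each redundant pair ends up split across the two blocks, so both blocks cover every feature. Evaluating `f` from scratch at every step keeps the sketch short but is wasteful; a practical version would cache block values and marginal gains.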
A ReLU deep network (or, more generally, a deep network with piecewise linear activation functions) is essentially a piecewise linear model. Therefore, the model is locally linear around every data point, and the linear model weights are equal to the gradient of the network output with respect to its input data point. Based on this observation, we first introduce the Extended Data Jacobian Matrix (EDJM) as an architecture-independent tool to analyze neural networks at the manifold of interest. For ReLU networks, the EDJM is essentially a collection of linear models for all data points, represented as a matrix. The spectrum of the EDJM is found to be highly correlated with the complexity of the learned functions. After studying the effect of dropout, ensembles, and model distillation using the EDJM, we propose a novel spectral regularization method that improves network performance. However, we note that this regularization method greatly increases computational costs, limiting its practical usage. Next, we present Jumpout, an efficient regularization method and an improved version of dropout, based on linear models of ReLU networks. We discuss three novel insights about dropout for DNNs with ReLUs: 1) dropout encourages each local linear piece of a DNN to be trained on data points from nearby regions; 2) the same dropout rate results in different (effective) deactivation rates for layers with different portions of neurons deactivated by ReLU; and 3) the rescaling factor of dropout causes a normalization inconsistency between training and test when used together with batch normalization. These insights lead to three simple but nontrivial modifications, resulting in our method “Jumpout.” Jumpout significantly improves the performance of different neural nets on multiple datasets, while introducing negligible additional memory and computation costs.
Finally, we aim to explain the network behavior based on the linear model for every data point, particularly based on the bias term of the linear model. The gradient of a deep neural network (DNN) w.r.t. the input provides information that can be used to explain the output prediction in terms of the input features and has been widely studied to assist in interpreting DNNs. In a linear model (i.e., $g(x) = wx + b$), the gradient corresponds to the weights $w$. The bias $b$, however, is usually overlooked in attribution methods. We observe that since the bias in a DNN also has a non-negligible contribution to the correctness of predictions, it can also play a significant role in understanding DNN behavior. We propose a backpropagation-type algorithm, “bias back-propagation (BBp),” that starts at the output layer and iteratively attributes the bias of each layer to its input nodes as well as combining the resulting bias term of the previous layer. Together with the backpropagation of the gradient, which generates $w$, we can fully recover the locally linear model $g(x) = wx + b$. In experiments, we show that BBp can generate complementary and highly interpretable explanations.
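The locally linear model $g(x) = wx + b$ mentioned above can be made concrete with a small NumPy sketch. It is not the BBp algorithm, which attributes the bias to individual input features; it only recovers the effective weights and bias of a fully connected ReLU network at one input by freezing the ReLU activation pattern. The function names `relu_net` and `local_linear_model`, the layer sizes, and the random parameters are all assumptions made for illustration.

```python
import numpy as np


def relu_net(x, params):
    """Forward pass of a fully connected net with ReLU on all hidden layers."""
    h = x
    for W, b in params[:-1]:
        h = np.maximum(W @ h + b, 0.0)
    W, b = params[-1]
    return W @ h + b


def local_linear_model(x, params):
    """Return (w, b) such that relu_net(x, params) == w @ x + b at this x.

    Invariant maintained layer by layer: h == w_eff @ x + b_eff.
    With the activation pattern at x fixed, each ReLU acts as a 0/1 mask.
    """
    w_eff = np.eye(x.shape[0])
    b_eff = np.zeros(x.shape[0])
    h = x
    for W, b in params[:-1]:
        pre = W @ h + b
        mask = (pre > 0).astype(pre.dtype)   # activation pattern at x
        w_eff = mask[:, None] * (W @ w_eff)  # zero out rows of inactive units
        b_eff = mask * (W @ b_eff + b)
        h = pre * mask
    W, b = params[-1]
    return W @ w_eff, W @ b_eff + b


# Hypothetical 4 -> 8 -> 8 -> 3 network with random parameters.
rng = np.random.default_rng(0)
sizes = [4, 8, 8, 3]
params = [(rng.normal(size=(n_out, n_in)), rng.normal(size=n_out))
          for n_in, n_out in zip(sizes[:-1], sizes[1:])]
x = rng.normal(size=4)

w, b = local_linear_model(x, params)
# The linear model reproduces the network output at x.
print(np.allclose(w @ x + b, relu_net(x, params)))
```

Here `w` coincides with the Jacobian of the network output with respect to `x` wherever that Jacobian exists, and `b` collects the contribution of all layer biases that the gradient alone does not capture.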
dc.embargo.lift: 2022-08-26T18:08:47Z
dc.embargo.terms: Restrict to UW for 1 year -- then make Open Access
dc.format.mimetype: application/pdf
dc.identifier.other: Wang_washington_0250E_22762.pdf
dc.identifier.uri: http://hdl.handle.net/1773/47434
dc.language.iso: en_US
dc.rights: CC BY-NC-ND
dc.subject: Deep Learning
dc.subject: Submodular Optimization
dc.subject: Computer science
dc.subject.other: Computer science and engineering
dc.title: Robust Submodular Partitioning and Linear Models of Deep ReLU Networks
dc.type: Thesis

Files

Original bundle

Name: Wang_washington_0250E_22762.pdf
Size: 11.63 MB
Format: Adobe Portable Document Format