Submodular Optimization and Data Processing
Data sets are large and getting larger. Two common paradigms, data summarization and data partitioning, are often used to handle big data. Data summarization aims to identify a small subset of the data that attains maximum utility or information, while the goal of data partitioning is to split the data across multiple compute nodes so that the data block residing on each node becomes manageable. In this dissertation, we investigate how to apply submodularity to these two data processing paradigms. In the first part of this thesis, we study the connection of submodularity to the data summarization paradigm. We first show that data summarization subsumes a number of applications, including acoustic data subset selection for training speech recognizers [Wei et al., 2014], genomics assay panel selection [Wei et al., 2016], batch active learning [Wei et al., 2015], image summarization [Tschiatschek et al., 2014], document summarization [Lin and Bilmes, 2012], and feature subset selection [Liu et al., 2013]. Among these tasks, we perform case studies on the first three applications. We show how to apply appropriate submodular set functions to model the utility for these tasks, and formulate the corresponding data summarization task as a constrained submodular maximization problem, which admits an efficient greedy heuristic for optimization [Nemhauser et al., 1978]. To better model the utility function for an underlying data summarization task, we also propose a novel “interactive” setting for learning mixtures of submodular functions. For this interactive learning setting, we propose an algorithmic framework and show that it is effective for both the acoustic data selection and the image summarization tasks. While the simple greedy heuristic already solves constrained submodular maximization efficiently and near-optimally, data summarization tasks may still be computationally challenging at large scale.
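The greedy heuristic of Nemhauser et al. [1978] referenced above can be sketched in a few lines. The following is a minimal illustration, not the dissertation's implementation: the coverage function, item names, and data are toy assumptions chosen because set coverage is a canonical monotone submodular utility.

```python
def greedy_maximize(ground_set, f, k):
    """Greedy maximization of a monotone submodular f under |S| <= k:
    repeatedly add the element with the largest marginal gain."""
    selected = set()
    for _ in range(k):
        best_elem, best_gain = None, 0.0
        for e in ground_set - selected:
            gain = f(selected | {e}) - f(selected)
            if gain > best_gain:
                best_elem, best_gain = e, gain
        if best_elem is None:  # no remaining element improves the objective
            break
        selected.add(best_elem)
    return selected

# Toy example: each item "covers" a set of concepts; the utility of a
# summary is the number of distinct concepts covered (set cover).
items = {
    "a": {1, 2, 3},
    "b": {3, 4},
    "c": {4, 5, 6},
    "d": {1, 6},
}

def coverage(S):
    return len(set().union(*(items[e] for e in S))) if S else 0

summary = greedy_maximize(set(items), coverage, k=2)
# The greedy solution covers all six concepts with two items.
```

For monotone submodular objectives this simple loop is guaranteed to achieve at least a (1 - 1/e) fraction of the optimal value, which is the near-optimality result cited in the text.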
To this end, we introduce a novel multistage algorithmic framework called MultGreed that significantly scales the greedy algorithm to even larger problem instances. We theoretically show that MultGreed performs very close to the greedy algorithm, and empirically demonstrate the significant speedup of MultGreed over the standard greedy algorithm on a number of real-world data summarization tasks. In the second part of this thesis, we connect submodularity to data partitioning. We first propose two novel submodular data partitioning problems that we collectively call Submodular Partitioning. To solve submodular partitioning, we propose several novel algorithmic frameworks (including greedy, majorization-minimization, minorization-maximization, and relaxation algorithms) that not only scale to large datasets but also achieve theoretical approximation guarantees comparable to the state of the art. We show that submodular partitioning subsumes a number of machine learning applications, including load balancing for parallel systems, intelligent data partitioning for parallel training of statistical models, and unsupervised image segmentation. We perform a case study on the last of these applications, demonstrating the appropriate choice of submodular utility model and the corresponding submodular partitioning formulation. Empirical evidence suggests that the submodular partitioning framework is effective for this task.
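One simple greedy strategy for the robust (max-min) flavor of submodular partitioning mentioned above is to repeatedly hand the currently worst-off block its best remaining element. The sketch below assumes a modular (and hence submodular) utility for illustration; the function and data names are hypothetical, not the dissertation's exact algorithms.

```python
def greedy_partition(items, utilities, m):
    """Partition `items` into m blocks, greedily lifting the worst block.

    Each round: find the block with the smallest current utility, then
    give it the remaining element with the largest marginal gain.
    """
    blocks = [set() for _ in range(m)]

    def f(block):  # modular utility: a simple submodular special case
        return sum(utilities[e] for e in block)

    remaining = set(items)
    while remaining:
        worst = min(range(m), key=lambda j: f(blocks[j]))
        best = max(remaining,
                   key=lambda e: f(blocks[worst] | {e}) - f(blocks[worst]))
        blocks[worst].add(best)
        remaining.remove(best)
    return blocks

# Toy load-balancing example: item weights stand in for utilities.
utils = {"x1": 4, "x2": 3, "x3": 2, "x4": 1}
parts = greedy_partition(utils, utils, m=2)
# Both blocks end up with equal utility (5 and 5).
```

For general submodular utilities the marginal-gain computation in the inner loop stays the same; only `f` changes, which is what makes greedy schemes like this easy to adapt across the load-balancing, data-partitioning, and segmentation applications listed above.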
- Electrical engineering