ResearchWorks Archive

Submodularity in Natural Language Processing: Algorithms and Applications


dc.contributor.advisor Bilmes, Jeffrey A en_US
dc.contributor.author Lin, Hui en_US
dc.date.accessioned 2012-09-13T17:25:05Z
dc.date.available 2013-03-13T11:04:54Z
dc.date.issued 2012-09-13
dc.date.submitted 2012 en_US
dc.identifier.other Lin_washington_0250E_10594.pdf en_US
dc.description Thesis (Ph.D.)--University of Washington, 2012 en_US
dc.description.abstract Most natural language processing tasks can be seen as finding an optimal object from a finite set of objects. Often, the object of interest is structured, with combinatorial structures involving trees, matchings, and permutations. This combinatorial nature makes many natural language processing problems challenging to solve. On the other hand, submodularity, also known as the discrete analogue of convexity, makes many combinatorial optimization problems either tractable or approximable where otherwise neither would be possible. Whether submodularity is applicable to natural language processing problems, however, has never been studied before. In this thesis, we fill this gap by exploring submodularity in natural language processing. We show that submodularity is practically useful for many natural language processing tasks since, in addition to giving high-quality approximate solutions to intractable problems, it also completely captures the essence of many practical situations that arise in natural language processing tasks. To do so, we demonstrate the applicability of submodular function optimization to three natural language processing tasks: word alignment for machine translation, optimal corpus creation, and document summarization. In the word alignment task, we show that submodularity naturally arises when modeling word fertility. We moreover cast the word alignment problem as a submodular optimization problem over matroid constraints, which provides a new angle for viewing this problem and essentially generalizes conventional matching-based approaches. In the task of optimal corpus creation, we first show that the state-of-the-art method corresponds to using a greedy algorithm for supermodular maximization with a cardinality constraint, which could perform arbitrarily poorly in theory. Alternatively, we express the problem as a minimization problem over a weighted sum of modular functions and submodular functions. 
We further study algorithms for general submodular function minimization, where we offer the first empirical study of the complexity of the minimum-norm-point algorithm, which is widely accepted as the most practical algorithm for submodular minimization, at the scale of practical interest, and show that on a particular type of submodular function that arises in practice, the minimum-norm-point algorithm's empirical time complexity is as bad as that of the combinatorial algorithms for submodular function minimization. We moreover propose acceleration methods that speed up the minimum-norm-point algorithm dramatically in practice. For the document summarization task, we reveal that many well-established approaches, as well as the evaluation methods, all correspond to submodular function optimization, giving strong evidence that submodularity is a natural fit for summarization tasks. The document summarization task, therefore, can naturally be cast as a budgeted submodular maximization problem. We propose an efficient algorithm for this problem that scales well to the application, and theoretically show that the algorithm is guaranteed to find near-optimal solutions. We further introduce a class of submodular functions that is not only monotone but also models relevance and diversity simultaneously for document summarization. This class of submodular functions is then generalized to a mixture of submodular components, where each component models either relevance or diversity, and the components differ either in functional form or in function parameters. The learning problem of submodular mixtures is also addressed, in which we show that the risk of approximate learning is bounded by the risk of exact learning where exact inference is used. When evaluated on the standard benchmark task for document summarization, namely the Document Understanding Conference (DUC), we achieve the best results ever reported on DUC-2004, DUC-2005, DUC-2006, and DUC-2007. en_US
dc.format.mimetype application/pdf en_US
dc.language.iso en_US en_US
dc.rights Copyright is held by the individual authors. en_US
dc.subject Document summarization; Natural language processing; Submodular function optimization en_US
dc.subject.other Computer science en_US
dc.subject.other Electrical engineering en_US
dc.title Submodularity in Natural Language Processing: Algorithms and Applications en_US
dc.type Thesis en_US
dc.embargo.terms Restrict to UW for 6 months -- then make Open Access en_US
