Distributed Diverging Topic Models: A Novel Algorithm for Large Scale Topic Modeling in Spark

dc.contributor.advisor: De Cock, Martine (en_US)
dc.contributor.author: Marquardt, James Andrew (en_US)
dc.date.accessioned: 2015-02-24T17:28:57Z
dc.date.available: 2015-02-24T17:28:57Z
dc.date.issued: 2015-02-24
dc.date.submitted: 2014 (en_US)
dc.description: Thesis (Master's)--University of Washington, 2014 (en_US)
dc.description.abstract: In their 2001 work Latent Dirichlet Allocation, Blei, Ng, and Jordan proposed the generative model of the same name, which has since become the basis for most research in the field of topic modeling. The model overcame many shortcomings of previous probabilistic models: it allows the inference of topics in documents not present in the learning phase, and it allows for topic mixtures. In the past decade the algorithm for inferring the probabilities associated with the model has been implemented in many different languages, extended to allow topic relationships with other entities such as emotion and document label, and optimized in a variety of ways to allow faster learning. Latent Dirichlet Allocation (LDA) has found applications within a wide variety of disciplines, including digital humanities, computational social science, e-commerce, and government science policy. In short, the numerous advances and applications illustrate the significant influence of the original LDA algorithm. However, in spite of the numerous publications and tools created as a result of LDA, the model suffers from one issue: it is extremely computationally intensive. This shortcoming is so great that the model's utility for datasets on the scale of those mined from the Internet is questionable. Additionally, the topic modeling algorithm often requires a degree of active learning in the form of feedback from a domain expert, which in certain circumstances should ideally be minimized. In this work we present Distributed Diverging Latent Dirichlet Allocation (DD-LDA), a novel algorithm for the creation of topic models based on the original Latent Dirichlet Allocation model. The algorithm takes advantage of recent advances in distributed systems approaches to computation, and demonstrates its utility through decreased time requirements as well as increased model performance via the ability to intelligently determine appropriate model size. (en_US)
dc.embargo.terms: Open Access (en_US)
dc.format.mimetype: application/pdf (en_US)
dc.identifier.other: Marquardt_washington_0250O_14002.pdf (en_US)
dc.identifier.uri: http://hdl.handle.net/1773/27372
dc.language.iso: en_US (en_US)
dc.rights: Copyright is held by the individual authors. (en_US)
dc.subject: Latent Dirichlet Allocation; Spark; Topic Modeling (en_US)
dc.subject.other: Computer science (en_US)
dc.subject.other: computing and software systems (en_US)
dc.title: Distributed Diverging Topic Models: A Novel Algorithm for Large Scale Topic Modeling in Spark (en_US)
dc.type: Thesis (en_US)

Files

Original bundle

Name: Marquardt_washington_0250O_14002.pdf
Size: 565.96 KB
Format: Adobe Portable Document Format