Fuzzy Rough Set Approximations in Large Scale Information Systems

dc.contributor.advisor: De Cock, Martine [en_US]
dc.contributor.author: Asfoor, Hasan M. [en_US]
dc.date.accessioned: 2015-05-11T20:27:19Z
dc.date.available: 2015-05-11T20:27:19Z
dc.date.issued: 2015-05-11
dc.date.submitted: 2015 [en_US]
dc.description: Thesis (Master's)--University of Washington, 2015 [en_US]
dc.description.abstract: Rough set theory is a popular and powerful machine learning tool. It is especially suitable for dealing with information systems that exhibit inconsistencies, i.e., objects that have the same values for the conditional attributes but a different value for the decision attribute. In line with the emerging granular computing paradigm, rough set theory groups objects together based on the indiscernibility of their attribute values. Fuzzy rough set theory extends rough set theory to data with continuous attributes, and detects degrees of inconsistency in the data. Key to this is turning the indiscernibility relation into a gradual relation, acknowledging that objects can be similar to a certain extent. In very large datasets with millions of objects, computing the gradual indiscernibility relation (in other words, the soft granules) is very demanding in terms of both runtime and memory. It is, however, required for the computation of the lower and upper approximations of concepts in the fuzzy rough set analysis pipeline. In this thesis, we present a parallel and distributed solution, implemented on both Apache Spark and Message Passing Interface (MPI), to compute fuzzy rough approximations in very large information systems. Our results show that our parallel approach scales with problem size to information systems with millions of objects. To the best of our knowledge, no other parallel and distributed solutions have been proposed so far in the literature for this problem. We also present two distributed prototype selection approaches that are based on fuzzy rough set theory and couple them with our distributed implementation of the well-known weighted k-nearest neighbors machine learning prediction technique to solve regression problems. In addition, we show how our distributed approaches can be used on the State Inpatient Data Set (SID) and the Medical Expenditure Panel Survey (MEPS) to predict the total healthcare expenses of patients. [en_US]
dc.embargo.terms: Open Access [en_US]
dc.format.mimetype: application/pdf [en_US]
dc.identifier.other: Asfoor_washington_0250O_14189.pdf [en_US]
dc.identifier.uri: http://hdl.handle.net/1773/33133
dc.language.iso: en_US [en_US]
dc.rights: Copyright is held by the individual authors. [en_US]
dc.subject: approximations; big data; fuzzy rough set; machine learning; MPI; Spark [en_US]
dc.subject.other: Computer science [en_US]
dc.subject.other: computer science and engineering [en_US]
dc.title: Fuzzy Rough Set Approximations in Large Scale Information Systems [en_US]
dc.type: Thesis [en_US]
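
Illustrative note (not part of the thesis record): the abstract refers to a gradual indiscernibility relation and to lower and upper approximations without giving formulas. The sketch below shows one common textbook formulation on a single machine, purely to make the terms concrete. The similarity measure (mean range-normalized closeness), the Łukasiewicz implicator and t-norm, and all function names are assumptions chosen for illustration; the thesis may use different operators, and its Spark and MPI implementations are not reproduced here.

```python
def similarity(x, y, ranges):
    """Assumed gradual indiscernibility: mean per-attribute closeness in [0, 1]."""
    per_attr = [max(0.0, 1.0 - abs(a - b) / r) for a, b, r in zip(x, y, ranges)]
    return sum(per_attr) / len(per_attr)


def implicator(a, b):
    """Lukasiewicz implicator (an assumed choice)."""
    return min(1.0, 1.0 - a + b)


def t_norm(a, b):
    """Lukasiewicz t-norm (an assumed choice)."""
    return max(0.0, a + b - 1.0)


def approximations(objects, membership):
    """Return (lower, upper) membership degrees of a concept for every object.

    objects    -- list of tuples of continuous attribute values
    membership -- degree in [0, 1] to which each object belongs to the concept
    """
    n_attrs = len(objects[0])
    ranges = []
    for j in range(n_attrs):
        column = [o[j] for o in objects]
        r = max(column) - min(column)
        ranges.append(r if r > 0 else 1.0)  # avoid division by zero

    lower, upper = [], []
    for x in objects:
        # One row of the similarity relation; all rows together cost O(n^2).
        sims = [similarity(x, y, ranges) for y in objects]
        lower.append(min(implicator(s, m) for s, m in zip(sims, membership)))
        upper.append(max(t_norm(s, m) for s, m in zip(sims, membership)))
    return lower, upper


# Toy data: the first two objects are nearly indiscernible but disagree
# on the concept, so the first object's lower approximation is small.
objects = [(0.0,), (0.1,), (1.0,)]
membership = [1.0, 0.0, 0.0]
lower, upper = approximations(objects, membership)
```

Each object is compared against every other object, so the work grows quadratically with the number of objects. This is the cost the abstract describes as very demanding for datasets with millions of objects.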

Files

Original bundle

Name: Asfoor_washington_0250O_14189.pdf
Size: 1.76 MB
Format: Adobe Portable Document Format