Healthcare Data Mining Using In-database Analytics to Predict Diagnosis of Inflammatory Bowel Disease
Inflammatory Bowel Disease is a life-changing affliction with few known correlative factors and no known causal factors for its two major forms. With the widespread use of electronic medical record systems, and the resulting availability of large, high-dimensional, semi-structured data, finding effective and scalable data mining algorithms to handle such data has become far more important. With these algorithms, one can develop useful predictive and analytical tools for providers and researchers, ultimately to improve the quality of patients' lives. In recent years, there has been growing interest in applying classical Incremental Gradient Descent techniques to convex programming problems because of their rapid convergence and tolerance to noise. The recent development of in-database analytics frameworks that leverage Incremental Gradient Descent algorithms and other user-defined aggregates, as in the Bismarck architecture, facilitates rapid analysis of large, high-dimensional data. In this thesis, we describe the first application of the Bismarck in-database analytics framework in a healthcare setting. Using Bismarck, we applied logistic regression to a four-year set of patient demographic, encounter, and hospital account data and produced predictive risk factors for a cohort of Inflammatory Bowel Disease patients. We also developed a simple, automated model builder framework that supports other cohorts of interest, and we discuss its design. Finally, we outline future steps to extend the algorithms to spatial data analysis and to provide data visualization tools that help providers and researchers gain insight into the correlative factors behind the disease. The challenges of the clinical data set - large, high-dimensional, heterogeneous, with significant amounts of noise - highlight the advantages of the key-value structures that the Bismarck architecture leverages.
The resulting predictive models performed better than random and were built on commodity hardware running an open-source, distributable database engine. Because the Bismarck in-database analytics framework is scalable and parallelizable and is straightforward to extend and modify, the success of our application demonstrates the viability of producing predictive models for other cohorts of interest in similar healthcare settings.
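
As an illustration of the technique named above, the following is a minimal sketch of logistic regression trained by incremental gradient descent over sparse key-value feature records. It is not the thesis implementation and does not use Bismarck or a database: the function names (`igd_step`, `train`, `predict_proba`), the feature keys (`bias`, `f1`, `f2`), the toy rows, and the step size are all assumptions made for this example. The per-row update is written as a separate function because it is roughly the kind of work a transition function would do when the method is expressed as a user-defined aggregate.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(weights, features):
    # Features are a sparse key-value record: {feature_name: value}.
    z = sum(weights.get(k, 0.0) * v for k, v in features.items())
    return sigmoid(z)


def igd_step(weights, features, label, step_size):
    # One incremental gradient step on a single row (label is 0 or 1).
    # Only the keys present in this row are touched.
    error = predict_proba(weights, features) - label
    for k, v in features.items():
        weights[k] = weights.get(k, 0.0) - step_size * error * v
    return weights


def train(rows, epochs=200, step_size=0.5, seed=0):
    # Repeated passes over the rows in shuffled order.
    rng = random.Random(seed)
    rows = list(rows)
    weights = {}
    for _ in range(epochs):
        rng.shuffle(rows)
        for features, label in rows:
            igd_step(weights, features, label, step_size)
    return weights


# Hypothetical toy rows (not clinical data): label is 1 when "f1" is present.
toy_rows = [
    ({"bias": 1.0, "f1": 1.0}, 1),
    ({"bias": 1.0, "f1": 1.0, "f2": 1.0}, 1),
    ({"bias": 1.0, "f2": 1.0}, 0),
    ({"bias": 1.0}, 0),
]

model = train(toy_rows)
p_with_f1 = predict_proba(model, {"bias": 1.0, "f1": 1.0})
p_without_f1 = predict_proba(model, {"bias": 1.0})
```

Storing each row as a dictionary means a record only carries the features it actually has, which is one reason key-value structures suit sparse, heterogeneous data of the kind described above. The learned weights can then be read as a rough ranking of how strongly each feature pushes the predicted risk up or down.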