Machine learning methods for biological hypothesis generation, facilitating new discoveries at lower costs

Abstract

Machine learning methods for biological data have become increasingly popular in recent years, reflecting their transformative applications and their ability to capture the complex patterns and latent variation underlying biological systems. However, many biological measurements are very expensive to produce experimentally. This poses challenges both for biological discovery, by limiting the number of experiments that can practically be conducted, and for data-hungry machine learning methods, which may require massive datasets that are not publicly available. One approach to these challenges is computational simulation with generative machine learning models, which leverage available high-quality data from heterogeneous sources to synthesize additional datapoints for subsequent analyses. These synthesized datapoints can help propose novel biological hypotheses and prioritize existing ones for subsequent testing in an experimental lab. In this thesis, I present three methods for high-quality in silico data generation across three biological domains: genomic time series extrapolation with Sagittarius, high-resolution dense chromatin contact map generation with Capricorn, and approximately-automatically-curated gene network generation using augmented network integration with Gemini. These diverse applications focus on high-cost experimental data, which highlights the immense value of computational datapoint simulation, and on heterogeneous biological measurements, which require methods that account for the diverse inputs and leverage all sources of information to improve the generation process. Finally, I connect each model back to its practical applications in biology, ranging from assisting biological experts in their current work to novel hypothesis generation.

Description

Thesis (Ph.D.)--University of Washington, 2024
