Title: Explainable Machine Learning and Applications in Protein-Ligand Complex Structure Prediction
Author: Sturmfels, Pascal
Contributor: Baker, David
Date: 2024-09-09
File: Sturmfels_washington_0250E_26746.pdf
Handle: https://hdl.handle.net/1773/51872
Description: Thesis (Ph.D.)--University of Washington, 2024

Abstract: This thesis covers two main topics: interpreting machine learning models, and applying machine learning to protein sequences and structures. The first portion deals with feature attribution techniques, which assign per-instance attribution scores that represent a machine learning model as locally linear around a given input. Three methods are proposed: expected gradients, attribution priors, and integrated Hessians. These extend interpretability beyond feature attribution, toward feature interactions and toward training more interpretable models. The second portion deals with training protein language models and how best to design semi-supervised pre-training tasks. Taking inspiration from multiple sequence alignments, it proposes two tasks, profile prediction and seq2msa, that extend language modeling beyond autoregressive and masked language modeling. The third portion deals with applications in protein structure prediction, chiefly predicting the structure of proteins in concert with small molecules. Jointly determining the structure of a protein from its sequence and how small-molecule binding partners dock to that structure remains an open and challenging problem, with applications in biological discovery, virtual screening, and de novo design. This thesis discusses the development of a structure prediction network, RoseTTAFold All-Atom, capable of simultaneous folding and docking, as well as applications enabled by that network.

Format: application/pdf
Language: en-US
License: CC BY
Subjects: Computer science; Biochemistry; Bioinformatics; Computer science and engineering
Type: Thesis