Powering phosphoproteomics with large scale data analysis and machine learning

Barente, Anthony Scott

Files

Barente_washington_0250E_24562.pdf (10.02 MB)

Date

2022-07-14

Abstract

Cells are the fundamental biological units of organisms and constantly change their internal state in response to external stimuli and stresses. A common way in which they do this is through the addition and removal of chemical tags on proteins, which allows cells to exert fine-grained control over protein activity. One of these tags, phosphorylation, stands out for its essential role in signaling cascades. By linking together chains of proteins that turn each other on and off through phosphorylation, cells can build sophisticated networks capable of transforming stimuli into the appropriate biological response. High-throughput tools such as mass spectrometry are ideal for studying phosphorylation, as they can track the dynamics of thousands of modified sites across treatments. In recent years, this technique has become increasingly popular, with the number of submissions to public repositories for mass spectrometry data growing every month. By bringing together multiple phosphorylation studies into one dataset, we have the potential to learn fundamental properties of how phosphopeptides behave across instruments and to improve our assays. In addition to the growing number of datasets, individual phosphoproteomics datasets have continued to grow in size with improvements in sample preparation and data acquisition technologies. While this growth allows more conditions and subjects to be included in a single study, it brings fundamental computational and statistical challenges. Within this thesis, I will present two stories that explore these avenues of research. First, I will present the analysis of a large-scale yeast phosphoproteomics perturbation screen. With this, I will show how comparing phosphosite dynamics across multiple treatments can prioritize targets for further research and provide valuable information about the regulatory relationships between phosphosites.
After this analysis, I will present my efforts to build a centralized resource for designing targeted phosphoproteomics assays. Here I will first present pyAscore, a versatile and fast Python package for performing an essential step in phosphopeptide identification. Then, I will detail an automated and reproducible pipeline for integrating publicly available phosphoproteomics data into a centralized knowledgebase, Phosphopedia 2.0. Finally, I will present work to predict phosphopeptide retention time and charge state from amino acid sequence, which has allowed Phosphopedia 2.0 to move beyond detected phosphopeptides and provide information about any phosphopeptide.