Si, Dong
Yang, Jingjing
2018-11-28
Yang_washington_0250O_19328.pdf
http://hdl.handle.net/1773/42939
Thesis (Master's)--University of Washington, 2018
The function of a protein depends mainly on its exposed surfaces. The protein surface can provide rich information about a protein's function and evolution, and identifying protein surface structure is an essential step in the preliminary stages of drug design. There are two worldwide databases for the three-dimensional (3D) structural data of biological molecules: the Protein Data Bank (PDB) and the Electron Microscopy Data Bank (EMDataBank). The 3D structures in PDB are determined by structural biologists using methods such as X-ray crystallography, nuclear magnetic resonance (NMR) spectroscopy, and cryo-electron microscopy (cryo-EM), which show the location of each atom relative to the others in the molecule. Unlike the atomic models in PDB, the 3D electron density maps in EMDataBank are determined by cryo-electron microscopy, which provides a general description of the protein surface structure and the placement of the helices. Because the 3D structures in PDB are solved atom by atom, PDB supports real-time searches based on sequence, structure, and function. Compared with PDB, searches based on structure similarity in the EMDataBank lag far behind. High computational cost and the need for rotation invariance are the two main challenges for 3D protein surface similarity retrieval. Usually, a global descriptor is used to represent a 3D object in low dimensionality. However, traditional 3D object descriptors do not fully utilize the spatial information in 3D space. To investigate protein function and evolutionary history efficiently, we introduce a global protein surface shape representation called the EMNets descriptor. EMNets provides an effective and accurate way to represent protein surfaces and search them by similarity, and thus contributes to biomedical research.
The method uses a Convolutional Autoencoder (CAE) neural network to learn the geometric information of 3D density maps in a data-driven manner. Our approach is the first to apply neural networks to represent and compare global protein surfaces. It represents a 3D cryo-electron microscopy density map with a descriptor of only 256 numeric variables, called the EMNets descriptor. Based on the EMNets descriptor, we are able to retrieve similar protein surfaces in real time using the K-nearest-neighbor (KNN) strategy. Search results for protein surfaces represented with the EMNets descriptor show high agreement with those of the existing Combinatorial Extension (CE) algorithm for sequence and structure similarity search. Overall, EMNets is a powerful tool for comparing 3D protein structures obtained by cryo-electron microscopy. In the future, a real-time retrieval system can be built on the EMNets descriptors.
application/pdf
en-US
CC BY-NC-ND
Bioinformatics; Machine Learning; Molecular biology; Unsupervised Learning; Computer science; Bioinformatics
Computing and software systems
EMNets: A Convolutional Autoencoder for Protein Surface Retrieval Based on Cryo-Electron Microscopy Imaging
Thesis
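The retrieval step described in the abstract (K-nearest-neighbor search over 256-value descriptors) can be illustrated with a minimal sketch. This is not the thesis implementation: the Euclidean metric, the `knn_search` helper, the `map_XXXX` identifiers, and the random vectors standing in for learned descriptors are all assumptions made for illustration. Only the descriptor length of 256 and the use of KNN come from the record.

```python
import heapq
import math
import random

DESCRIPTOR_DIM = 256  # descriptor length stated in the abstract


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_search(query, database, k=5):
    """Return the k (distance, map_id) pairs closest to the query descriptor."""
    scored = ((euclidean(query, desc), map_id) for map_id, desc in database.items())
    return heapq.nsmallest(k, scored)


# Hypothetical database: random vectors stand in for learned descriptors.
rng = random.Random(0)
database = {
    f"map_{i:04d}": [rng.random() for _ in range(DESCRIPTOR_DIM)]
    for i in range(1000)
}

# Query with a slightly perturbed copy of a known entry.
query = [v + rng.gauss(0.0, 0.01) for v in database["map_0042"]]
results = knn_search(query, database, k=5)
```

Because each density map is reduced to a short fixed-length vector, a brute-force scan like this stays cheap for modest database sizes; a larger collection would likely call for an approximate nearest-neighbor index, which is outside what the record describes.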