Protein structure determination using evolutionary information
For billions of years, nature has been conducting the greatest experiment of all time. Imagine one day gaining access to the detailed notes from these experiments. Today, with worldwide expeditions to collect samples from all habitats, single-cell sequencing of unculturable microbes, and a rapid drop in sequencing cost, we can finally tap into nature and gain access to these notes. Natural selection acts upon a gene to optimize its sequence to perform a task. For protein-coding genes, the task includes folding, stability, and function. The record of this evolutionary process, which is itself probabilistic, is contained within a multiple sequence alignment. A statistical model that accurately describes these evolutionary constraints for a given gene, or a set of genes, should allow inference of the underlying physical molecular structure and interactions. Recently, a global statistical model of a protein family that captures both conservation and coevolution patterns in the family was shown to possess this quality. The strength of the coevolution term is correlated with residue-residue contacts in three-dimensional space. This means that the information can be used not only to predict contacts in proteins that have no known structure, but also to better understand contacts in known structures. To assess the utility and limitations of the approach, we applied our method (GREMLIN) to predict residue-residue interactions both between proteins and within a protein. These contacts were used to predict the structures of 58 protein families and complexes. Nine of these structures have since been determined with traditional experimental methods, and the corresponding predictions were found to be quite accurate. Most recently, we extended the approach to small protein families by recruiting metagenomic sequences. Using this approach, we provided predictions for over 600 protein families (>10% of PFAM).
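To make the link between coevolution strength and contacts concrete, the following is a minimal sketch of one common way to turn the pairwise terms of such a model into a per-residue-pair contact score. It is an illustration under stated assumptions, not the GREMLIN implementation: the abstract does not say how the coevolution strength is summarized, so the Frobenius norm and the average product correction (APC) used here are assumed choices, and the function name `contact_scores` and the `(L, L, A, A)` array layout are hypothetical.

```python
import numpy as np


def contact_scores(couplings: np.ndarray) -> np.ndarray:
    """Summarize pairwise coupling terms as one score per residue pair.

    `couplings` is assumed (hypothetical layout) to have shape (L, L, A, A):
    for each pair of alignment columns (i, j), an A x A matrix of coupling
    weights over the amino-acid alphabet. Requires L >= 2 and at least one
    non-zero off-diagonal coupling block.
    """
    length = couplings.shape[0]

    # Strength of each pair: Frobenius norm of its A x A coupling matrix.
    raw = np.sqrt((couplings ** 2).sum(axis=(2, 3)))

    # Enforce symmetry and ignore self-pairs.
    raw = 0.5 * (raw + raw.T)
    np.fill_diagonal(raw, 0.0)

    # Average product correction (assumed choice): subtract the part of the
    # score explained by each position's overall tendency to score highly.
    position_mean = raw.sum(axis=1, keepdims=True) / (length - 1)
    overall_mean = raw.sum() / (length * (length - 1))
    corrected = raw - (position_mean * position_mean.T) / overall_mean

    np.fill_diagonal(corrected, 0.0)
    return corrected


if __name__ == "__main__":
    # Synthetic couplings: weak noise everywhere, one strongly coupled pair.
    rng = np.random.default_rng(0)
    demo = rng.normal(scale=0.01, size=(10, 10, 21, 21))
    demo[2, 7] += 1.0
    demo[7, 2] += 1.0
    scores = contact_scores(demo)
    top = np.unravel_index(np.argmax(scores), scores.shape)
    print("highest-scoring pair:", top)
```

In a setting like this, the highest-scoring pairs would be the candidate residue-residue contacts; how many to keep and how to weight them in structure prediction are separate choices not covered by this sketch.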