Motif-based mining of protein sequences
We introduce CASTOR, an automatic, unsupervised system for protein motif discovery and classification. Given amino acid sequences for a group of proteins, CASTOR generates statistically significant motifs and constructs a classification of the proteins by performing motif discovery and refinement in a top-down and recursive manner. The members of each class are likely to share a function, and the motifs associated with the class are likely to account for the function.

We evaluate CASTOR's performance on the G protein-coupled receptor (GPCR) superfamily. The results show that the CASTOR-constructed classification is in better agreement with a manually curated classification than one constructed by another automatic, unsupervised system based on pairwise, global sequence similarity. Furthermore, while manually constructed classifications tend to be hierarchical, the CASTOR-constructed ones that are non-hierarchical suggest that complex functional relationships among classes may be more abundant than expected.

We also apply CASTOR to the mammalian olfactory receptor family, for which very little functional information is available. We infer the potential functional roles associated with the generated motifs and classes by integrating various complex data, such as mutation experiments and ligand binding assays. Among other functional insights gained, we obtain results that support previous hypotheses on structural integrity and post-translational modification. We also propose and provide evidence for a combinatorial molecular mechanism that supports and potentially explains the ligand binding behavior. We additionally define sub-sequences that capture structural features of these receptors and study the motifs present in the sub-sequences.

Finally, we introduce CASTOR+, an automatic, supervised system for protein classification.
CASTOR+ adds new proteins to a pre-existing classification in which each class is associated with specific motifs, such as one generated by CASTOR, by matching selected motifs from that classification against each new protein. We evaluate the performance of CASTOR+ on the GPCR superfamily. In classifying proteins against the bottom level of the manually curated classification, it performs almost as well as an approach based on pairwise, global sequence similarity. Furthermore, it often succeeds where the other approach fails, when the new proteins have no close homologues in the pre-existing classification.
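
The idea of assigning a new protein to a class by matching class-specific motifs can be illustrated with a minimal sketch. This is not the CASTOR+ algorithm, whose motif selection and scoring are not described here. It assumes, purely for illustration, that motifs are written as regular expressions and that a protein goes to the class with the highest fraction of matched motifs; the function name `classify`, the class names, and the toy motifs are all hypothetical.

```python
import re


def classify(sequence, class_motifs):
    """Return the class whose motifs best match `sequence`, or None.

    Assumption (not from the source): the score of a class is the
    fraction of its motifs found anywhere in the sequence.
    """
    best_class, best_score = None, 0.0
    for name, motifs in class_motifs.items():
        if not motifs:
            continue  # a class without motifs cannot be scored
        hits = sum(1 for motif in motifs if re.search(motif, sequence))
        score = hits / len(motifs)
        if score > best_score:  # ties keep the earlier class
            best_class, best_score = name, score
    return best_class


# Hypothetical toy classification: two classes, two regex motifs each.
toy_classes = {
    "class_A": [r"DRY", r"NP..Y"],
    "class_B": [r"C.{2}C", r"WW"],
}

print(classify("MKDRYLLNPAAYTT", toy_classes))
print(classify("GGGG", toy_classes))
```

Because the score depends only on local motif occurrences, a sketch of this kind can in principle place a protein that has no globally similar sequence in the existing classes, which is the situation the abstract highlights.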