Accurate annotation of non-coding RNAs in practical time
Several times each year, one thousand computers at the Sanger Institute in England spend two weeks updating the Rfam Database. This intensive effort is needed to support recent surprising discoveries showing that RNAs are much more powerful and biologically significant than previously realized, discoveries that upset decades of assumptions in molecular biology.

The Rfam Database is a collection of functional RNAs not coding for proteins, the so-called non-coding RNAs (ncRNAs). The Rfam Database groups ncRNAs into evolutionarily related families, and searches 8 billion nucleotides of genome sequences for new ncRNAs that are members of these families.

This search is done with a Covariance Model (CM), a statistical model based on probabilistic context-free grammars. Although CMs have excellent accuracy, they are infeasibly slow: a pure CM-based implementation of the Rfam Database would require 10,000 CPU years. So, Rfam uses an ad hoc heuristic (based on BLAST, a popular program not specialized for RNAs) to reduce this time to 2,000 CPU weeks, at an unknown cost to sensitivity.

This dissertation work significantly improves on CMs by designing CM-based algorithms that are roughly one hundred times faster, yet preserve all or most of the CM's sensitivity. All of these algorithms filter sequences, eliminating unpromising sub-sequences, and running the slow CM only on the most promising sub-sequences.

One class of filter, the rigorous filter, guarantees that it will never eliminate sub-sequences that the CM would recognize as an ncRNA. In other words, the use of a rigorous filter cannot compromise sensitivity. Such a filter is unusual in computational biology, where filters typically make no guarantees at all. Our basic rigorous filters use probabilistic regular grammars, with linear inequalities on rule scores guaranteeing rigorousness.
More powerful classes of filter exploit limited secondary structure to gain discriminative power without unduly compromising speed.

We further develop a class of heuristic filters that scan faster at a modest cost to sensitivity, a desirable trade-off in many contexts. This dissertation empirically measures speed and sensitivity of heuristic filters on real biological data, providing an objective analysis of various filters' actual performance.

These techniques allow Rfam Database searches in roughly the same time as the current solution, but yield new ncRNAs missed by the ad hoc filters that were necessary for practical CM searches until now. These techniques were applied in collaboration with experimental biologists. Among other contributions, these searches (1) led to the first discovery of a naturally occurring RNA (a glycine-binding "riboswitch") that uses cooperative binding, a sophisticated biochemical mechanism previously known only in proteins, and (2) assisted in finding 6S RNA in virtually all groups of bacteria, whereas 6S had been known only in the gamma-proteobacteria group for roughly 30 years.
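
The rigorous-filter idea described above can be illustrated with a deliberately tiny sketch. It is not the dissertation's algorithm: the four-nucleotide window, the scoring tables, and every function name below are hypothetical, and a real CM is far richer than this toy "slow" model. The sketch only shows the principle. A cheap position-by-position score is built so that, for every base pair, the inequality left[a] + right[b] >= pair(a, b) holds. The cheap score is then an upper bound on the slow score, so discarding windows whose cheap score falls below the threshold can never discard a window the slow model would accept.

```python
from itertools import product

ALPHABET = "ACGU"
WINDOW = 4  # hypothetical toy motif: positions 0 and 3 pair, 1 and 2 are unpaired


def pair_score(a, b):
    """Toy score for a base pair (stand-in for a structure-aware rule)."""
    if {a, b} == {"G", "C"}:
        return 3.0
    if {a, b} == {"A", "U"}:
        return 2.0
    return -2.0


# Toy scores for the two unpaired positions.
SINGLE = {
    1: {"A": 1.0, "C": -1.0, "G": -1.0, "U": 0.0},
    2: {"A": 1.0, "C": 0.0, "G": -1.0, "U": -1.0},
}


def slow_score(window):
    """Structure-aware score: this plays the role of the expensive model."""
    return pair_score(window[0], window[3]) + SINGLE[1][window[1]] + SINGLE[2][window[2]]


def build_profile():
    """Position-only scores satisfying left[a] + right[b] >= pair_score(a, b)."""
    left = {a: max(pair_score(a, b) for b in ALPHABET) for a in ALPHABET}
    right = {b: 0.0 for b in ALPHABET}
    return [left, SINGLE[1], SINGLE[2], right]


PROFILE = build_profile()


def filter_score(window, profile=PROFILE):
    """Cheap score that ignores pairing; an upper bound on slow_score."""
    return sum(profile[k][ch] for k, ch in enumerate(window))


def scan(seq, threshold):
    """Run the slow model only on windows that survive the filter."""
    hits = []
    for start in range(len(seq) - WINDOW + 1):
        window = seq[start:start + WINDOW]
        if filter_score(window) < threshold:
            continue  # safe to skip: slow_score cannot reach the threshold
        score = slow_score(window)
        if score >= threshold:
            hits.append((start, score))
    return hits
```

Calling `scan("GAACCUAGGAUCACGU", 4.0)` returns the same hits as scoring every window with `slow_score`, while skipping the slow computation wherever the bound already rules a window out. The choice left[a] = max over b of pair(a, b), right[b] = 0 is simply the easiest way to satisfy the inequalities; tighter choices would filter more aggressively while remaining rigorous.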