Search algorithms for biosequences using random projection
The recent explosion in the availability of long contiguous genomic sequences, including the complete genomes of several organisms, poses substantial challenges for bioinformatics. In particular, algorithms must be developed for annotating biologically meaningful features in multimegabase DNA sequences by observing their similarity either to known genes, regulatory sites, and other features or to conserved copies of the same features in an equally long sequence from another organism. Annotation on such a large scale must be both computationally efficient and sensitive enough to recover subtle but significant features that would otherwise be lost in a mass of unannotated, and hence undifferentiated, sequence.
This work explores algorithms for biosequence annotation that use random projection, a technique borrowed from high-dimensional computational geometry. Random projection reduces computationally challenging problems of inexact string matching to a series of more tractable exact matching problems in exchange for a formally quantifiable and practically small loss in sensitivity. Applied to biosequences, the technique permits efficient comparison of very long sequences to discover local alignments corresponding to meaningful features, including some that are practically inaccessible to existing annotation tools. Specific applications that benefit from random projection's increased sensitivity, efficiency, or both include comparisons of long orthologous sequences, whole-genome repeat finding, and discovery of regulatory motifs.
Highlights of the thesis include: the LSH-ALL-PAIRS algorithm for discovering all high-scoring ungapped local alignments between pairs of substrings of one or more long sequences, without relying on the presence of long exact matches in these alignments; the PROJECTION motif finding algorithm, which extends random projection to a multiple alignment context and discovers motifs that are inaccessible to existing motif finders; and the development of score simulation, a theory for building sparse indices or fingerprints of a biosequence such that the probability that two sequences' fingerprints match increases with their similarity under a given alignment score function.
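The reduction from inexact to exact matching described above can be illustrated with a minimal Python sketch. It is not the thesis's LSH-ALL-PAIRS or PROJECTION algorithm; it is a simplified illustration under stated assumptions: similarity is measured by Hamming distance between fixed-length substrings, and the names `random_projection_pairs` and `collision_probability` and all parameters are hypothetical. In each trial, k of the positions in a window are sampled at random, every substring is hashed by the characters at those positions (an exact match on the projection), and only substrings sharing a bucket are compared in full.

```python
import math
import random
from collections import defaultdict
from itertools import combinations


def collision_probability(length, k, d):
    """Chance that two length-mers differing in exactly d positions agree
    on k positions sampled uniformly without replacement."""
    return math.comb(length - d, k) / math.comb(length, k)


def random_projection_pairs(seq, length, k, trials, max_mismatches, seed=0):
    """Return start-position pairs (i, j), i < j, whose length-mers differ
    in at most max_mismatches positions, found via bucket collisions.

    Hypothetical helper for illustration only; pairs that never share a
    bucket in any trial are missed, which is the loss in sensitivity.
    """
    rng = random.Random(seed)
    found = set()
    n_windows = len(seq) - length + 1
    for _ in range(trials):
        # Choose the k positions that define this trial's projection.
        positions = sorted(rng.sample(range(length), k))
        buckets = defaultdict(list)
        for i in range(n_windows):
            key = "".join(seq[i + p] for p in positions)
            buckets[key].append(i)  # exact match on the projected key
        # Verify only candidate pairs that collided in some bucket.
        for starts in buckets.values():
            for i, j in combinations(starts, 2):
                a, b = seq[i:i + length], seq[j:j + length]
                if sum(x != y for x, y in zip(a, b)) <= max_mismatches:
                    found.add((i, j))
    return found


if __name__ == "__main__":
    demo = "ACGTTGCA" + "CCCC" + "ACGTAGCA"  # second copy has one mismatch
    print(random_projection_pairs(demo, length=8, k=4, trials=20,
                                  max_mismatches=1))
    print(collision_probability(8, 4, 1))
```

Under these assumptions, two windows that differ in d positions collide in a single trial with probability C(length - d, k) / C(length, k), so repeating the projection over several independent trials makes the chance of missing a similar pair small while each trial needs only exact-match hashing.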