Search algorithms for biosequences using random projection

Title: Search algorithms for biosequences using random projection
Author: Buhler, Jeremy
Abstract: The recent explosion in the availability of long contiguous genomic sequences, including the complete genomes of several organisms, poses substantial challenges for bioinformatics. In particular, algorithms must be developed for annotating biologically meaningful features in multimegabase DNA sequences either by observing their similarity to known genes, regulatory sites, and other features or to conserved copies of the same features in an equally long sequence from another organism. Annotation on such a large scale must be both computationally efficient and sensitive enough to recover subtle but significant features that would otherwise be lost in a mass of unannotated, and hence undifferentiated, sequence.

This work explores algorithms for biosequence annotation that use random projection, a technique borrowed from high-dimensional computational geometry. Random projection reduces computationally challenging problems of inexact string matching to a series of more tractable exact matching problems in exchange for a formally quantifiable and practically small loss in sensitivity. Applied to biosequences, the technique permits efficient comparison of very long sequences to discover local alignments corresponding to meaningful features, including some that are practically inaccessible to existing annotation tools. Specific applications that benefit from random projection's increased sensitivity and/or efficiency include comparisons of long orthologous sequences, whole-genome repeat finding, and discovery of regulatory motifs.

Highlights of the thesis include: the LSH-ALL-PAIRS algorithm for discovering all high-scoring ungapped local alignments between pairs of substrings of one or more long sequences, without relying on the presence of long exact matches in these alignments; the PROJECTION motif finding algorithm, which extends random projection to a multiple alignment context and discovers motifs that are inaccessible to existing motif finders; and the development of score simulation, a theory for building sparse indices or fingerprints of a biosequence such that the probability that two sequences' fingerprints match increases with their similarity under a given alignment score function.
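
The reduction from inexact to exact matching described in the abstract can be illustrated with a small sketch. This is not the thesis's LSH-ALL-PAIRS or PROJECTION algorithm; it is a minimal toy version, assuming Hamming distance between fixed-length substrings (l-mers) as the similarity measure. The function names and the parameters `l`, `k`, `max_mismatches`, and `trials` are hypothetical choices made for illustration. Each trial picks `k` random positions, groups l-mers whose characters agree exactly at those positions, and then verifies only the pairs that share a group.

```python
import random
from collections import defaultdict
from itertools import combinations
from math import comb


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def collision_probability(l, d, k):
    """Chance that k positions sampled without replacement from l
    avoid all d mismatching positions: C(l - d, k) / C(l, k)."""
    return comb(l - d, k) / comb(l, k)


def similar_lmer_pairs(seq, l, k, max_mismatches, trials, seed=0):
    """Return start-offset pairs (i, j), i < j, whose l-mers differ in at
    most max_mismatches positions, found via repeated random projections.
    Illustrative sketch only; some qualifying pairs may be missed."""
    rng = random.Random(seed)
    found = set()
    for _ in range(trials):
        # One random projection: keep only k of the l positions.
        positions = sorted(rng.sample(range(l), k))
        buckets = defaultdict(list)
        for i in range(len(seq) - l + 1):
            # Exact-match key: the l-mer's characters at the sampled positions.
            key = "".join(seq[i + p] for p in positions)
            buckets[key].append(i)
        # Only l-mers that collide under the projection are compared in full.
        for offsets in buckets.values():
            for i, j in combinations(offsets, 2):
                if hamming(seq[i:i + l], seq[j:j + l]) <= max_mismatches:
                    found.add((i, j))
    return found
```

Under these assumptions, two l-mers that differ in `d` positions land in the same bucket in a single trial with probability C(l - d, k) / C(l, k); for example, `collision_probability(8, 1, 4)` is 0.5. Repeating the projection over independent trials drives the chance of missing a similar pair down geometrically, which is one way to read the abstract's "formally quantifiable" loss in sensitivity.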
Description: Thesis (Ph. D.)--University of Washington, 2001
URI: http://hdl.handle.net/1773/6919
Author requested restriction: Manuscript available on the University of Washington campuses and via UW NetID. Full text may be available via ProQuest's Dissertations and Theses Full Text database or through your local library's interlibrary loan service.

Files: 3022813.pdf (7.871 MB, PDF)
