Search algorithms for biosequences using random projection

Buhler, Jeremy

dc.contributor.author	Buhler, Jeremy	en_US
dc.date.accessioned	2009-10-06T16:52:37Z
dc.date.available	2009-10-06T16:52:37Z
dc.date.issued	2001	en_US
dc.description	Thesis (Ph. D.)--University of Washington, 2001	en_US
dc.description.abstract	The recent explosion in the availability of long contiguous genomic sequences, including the complete genomes of several organisms, poses substantial challenges for bioinformatics. In particular, algorithms must be developed for annotating biologically meaningful features in multimegabase DNA sequences either by observing their similarity to known genes, regulatory sites, and other features or to conserved copies of the same features in an equally long sequence from another organism. Annotation on such a large scale must be both computationally efficient and sensitive enough to recover subtle but significant features that would otherwise be lost in a mass of unannotated, and hence undifferentiated, sequence. This work explores algorithms for biosequence annotation that use random projection, a technique borrowed from high-dimensional computational geometry. Random projection reduces computationally challenging problems of inexact string matching to a series of more tractable exact matching problems in exchange for a formally quantifiable and practically small loss in sensitivity. Applied to biosequences, the technique permits efficient comparison of very long sequences to discover local alignments corresponding to meaningful features, including some that are practically inaccessible to existing annotation tools.
Specific applications that benefit from random projection's increased sensitivity and/or efficiency include comparisons of long orthologous sequences, whole-genome repeat finding, and discovery of regulatory motifs. Highlights of the thesis include: the LSH-ALL-PAIRS algorithm for discovering all high-scoring ungapped local alignments between pairs of substrings of one or more long sequences, without relying on the presence of long exact matches in these alignments; the PROJECTION motif finding algorithm, which extends random projection to a multiple alignment context and discovers motifs that are inaccessible to existing motif finders; and the development of score simulation, a theory for building sparse indices or fingerprints of a biosequence such that the probability that two sequences' fingerprints match increases with their similarity under a given alignment score function.	en_US
dc.embargo.terms	Manuscript available on the University of Washington campuses and via UW NetID. Full text may be available via ProQuest's Dissertations and Theses Full Text database or through your local library's interlibrary loan service.
dc.format.extent	xix, 173 p.	en_US
dc.identifier.other	b46551013	en_US
dc.identifier.other	48805656	en_US
dc.identifier.other	Thesis 50676	en_US
dc.identifier.uri	http://hdl.handle.net/1773/6919
dc.language.iso	en_US	en_US
dc.rights	Copyright is held by the individual authors.	en_US
dc.rights.uri		en_US
dc.subject.other	Theses--Computer science and engineering	en_US
dc.title	Search algorithms for biosequences using random projection	en_US
dc.type	Thesis	en_US
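
Illustrative note (not part of the catalog record): the abstract describes random projection as reducing inexact string matching to a series of exact matching problems. The Python sketch below shows one way to read that idea; it is not the thesis's LSH-ALL-PAIRS or PROJECTION algorithm. The function names and the parameters `l`, `k`, and `trials` are hypothetical choices made for this example. Each trial picks `k` of the `l` positions in a substring at random, groups substrings that agree exactly at those positions, and reports the groups' members as candidate similar pairs.

```python
import random
from collections import defaultdict


def random_projection_buckets(seq, l, k, rng):
    """Group start positions of all l-mers by their letters at k random positions.

    All names and parameters here are illustrative assumptions, not taken
    from the thesis.
    """
    positions = sorted(rng.sample(range(l), k))
    buckets = defaultdict(list)
    for i in range(len(seq) - l + 1):
        lmer = seq[i:i + l]
        # Exact-match key: only the sampled positions are compared.
        key = "".join(lmer[p] for p in positions)
        buckets[key].append(i)
    return positions, buckets


def candidate_pairs(seq, l, k, trials, seed=0):
    """Collect pairs of l-mer start positions that collide in any trial."""
    rng = random.Random(seed)
    pairs = set()
    for _ in range(trials):
        _, buckets = random_projection_buckets(seq, l, k, rng)
        for starts in buckets.values():
            for a in range(len(starts)):
                for b in range(a + 1, len(starts)):
                    pairs.add((starts[a], starts[b]))
    return pairs


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))
```

Two substrings that collide in a trial agree on at least `k` positions, so they differ in at most `l - k`. A similar pair can be missed in any single trial, which is why several trials are run; this corresponds to the small, quantifiable loss in sensitivity the abstract mentions. In practice each candidate pair would then be checked with a full alignment score.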

Files

Original bundle

Name: 3022813.pdf
Size: 7.87 MB
Format: Adobe Portable Document Format

Collections

Computer science and engineering