Search algorithms for biosequences using random projection

dc.contributor.author: Buhler, Jeremy [en_US]
dc.date.accessioned: 2009-10-06T16:52:37Z
dc.date.available: 2009-10-06T16:52:37Z
dc.date.issued: 2001 [en_US]
dc.description: Thesis (Ph. D.)--University of Washington, 2001 [en_US]
dc.description.abstract: The recent explosion in the availability of long contiguous genomic sequences, including the complete genomes of several organisms, poses substantial challenges for bioinformatics. In particular, algorithms must be developed for annotating biologically meaningful features in multimegabase DNA sequences either by observing their similarity to known genes, regulatory sites, and other features or to conserved copies of the same features in an equally long sequence from another organism. Annotation on such a large scale must be both computationally efficient and sensitive enough to recover subtle but significant features that would otherwise be lost in a mass of unannotated, and hence undifferentiated, sequence. This work explores algorithms for biosequence annotation that use random projection, a technique borrowed from high-dimensional computational geometry. Random projection reduces computationally challenging problems of inexact string matching to a series of more tractable exact matching problems in exchange for a formally quantifiable and practically small loss in sensitivity. Applied to biosequences, the technique permits efficient comparison of very long sequences to discover local alignments corresponding to meaningful features, including some that are practically inaccessible to existing annotation tools. Specific applications that benefit from random projection's increased sensitivity and/or efficiency include comparisons of long orthologous sequences, whole-genome repeat finding, and discovery of regulatory motifs. Highlights of the thesis include: the LSH-ALL-PAIRS algorithm for discovering all high-scoring ungapped local alignments between pairs of substrings of one or more long sequences, without relying on the presence of long exact matches in these alignments; the PROJECTION motif finding algorithm, which extends random projection to a multiple alignment context and discovers motifs that are inaccessible to existing motif finders; and the development of score simulation, a theory for building sparse indices or fingerprints of a biosequence such that the probability that two sequences' fingerprints match increases with their similarity under a given alignment score function. [en_US]
dc.embargo.terms: Manuscript available on the University of Washington campuses and via UW NetID. Full text may be available via ProQuest's Dissertations and Theses Full Text database or through your local library's interlibrary loan service.
dc.format.extent: xix, 173 p. [en_US]
dc.identifier.other: b46551013 [en_US]
dc.identifier.other: 48805656 [en_US]
dc.identifier.other: Thesis 50676 [en_US]
dc.identifier.uri: http://hdl.handle.net/1773/6919
dc.language.iso: en_US [en_US]
dc.rights: Copyright is held by the individual authors. [en_US]
dc.rights.uri: [en_US]
dc.subject.other: Theses--Computer science and engineering [en_US]
dc.title: Search algorithms for biosequences using random projection [en_US]
dc.type: Thesis [en_US]
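
The abstract's central idea, reducing inexact string matching to a series of exact matching problems, can be illustrated with a short sketch. This is not the thesis's LSH-ALL-PAIRS implementation: it assumes plain Hamming distance instead of an alignment score function, and the names `similar_pairs`, `hamming`, `k`, and `trials` are hypothetical, chosen only for illustration. Each trial picks `k` random positions within a window, buckets all windows by the characters at those positions (an exact match on the projection), and then verifies each candidate pair directly.

```python
import random
from collections import defaultdict
from itertools import combinations


def hamming(a, b):
    """Number of mismatched positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def similar_pairs(seq, length, max_mismatches, k, trials, seed=0):
    """Return pairs (i, j), i < j, of window start offsets in `seq` whose
    length-`length` substrings differ in at most `max_mismatches` positions.

    Illustrative sketch only: pairs are discovered by random projection,
    so a true pair can be missed if no trial happens to avoid all of its
    mismatched positions.
    """
    rng = random.Random(seed)
    found = set()
    n_windows = len(seq) - length + 1
    for _ in range(trials):
        # The projection: k of the `length` window positions, chosen at random.
        positions = sorted(rng.sample(range(length), k))
        buckets = defaultdict(list)
        for start in range(n_windows):
            key = "".join(seq[start + p] for p in positions)
            buckets[key].append(start)
        # Windows sharing a bucket match exactly on the projection;
        # they are only candidates, so verify each pair in full.
        for starts in buckets.values():
            for i, j in combinations(starts, 2):
                if hamming(seq[i:i + length], seq[j:j + length]) <= max_mismatches:
                    found.add((i, j))
    return found


if __name__ == "__main__":
    a = "ACGTTGCAAGCT"
    b = "ACGTTGCTAGCT"  # differs from `a` at one position
    seq = "GGGGG" + a + "CCCCCCC" + b + "TTTTT"
    print(sorted(similar_pairs(seq, length=12, max_mismatches=1, k=4, trials=50)))
```

In this sketch, a smaller `k` makes it more likely that a projection avoids every mismatch but also fills the buckets with more false candidates, while more `trials` reduce the chance of missing a true pair. That trade-off is one simple reading of the "formally quantifiable and practically small loss in sensitivity" mentioned in the abstract.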

Files

Original bundle

Name: 3022813.pdf
Size: 7.87 MB
Format: Adobe Portable Document Format