Content-based Similarity Search in DNA Data Storage Systems

Bee, Callista Lavender. Content-based Similarity Search in DNA Data Storage Systems. My University, 2021.

Keywords: DNA computing; DNA digital storage; molecular programming; similarity search; Computer science; Computer science and engineering
Advisor: Ceze, Luis
Date: 2021-03-19
Language: en-US
Type: Thesis
File: Bee_washington_0250E_22399.pdf (application/pdf)
URI: http://hdl.handle.net/1773/46761
License: CC BY
Note: Thesis (Ph.D.)--University of Washington, 2020

Abstract

As global demand for digital storage capacity grows, storage technologies based on synthetic DNA have emerged as a dense and durable alternative to traditional media. Existing approaches leverage robust error correcting codes and precise molecular mechanisms to reliably retrieve specific files from large databases. Typically, files are retrieved using a pre-specified key, analogous to a filename. However, these approaches lack the ability to perform more complex computations over the stored data, such as content-based search. Here, we demonstrate the design, implementation, and evaluation of techniques for executing similarity search in DNA-based databases. By using machine learning to build a predictor of DNA hybridization reactions, we are able to create an encoding from images to DNA sequences that is optimized for similarity search. With this encoding, an encoded query image is most likely to hybridize with targets that are encoded from images visually similar to the query. This allows a query molecule to act as a molecular filter, which can select relevant results from a large database. We perform wetlab experiments with a database of 1.6 million images encoded and synthesized as DNA molecules, and show that our technique produces results which are comparable to those of state-of-the-art electronic implementations of similarity search. By demonstrating that DNA-based systems are capable of both storage and computation, we believe this work will encourage further development of this emerging technology.