Examination of DNA Electronic Properties Utilizing Machine Learning, Statistical, and Modeling Methods
| dc.contributor.advisor | Anantram, Anant M.P. | |
| dc.contributor.author | Wang, Yiren | |
| dc.date.accessioned | 2025-05-12T22:47:44Z | |
| dc.date.issued | 2025-05-12 | |
| dc.date.submitted | 2025 | |
| dc.description | Thesis (Ph.D.)--University of Washington, 2025 | |
| dc.description.abstract | This thesis examines DNA electronic properties using statistical feature extraction, machine learning algorithms, density functional theory (DFT), and the Green's function method. The discrepancies commonly observed between experimental and theoretical results have driven us to develop a comprehensive theoretical simulation and modeling approach that is grounded in experimental outcomes. First, we investigate the effect of counterions and solvent dielectric on the conductance of B-form DNA to understand how they modulate DNA electronic properties. Our simulation results indicate that in dry DNA, the presence of counterions affects electron transmission at the lowest unoccupied molecular orbital (LUMO) energies. In solution, however, counterions play a negligible role in transmission. Using polarizable continuum model calculations, we demonstrate that the transmission is significantly higher at both the highest occupied molecular orbital (HOMO) and LUMO energies in a water environment than in a dry environment. Next, we present a DNA sequence identification system based on the conductance characteristics of short-strand sequences, including a group that differs by a single mismatch. By training a gradient-boosted tree classifier (XGBoost) on 1D conductance histograms, we achieve accuracies ranging from approximately 96% for sequences with a single mismatch to 99.5% otherwise. These accuracies are attained in near real-time with as few as ten single-molecule break junction (SMBJ) measurements rather than hundreds or thousands.
To improve the robustness of the identification system for sequences that differ by a single base mismatch, we propose an approach that combines XGBoost and a convolutional neural network with two input feature representations: 2D conductance probability distributions and conductance probability distributions averaged over the experimental parameters. While the adoption of a 2D probability distribution helps classifier accuracy, we find that averaged conductance probability distributions have a much larger impact and significantly enhance the prediction accuracy. Our quantitative analysis of multiple sequences shows a performance improvement of approximately 10% for all sequences. Another key result that emerges from the method is evidence that lower voltage bias values yield higher classification accuracy. Moreover, to further address the low signal-to-noise ratio inherent in SMBJ measurements, we developed a Piecewise Linear Approximation method, which extracts plateau segments from a noisy time series trace. This method can remove noise more effectively while preserving the useful conductance information. The resulting conductance histograms, which are constructed purely from plateau segments, substantially benefit the training of subsequent machine learning models, which can then focus on learning essential conductance features. Even with a sample size as small as five, our classification system maintains its high accuracy. The enhanced machine learning approach with the Piecewise Linear Approximation method could significantly improve the efficiency and accuracy of SMBJ methods, making them more viable for practical applications. Although the analysis in this thesis is based on time series conductance data from fewer than a hundred DNA/RNA strands and their single base mismatches, our method is generally applicable to other single-molecule electrical data.
We posit that our approach represents an emerging alternative to existing DNA sequence identification methods and should be of use in analyzing single molecule conductance data for sequence identification. | |
| dc.embargo.lift | 2026-05-12T22:47:44Z | |
| dc.embargo.terms | Restrict to UW for 1 year -- then make Open Access | |
| dc.format.mimetype | application/pdf | |
| dc.identifier.other | Wang_washington_0250E_27327.pdf | |
| dc.identifier.uri | https://hdl.handle.net/1773/52977 | |
| dc.language.iso | en_US | |
| dc.rights | CC BY-NC | |
| dc.subject | DNA | |
| dc.subject | Machine Learning | |
| dc.subject | Electrical engineering | |
| dc.subject.other | Electrical and computer engineering | |
| dc.title | Examination of DNA Electronic Properties Utilizing Machine Learning, Statistical, and Modeling Methods | |
| dc.type | Thesis |
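The abstract describes extracting plateau segments from a noisy conductance trace and building conductance histograms from those segments only. The following is a minimal sketch of that general idea, not the thesis's actual Piecewise Linear Approximation method, whose details are not given in this record. The function names, the fixed window length, the slope tolerance, the histogram range, and the synthetic trace are all assumptions made for illustration.

```python
import numpy as np


def extract_plateaus(trace, window=20, slope_tol=0.01):
    """Keep samples from fixed-length windows whose fitted line is nearly flat.

    Hypothetical simplification: each non-overlapping window is fitted with a
    least-squares line, and the window counts as a plateau when the absolute
    slope is below slope_tol. The thesis's real segmentation rule may differ.
    """
    trace = np.asarray(trace, dtype=float)
    x = np.arange(window)
    kept = []
    for start in range(0, len(trace) - window + 1, window):
        segment = trace[start:start + window]
        slope, _intercept = np.polyfit(x, segment, 1)
        if abs(slope) < slope_tol:
            kept.append(segment)
    return np.concatenate(kept) if kept else np.empty(0)


def conductance_histogram(values, bins=50, value_range=(-6.0, 0.0)):
    """Return a normalized 1D histogram (probabilities) and its bin edges."""
    counts, edges = np.histogram(values, bins=bins, range=value_range)
    total = counts.sum()
    probs = counts / total if total > 0 else counts.astype(float)
    return probs, edges


# Synthetic trace (assumed to be in log10 conductance units): a noisy plateau
# followed by a steady decay that should be rejected as non-plateau.
rng = np.random.default_rng(0)
plateau = -3.0 + 0.01 * rng.standard_normal(100)
ramp = np.linspace(-3.0, -6.0, 100)
trace = np.concatenate([plateau, ramp])

flat = extract_plateaus(trace)
probs, edges = conductance_histogram(flat)
```

A histogram vector such as `probs` is the kind of fixed-length feature that a tree-based classifier could be trained on. The choice of window length and slope tolerance controls how aggressively sloped or noisy regions are discarded, and would need tuning against real traces.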
Files
Original bundle
- Name: Wang_washington_0250E_27327.pdf
- Size: 14.07 MB
- Format: Adobe Portable Document Format
