StackBERT-Enhancer: A Dual-Layer BERT-Based Framework for Enhancer Identification and Strength Classification in Genomic Data
Abstract
Enhancers are regulatory DNA sequences whose accurate identification and classification remain a significant challenge: their activity is complex and context-dependent, and traditional computational methods both struggle to model it and lack interpretability. This thesis introduces StackBERT-Enhancer, a deep learning framework that addresses these limitations through two tasks: distinguishing enhancer sequences from non-enhancer sequences, and classifying identified enhancers by activity level. The framework employs multiple transformer-based language models, each trained independently on DNA sequences tokenized with a different k-mer size, so that sequence dependencies are captured at several scales. These models are then combined in a stacking ensemble, which improves classification accuracy, robustness, and generalization and achieves state-of-the-art results of 83.5% in enhancer identification and 99.0% in enhancer strength classification. Training runs on distributed multi-GPU systems, and interpretability is provided by SHapley Additive exPlanations (SHAP) for feature importance and by attention score analysis for sequence motif discovery, linking predictive power to biological insight. The result is a robust and interpretable tool for enhancer analysis, with potential applications in disease modeling and broader biomedical research.
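The multi-scale tokenization and stacking steps described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the overlapping stride-1 windows, the k values 3 to 6, the function names, and the GC-fraction scorer standing in for each fine-tuned BERT model are all assumptions made for illustration only.

```python
from typing import Callable, Dict, List, Sequence


def kmer_tokenize(sequence: str, k: int, stride: int = 1) -> List[str]:
    """Split a DNA sequence into k-mers taken every `stride` bases.

    Overlapping windows (stride=1) are an assumption, not a detail
    confirmed by the abstract.
    """
    if k <= 0 or stride <= 0:
        raise ValueError("k and stride must be positive")
    seq = sequence.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def multi_scale_tokens(sequence: str,
                       k_values: Sequence[int] = (3, 4, 5, 6)) -> Dict[int, List[str]]:
    """Tokenize one sequence once per k-mer size (k values are hypothetical)."""
    return {k: kmer_tokenize(sequence, k) for k in k_values}


def stack_features(sequence: str,
                   base_models: Dict[int, Callable[[List[str]], float]]) -> List[float]:
    """Build the meta-feature vector for a stacking ensemble.

    Each base model sees the sequence tokenized at its own k and returns a
    score; the scores, ordered by k, become the input to a meta-learner.
    """
    return [base_models[k](kmer_tokenize(sequence, k)) for k in sorted(base_models)]


def gc_fraction(tokens: List[str]) -> float:
    """Toy placeholder for a trained base model: fraction of G/C bases."""
    total = sum(len(t) for t in tokens)
    gc = sum(t.count("G") + t.count("C") for t in tokens)
    return gc / total if total else 0.0


if __name__ == "__main__":
    print(kmer_tokenize("ACGTAC", 3))  # ['ACG', 'CGT', 'GTA', 'TAC']
    # One placeholder scorer per k; a real system would use one trained model per k.
    models = {k: gc_fraction for k in (3, 4, 5, 6)}
    print(stack_features("ACGTACGGCC", models))
```

In a stacking setup of this kind, the meta-feature vectors for the training sequences would be fed to a second-level classifier, ideally using out-of-fold base-model predictions so the meta-learner does not see scores produced on data the base models were trained on.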
Description
Thesis (Master's)--University of Washington, 2025
