Natural Language Processing for Education Research: Exploring Strategic Use of Traditional and Large Language Topic Models

Abstract

Engineering education research increasingly relies on qualitative analysis of short, open-ended survey responses to understand student experiences across courses and institutions, but extracting reliable themes from these texts at scale requires methods that balance computational efficiency with interpretive rigor. While Natural Language Processing (NLP) has been applied in education for automated grading and sentiment analysis, its systematic integration with qualitative thematic analysis of short, prompt-guided educational research texts has received limited attention. This dissertation addresses that gap by comparing five topic modeling methods on short student feedback about instructional support and by developing the NLP-Assisted Thematic Analysis framework, a six-stage workflow that embeds domain-expert judgment from data preparation through final validation. Three survey datasets of undergraduate engineering student responses on faculty support, teaching assistant (TA) support, and peer support (1,667, 1,592, and 1,376 responses, respectively; approximately 4,600 in total) were processed through a standardized preprocessing pipeline and evaluated against expert-coded themes. Five methods were compared: k-means clustering, Latent Dirichlet Allocation (LDA), Non-Negative Matrix Factorization (NMF), BERTopic with MiniLM and MPNet sentence embeddings, and zero-shot classification (ZSC). Performance was evaluated using accuracy, macro and weighted F1, topic coherence, and inter-rater reliability (Cohen's κ). Ground truth was established in two ways: (a) a machine-led approach, in which topic model keywords guided manual coding of the data; and (b) a human-led approach, in which a domain expert coded the data independently. Results varied by dataset, with no single method performing best across all three corpora.
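As an aside, the evaluation metrics named above (accuracy, macro-F1, and Cohen's κ for inter-rater reliability) can be computed directly from two label sequences. The sketch below is illustrative, not the dissertation's code, and the example labels are invented; in practice scikit-learn's `accuracy_score`, `f1_score`, and `cohen_kappa_score` implement the same formulas.

```python
# Minimal sketch of the abstract's evaluation metrics: accuracy, macro-F1
# (unweighted mean of per-class F1), and Cohen's kappa (agreement corrected
# for chance). Label values below are hypothetical examples.
from collections import Counter

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    labels = set(y_true) | set(y_pred)
    f1s = []
    for lab in labels:
        tp = sum(t == lab and p == lab for t, p in zip(y_true, y_pred))
        fp = sum(t != lab and p == lab for t, p in zip(y_true, y_pred))
        fn = sum(t == lab and p != lab for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)  # every class weighted equally

def cohens_kappa(y_a, y_b):
    n = len(y_a)
    po = sum(a == b for a, b in zip(y_a, y_b)) / n            # observed agreement
    ca, cb = Counter(y_a), Counter(y_b)
    pe = sum(ca[l] * cb[l] for l in set(ca) | set(cb)) / n**2  # chance agreement
    return (po - pe) / (1 - pe) if pe != 1 else 1.0

# Hypothetical expert codes vs. model assignments for four responses:
coder = ["peer help", "peer help", "study group", "study group"]
model = ["peer help", "study group", "study group", "study group"]
print(round(accuracy(coder, model), 2))       # 0.75
print(round(macro_f1(coder, model), 2))       # 0.73
print(round(cohens_kappa(coder, model), 2))   # 0.5
```

Macro-F1 treats each theme equally regardless of how many responses it covers, which is why it diverges from accuracy on the imbalanced corpora the abstract describes.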
BERTopic with MiniLM embeddings performed best on the concise, low-ambiguity peer support corpus (85% accuracy, 77% macro-F1), with LDA second at 78% accuracy and 67% macro-F1. BERTopic with MPNet embeddings led on faculty support, where over one quarter of responses addressed overlapping themes (76.8% accuracy, 65.7% macro-F1), with NMF close behind on accuracy (76.76% accuracy, 62.82% macro-F1). TA support was the most challenging dataset, owing to higher thematic ambiguity and misalignment between model-generated topics and expert-identified themes. ZSC, applied to the peer support dataset, reached 85% accuracy and 60% weighted F1 when prompts used mainstream language, compared with 82% accuracy and 56% weighted F1 for domain-specific prompts.

The NLP-Assisted Thematic Analysis framework structures domain-expert involvement across six stages of the analysis pipeline, from data preparation through final validation. Expert review consolidated nine algorithmic topics into five research themes, with inter-rater reliability (Cohen's κ) between 0.72 and 0.75 across all three datasets. Targeted interventions, including domain-specific stopword curation, hyperparameter selection, topic-to-theme bridging, and review of algorithmically uncertain responses, improved macro-F1 by up to 14 percentage points; the largest single gain came from BERTopic outlier review on the TA support dataset, which raised macro-F1 from 54.2% to 69.3%. These results establish performance benchmarks for five NLP methods on short educational research text, identify where domain-expert involvement has the greatest impact on accuracy and interpretive quality, and provide the NLP-Assisted Thematic Analysis framework as a reproducible, decision-guided protocol for researchers applying topic modeling to qualitative survey data in education and related fields.
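The topic-to-theme bridging and outlier-review interventions mentioned above can be pictured as an expert-defined mapping plus a manual-review queue. The sketch below is a hypothetical illustration under stated assumptions: the topic IDs, theme names, and confidence threshold are invented, not the dissertation's codebook; the only detail taken from the underlying tools is that BERTopic labels outlier documents with topic -1.

```python
# Hypothetical sketch of "topic-to-theme bridging" with outlier review.
# Theme names and the 0.5 threshold are illustrative assumptions; BERTopic
# assigns topic -1 to documents it treats as outliers.

# Expert-defined consolidation: fine-grained algorithmic topics -> themes.
TOPIC_TO_THEME = {
    0: "availability of support",
    1: "availability of support",
    2: "quality of feedback",
    3: "quality of feedback",
    4: "approachability",
}

def bridge_topics(assignments, threshold=0.5):
    """Map (doc_id, topic, confidence) triples to research themes; route
    outliers (topic -1) and low-confidence assignments to manual review."""
    themed, review_queue = {}, []
    for doc_id, topic, confidence in assignments:
        if topic == -1 or confidence < threshold:
            review_queue.append(doc_id)  # a domain expert codes these by hand
        else:
            themed[doc_id] = TOPIC_TO_THEME[topic]
    return themed, review_queue

# Invented example: r2 is a BERTopic outlier, r3 is low-confidence.
assignments = [("r1", 0, 0.91), ("r2", -1, 0.0), ("r3", 2, 0.34), ("r4", 4, 0.77)]
themed, review_queue = bridge_topics(assignments)
print(themed)        # {'r1': 'availability of support', 'r4': 'approachability'}
print(review_queue)  # ['r2', 'r3']
```

Routing uncertain responses to an expert rather than forcing a theme is what the abstract credits for the largest single macro-F1 gain, so the review queue is where the framework concentrates human effort.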

Description

Thesis (Ph.D.)--University of Washington, 2026
