Emotionally Intelligent Voice Language Models For Mental Health Therapy


Abstract

Conversational assistants for mental health therapy primarily rely on text-based models or cascaded architectures that first transcribe speech to text, a process that discards crucial paralinguistic information. This information bottleneck limits the AI's ability to perceive critical emotional cues and broader psychological states, hindering its capacity to reason over the human voice and provide effective therapy. This thesis details the development of an emotionally intelligent voice language model designed to overcome these limitations. The process began with a systematic evaluation of heuristic-based approaches and end-to-end model architectures, in which automated benchmarking and a human study confirmed that large audio language models provide the most effective foundation for voice understanding and reasoning. Building on these findings, I propose and implement policy optimization methods to fine-tune Qwen2.5-Omni on therapy data. The resulting aligned model demonstrated improved performance in benchmark evaluations, exhibiting emotional intelligence and generating therapeutically relevant responses. By presenting a complete development framework, from architectural validation to targeted alignment, this research establishes a clear and proven roadmap for creating the next generation of efficient and adaptive voice language models for mental health therapy and multimodal conversational AI.

Description

Thesis (Master's)--University of Washington, 2025
