Emotionally Intelligent Voice Language Models For Mental Health Therapy
Abstract
Conversational assistants for mental health therapy primarily rely on text-based models or cascaded architectures that first transcribe speech to text, a process that discards crucial paralinguistic information. This information bottleneck limits the AI's ability to perceive critical emotional cues and broader psychological states, hindering its capacity to reason over the human voice and to provide effective therapy. This thesis details the development of an emotionally intelligent voice language model designed to overcome these limitations. The process began with a systematic evaluation of heuristic-based approaches and end-to-end model architectures, in which automated benchmarking and a human study confirmed that large audio language models provided the most effective foundation for voice understanding and reasoning. Building on these findings, I propose and implement policy optimization methods to fine-tune Qwen2.5-Omni on therapy data. The resulting aligned model demonstrated improved performance in benchmark evaluations, exhibiting emotional intelligence and generating therapeutically relevant responses. By presenting a complete development framework, from architectural validation to targeted alignment, this research establishes a clear and proven roadmap for creating the next generation of efficient and adaptive voice language models for mental health therapy and multimodal conversational AI.
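The abstract describes the alignment step only at a high level, so the following is a minimal, hypothetical sketch of one family of policy optimization methods it could refer to: a REINFORCE-style policy-gradient loop over sampled responses. The base model name, prompt, and reward_fn below are placeholders; the thesis's actual method, reward signal, and Qwen2.5-Omni-specific audio processing are not reproduced here.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder base model: the thesis fine-tunes Qwen2.5-Omni, whose
# audio-capable model class and processor are not reproduced here.
MODEL_NAME = "Qwen/Qwen2.5-0.5B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)


def reward_fn(response: str) -> float:
    """Hypothetical scalar reward. The thesis's actual reward signal
    (e.g., a judge of emotional attunement and therapeutic relevance)
    is not specified in the abstract."""
    return 1.0 if "that sounds" in response.lower() else 0.0


prompts = ["Client: I've been feeling overwhelmed at work lately."]

model.train()
for prompt in prompts:
    enc = tokenizer(prompt, return_tensors="pt")
    prompt_len = enc["input_ids"].shape[1]

    # Sample a candidate therapist response from the current policy.
    out = model.generate(**enc, max_new_tokens=64, do_sample=True,
                         return_dict_in_generate=True)
    full_ids = out.sequences
    response = tokenizer.decode(full_ids[0, prompt_len:],
                                skip_special_tokens=True)

    # Recompute per-token log-probabilities of the sampled sequence
    # with gradients enabled.
    logits = model(full_ids).logits[:, :-1, :]
    logprobs = torch.log_softmax(logits, dim=-1)
    token_logprobs = logprobs.gather(
        -1, full_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    # Log-probability of the response tokens only (positions after prompt).
    response_logprob = token_logprobs[:, prompt_len - 1:].sum()

    # REINFORCE update: increase the likelihood of highly rewarded responses.
    loss = -reward_fn(response) * response_logprob
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

In practice, methods in this family (e.g., PPO or GRPO) add reward baselines, a KL penalty toward the reference model, and batched rollouts; the sketch omits these for brevity.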
Description
Thesis (Master's)--University of Washington, 2025
