Bilkent University
Department of Computer Engineering
CS 590/690 SEMINAR

Multimodal Representation Learning With Discrete Speech Tokens for the Automatic Assessment of Depression Severity

Uğur Can Altun
Master's Student
(Supervisor: Asst. Prof. Hamdi Dibeklioğlu)

Computer Engineering Department
Bilkent University

Abstract: Automating the assessment of depression severity has drawn increasing attention for its potential to enable early diagnosis and intervention. Both unimodal and multimodal approaches have contributed significantly to estimating depression severity from clinical interviews. Yet despite advances in multimodal representation learning, strong text-only baselines often outperform multimodal systems on this task, even though text alone lacks paralinguistic cues such as pitch, timbre, and tonality. To prevent the loss of information carried by the speech signal and to model multimodal signals more efficiently, we propose a speech-first pipeline that tokenizes raw waveforms into discrete units capturing both linguistic content and fine acoustic detail. Our approach yields promising results in assessing depression severity. The study also demonstrates the importance of learning efficient representations that leverage both the semantic and paralinguistic information in speech for accurate assessment of depression severity.

DATE: Monday, November 17 @ 16:10
PLACE: EA 502