Voice-Based Human Relaxation Assessment Using Autoencoder-Driven Anomaly Detection of Calm Speech

Kanji Okazaki; Keiichi Watanuki

doi:10.54941/ahfe1007330

AHFE International

Open Access

Article

Conference Proceedings

Abstract

Speech contains paralinguistic cues that reflect psychophysiological states, making it a promising non-contact signal for monitoring mental well-being. However, supervised stress prediction is often impractical because high-stress utterances are difficult to collect at scale, ethically sensitive, and typically weakly labeled. Additionally, real-world datasets frequently exhibit substantial cohort imbalance and repeated measurements per participant. Therefore, we propose a relaxation-first one-class screening approach. Using speech data collected in Tamura City, Fukushima Prefecture, Japan, we first computed a calm score for each recording using an in-house logistic regression-based emotion estimator. We then validated Tamura as a practical relaxed reference cohort through a participant-aware comparison procedure, employing cluster bootstrapping with matched downsampling of an external cohort. The Tamura cohort exhibited a higher mean calm score than the non-Tamura cohort (mean difference = 10.43; Hedges’ g = 0.34; Welch t-test p = 2.5×10⁻⁹), along with a higher rate of “high-calm” samples under a stringent upper-tail criterion (calm ≥ 95.0). Uncertainty was quantified using bootstrap confidence intervals. Using Tamura as the reference distribution for normal data, we trained a denoising autoencoder on standardized 128-dimensional log-Mel summary features, defining the anomaly score as the reconstruction error. The decision threshold (τ = 0.6484) was calibrated by controlling the false-positive rate on normal validation data. The model demonstrated stable convergence (final loss ≈ 0.36) and produced interpretable, deployment-ready outputs: near-threshold normal samples remained below τ (e.g., 0.6338), whereas clear anomalies exceeded it (e.g., 1.016).
Overall, this study presents a coherent pipeline linking data constraints, imbalance-aware validation, and autoencoder-based deviation detection, offering a practical approach to low-burden, voice-based relaxation screening. However, we emphasize that “calm” is a model-derived proxy and requires further validation against human-grounded assessments.
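
The scoring and thresholding step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. Several details are assumptions that the abstract does not state: the reconstruction error is taken to be a per-sample mean squared error, the threshold is set as an upper quantile of the normal validation scores, and the target false-positive rate of 0.05 is a placeholder. The function names are hypothetical, and `x_hat` stands in for the output of a trained denoising autoencoder, which is not shown here.

```python
import numpy as np


def reconstruction_error(x, x_hat):
    """Per-sample mean squared error between features and reconstructions.

    Assumption: the abstract says only "reconstruction error"; MSE is one
    common choice and may differ from what the authors used.
    """
    x = np.asarray(x, dtype=float)
    x_hat = np.asarray(x_hat, dtype=float)
    return np.mean((x - x_hat) ** 2, axis=1)


def calibrate_threshold(normal_val_scores, target_fpr=0.05):
    """Pick tau as the (1 - target_fpr) quantile of normal validation scores.

    Roughly a fraction target_fpr of normal validation samples will then
    score above tau. The default of 0.05 is a placeholder, not a value
    taken from the paper.
    """
    scores = np.asarray(normal_val_scores, dtype=float)
    return float(np.quantile(scores, 1.0 - target_fpr))


def is_anomaly(scores, tau):
    """Flag samples whose anomaly score strictly exceeds the threshold."""
    return np.asarray(scores, dtype=float) > tau


if __name__ == "__main__":
    # The threshold and the two example scores quoted in the abstract.
    tau = 0.6484
    print(is_anomaly([0.6338, 1.016], tau))  # [False  True]
```

With the values quoted in the abstract, the near-threshold sample (0.6338) falls below τ and is treated as normal, while the clear anomaly (1.016) is flagged. A quantile-based threshold needs only normal data, which suits the one-class setting where stressed speech is scarce.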

Keywords: Voice Biomarker, Relaxation Assessment, Calm Speech, One-class Anomaly Detection, Autoencoder, Log-mel Spectrogram

Volume: Affective and Pleasurable Design