Voice-Based Human Relaxation Assessment Using Autoencoder-Driven Anomaly Detection of Calm Speech
Abstract
Speech contains paralinguistic cues that reflect psychophysiological states, making it a promising non-contact signal for monitoring mental well-being. However, supervised stress prediction is often impractical because high-stress utterances are difficult to collect at scale, ethically sensitive, and typically weakly labeled. In addition, real-world datasets frequently exhibit substantial cohort imbalance and repeated measurements per participant. We therefore propose a relaxation-first, one-class screening approach. Using speech data collected in Tamura City, Fukushima Prefecture, Japan, we first computed a calm score for each recording with an in-house logistic-regression-based emotion estimator. We then validated Tamura as a practical relaxed reference cohort through a participant-aware comparison procedure that combines cluster bootstrapping with matched downsampling of an external cohort. The Tamura cohort exhibited a higher mean calm score than the non-Tamura cohort (mean difference = 10.43; Hedges’ g = 0.34; Welch t-test p = 2.5×10⁻⁹), along with a higher rate of “high-calm” samples under a stringent upper-tail criterion (calm ≥ 95.0). Uncertainty was quantified using bootstrap confidence intervals. Treating Tamura as the reference distribution for normal data, we trained a denoising autoencoder on standardized 128-dimensional log-Mel summary features and defined the anomaly score as the reconstruction error. The decision threshold (τ = 0.6484) was calibrated by controlling the false-positive rate on normal validation data. The model converged stably (final loss ≈ 0.36) and produced interpretable, deployment-ready outputs: near-threshold normal samples remained below τ (e.g., 0.6338), whereas clear anomalies exceeded it (e.g., 1.016).
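The thresholding step described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that the reconstruction error is a per-sample mean squared error and that τ is taken as an upper quantile of the errors on normal validation data. The function names and the target false-positive rate of 0.05 are hypothetical, since the abstract states neither.

```python
import numpy as np

def reconstruction_error(x, x_hat):
    """Per-sample mean squared error between inputs and reconstructions.

    Assumption: the abstract only says "reconstruction error"; MSE over the
    feature dimension is one common choice.
    """
    x, x_hat = np.asarray(x, dtype=float), np.asarray(x_hat, dtype=float)
    return np.mean((x - x_hat) ** 2, axis=1)

def calibrate_threshold(normal_val_errors, target_fpr=0.05):
    """Pick tau so that roughly target_fpr of normal validation errors exceed it.

    target_fpr=0.05 is a hypothetical value, not one reported in the abstract.
    """
    errors = np.asarray(normal_val_errors, dtype=float)
    return float(np.quantile(errors, 1.0 - target_fpr))

def flag_anomalies(errors, tau):
    """Return a boolean array: True where the anomaly score exceeds tau."""
    return np.asarray(errors, dtype=float) > tau

# The two example scores quoted in the abstract, checked against the reported tau.
print(flag_anomalies([0.6338, 1.016], tau=0.6484))
```

With the reported τ = 0.6484, the near-threshold score 0.6338 stays on the normal side and 1.016 is flagged, matching the behavior described in the abstract.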
Overall, this study presents a coherent pipeline linking data constraints, imbalance-aware validation, and autoencoder-based deviation detection, offering a practical approach to low-burden, voice-based relaxation screening. However, we emphasize that “calm” is a model-derived proxy and requires further validation against human-grounded assessments.
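The participant-aware comparison can be sketched in a similar spirit. The code below is an illustration under stated assumptions, not the procedure used in the study. It resamples whole participants rather than individual recordings, and it interprets "matched downsampling" as drawing the same number of participants from the external cohort as from the reference cohort. All function and argument names are hypothetical.

```python
import numpy as np

def hedges_g(a, b):
    """Standardized mean difference with the small-sample (Hedges) correction."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    d = (a.mean() - b.mean()) / np.sqrt(pooled_var)
    correction = 1.0 - 3.0 / (4.0 * (na + nb) - 9.0)
    return float(d * correction)

def cluster_bootstrap_mean_diff(scores_ref, ids_ref, scores_ext, ids_ext,
                                n_boot=1000, seed=0):
    """95% percentile interval for mean(ref) - mean(ext), resampling participants.

    Assumption: the external cohort is downsampled to the reference cohort's
    participant count in every bootstrap replicate.
    """
    rng = np.random.default_rng(seed)
    scores_ref, ids_ref = np.asarray(scores_ref, dtype=float), np.asarray(ids_ref)
    scores_ext, ids_ext = np.asarray(scores_ext, dtype=float), np.asarray(ids_ext)
    # Group recordings by participant so repeated measurements stay together.
    groups_ref = [scores_ref[ids_ref == p] for p in np.unique(ids_ref)]
    groups_ext = [scores_ext[ids_ext == p] for p in np.unique(ids_ext)]
    n_match = min(len(groups_ref), len(groups_ext))
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        pick_ref = rng.integers(0, len(groups_ref), size=n_match)
        pick_ext = rng.integers(0, len(groups_ext), size=n_match)
        mean_ref = np.concatenate([groups_ref[j] for j in pick_ref]).mean()
        mean_ext = np.concatenate([groups_ext[j] for j in pick_ext]).mean()
        diffs[i] = mean_ref - mean_ext
    lo, hi = np.percentile(diffs, [2.5, 97.5])
    return float(lo), float(hi)
```

Resampling at the participant level keeps repeated recordings from the same person together, so the interval is not artificially narrowed by treating correlated recordings as independent.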
Keywords: Voice Biomarker, Relaxation Assessment, Calm Speech, One-Class Anomaly Detection, Autoencoder, Log-Mel Spectrogram
DOI: 10.54941/ahfe1007330
AHFE Open Access