Human–AI Collaboration in Automated X-Ray Screening: Effects of Alarm Types and Reliability Levels on Operator Performance in Subway Security

Xin Zhou

doi:10.54941/ahfe1007456

AHFE International

Open Access

Article

Conference Proceedings

Abstract

Public-transportation X-ray checkpoints increasingly integrate Automated Diagnostic Aid Systems (ADAS) to support threat detection, yet system-level success remains contingent on human operators’ vigilance, decision strategies, and calibrated trust in automation. To improve joint human–AI performance, this study examines how alarm design (i.e., the way the AI provides diagnostic advice) should vary with automation reliability. We experimentally compared three alarm modalities across three system reliability levels (70%, 80%, 90%): binary alarm (“danger/safe”), likelihood alarm (four-level graded advice: “danger/warning/possible-safe/safe”), and automated decision (the system hides “safe” images and forwards only “danger” cases for human review). Twenty-one participants completed X-ray baggage search tasks with target prevalence set at 30%; after quality control, 18 datasets (n=6 per reliability level) were analyzed. Primary objective measures were d′ sensitivity (Signal Detection Theory, SDT) and response time (RT) for target-present and target-absent decisions; subjective measures captured multi-dimensional trust.
Because automated decision triages images and alters the decision space, SDT analyses focused on the binary and likelihood conditions, with alarm type as a within-subjects factor and reliability as a between-subjects factor. A two-way mixed ANOVA revealed a significant main effect of alarm type and a significant alarm-type × reliability interaction (Alarm Type: F=10.88, p<.05; Interaction: F=11.63, p<.05). For binary alarms, operator d′ increased monotonically with ADAS reliability (from 70% to 90%), indicating that categorical cues benefit from high classifier accuracy. For likelihood alarms, d′ improved from 70% to 80% but declined at 90%, suggesting that when the AI is highly accurate, graded, ambiguous messages can impose avoidable decisional complexity and cognitive load, degrading sensitivity relative to simpler cues.
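
For readers less familiar with SDT, the sensitivity index is conventionally defined as d′ = z(hit rate) − z(false-alarm rate). The following is a minimal sketch of that calculation, not the study's analysis code. The function name, the log-linear correction (adding 0.5 to each cell to avoid infinite z-scores), and the example counts are all assumptions made for illustration.

```python
from statistics import NormalDist


def d_prime(hits, misses, false_alarms, correct_rejections):
    """Return d' = z(hit rate) - z(false-alarm rate).

    A log-linear correction (add 0.5 to each count) keeps both rates
    strictly between 0 and 1. This correction is an assumption of the
    sketch; the source does not say how extreme rates were handled.
    """
    z = NormalDist().inv_cdf  # inverse CDF of the standard normal
    hit_rate = (hits + 0.5) / (hits + misses + 1.0)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1.0)
    return z(hit_rate) - z(fa_rate)


# Hypothetical counts for one operator: 100 trials at 30% target prevalence,
# i.e. 30 target-present and 70 target-absent bags.
example = d_prime(hits=24, misses=6, false_alarms=7, correct_rejections=63)
print(f"d' = {example:.2f}")
```

A d′ of zero means the operator cannot distinguish target-present from target-absent bags; larger values indicate better discrimination regardless of response bias.
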
RT analyses did not yield reliable omnibus effects, though patterns were consistent with the interpretation that richer advice requires additional decisional processing, especially for target-absent judgments.
Subjective results complemented the objective pattern. At 70% reliability, participants preferred likelihood alarms, rating them higher on perceived competence/faith/reliability, consistent with the idea that greater transparency and nuance are valuable when automation is imperfect. At 90% reliability, participants expressed the highest trust and willingness to rely on automated decision, reflecting comfort with delegating routine “safe” triage to a highly reliable AI and reserving human involvement for flagged “danger” cases. Across conditions, trust calibration tracked reliability, but critically depended on the alarm form through which the AI conveyed its assessment.
Contributions. (1) We provide an empirical mapping between alarm granularity and automation reliability, demonstrating that the optimal alarm type depends on the AI’s operating performance. (2) We show that graded likelihood cues can enhance sensitivity at lower automation reliability by supporting informed human override, but they can reduce performance at high reliability by adding decisional friction. (3) We integrate SDT-based sensitivity with multi-dimensional trust to articulate actionable design guidance for human–AI teaming in safety-critical screening.
Implications for design. To maximize human–AI system performance, alarm transparency should be matched to system reliability. Likelihood-based alarms are preferable when reliability is modest, as they support human verification and facilitate appropriate criterion setting. When reliability is high, binary or automated-decision modes are recommended to minimize cognitive load and enable efficient triage.
Practically, an adaptive alarm policy that switches alarm type as real-time reliability estimates change may best sustain calibrated trust, operator efficiency, and system-level sensitivity in high-throughput subway screening.
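
One way such an adaptive alarm policy might look is sketched below. The abstract does not specify an implementation, so everything here is hypothetical: the class name, the sliding-window reliability estimate, the 0.85 switching threshold, and the assumption that verified outcomes (for example, from manual bag checks) are available to score the AI's verdicts.

```python
from collections import deque


class AdaptiveAlarmPolicy:
    """Pick an alarm type from a running estimate of AI reliability.

    Hypothetical sketch: the window size and threshold are illustrative
    values, not parameters reported in the study.
    """

    def __init__(self, window=200, high_reliability=0.85):
        self.outcomes = deque(maxlen=window)  # True where the AI was correct
        self.high_reliability = high_reliability

    def record(self, ai_said_danger, truly_danger):
        """Store whether the AI verdict matched the verified outcome."""
        self.outcomes.append(ai_said_danger == truly_danger)

    def reliability(self):
        """Proportion of correct AI verdicts in the window, or None if empty."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def alarm_type(self):
        """Graded advice while reliability is modest or unknown, binary when high."""
        estimate = self.reliability()
        if estimate is None or estimate < self.high_reliability:
            return "likelihood"
        return "binary"


policy = AdaptiveAlarmPolicy(window=50)
policy.record(ai_said_danger=True, truly_danger=True)
print(policy.alarm_type(), policy.reliability())
```

A deployed version would likely need hysteresis around the threshold so that the interface does not flip between alarm types on small fluctuations, and could add a third branch for the automated-decision mode at very high reliability.
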

Keywords: human–AI collaboration, X-ray screening, alarm design
