Human–AI Collaboration in Automated X-Ray Screening: Effects of Alarm Types and Reliability Levels on Operator Performance in Subway Security
Abstract
Public-transportation X-ray checkpoints increasingly integrate Automated Diagnostic Aid Systems (ADAS) to support threat detection, yet system-level success still depends on human operators’ vigilance, decision strategies, and calibrated trust in automation. To improve joint human–AI performance, this study examines how alarm design (i.e., the way the AI provides diagnostic advice) should vary with automation reliability. We experimentally compared three alarm modalities across three system reliability levels (70%, 80%, 90%): binary alarm (“danger/safe”), likelihood alarm (four-level graded advice: “danger/warning/possible-safe/safe”), and automated decision (the system hides “safe” images and forwards only “danger” cases for human review). Twenty-one participants completed X-ray baggage search tasks with target prevalence set at 30%; after quality control, 18 datasets (n=6 per reliability level) were analyzed. Primary objective measures were d′ sensitivity (Signal Detection Theory, SDT) and response time (RT) for target-present and target-absent decisions; subjective measures captured multi-dimensional trust.
Because the automated-decision mode triages images and thereby alters the decision space, SDT analyses focused on the binary and likelihood conditions, with alarm type as a within-group variable and reliability as a between-group variable. A two-way mixed ANOVA revealed a significant main effect of alarm type and a significant alarm-type × reliability interaction (Alarm Type: F=10.88, p<.05; Interaction: F=11.63, p<.05). For binary alarms, operator d′ increased monotonically with ADAS reliability (from 70% to 90%), indicating that categorical cues benefit from high classifier accuracy. For likelihood alarms, d′ improved from 70% to 80% but declined at 90%, suggesting that when the AI is highly accurate, graded, ambiguous messages can impose avoidable decisional complexity and cognitive load, degrading sensitivity relative to simpler cues.
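For readers less familiar with SDT, the d′ measure reported above can be illustrated with a minimal Python sketch. It assumes the standard equal-variance Gaussian model, in which d′ = z(hit rate) − z(false-alarm rate). The function names, the trial counts in the usage lines, and the log-linear correction for extreme rates are illustrative assumptions, not details reported by the study.

```python
from statistics import NormalDist

# Inverse of the standard normal CDF (the z-transform used in SDT).
_z = NormalDist().inv_cdf


def _rates(hits, misses, false_alarms, correct_rejections):
    """Return (hit rate, false-alarm rate) with a log-linear correction.

    Adding 0.5 to each cell keeps both rates strictly between 0 and 1,
    so the z-transform stays finite. This correction is an assumption
    made for the sketch; the abstract does not say how extreme rates
    were handled.
    """
    hit_rate = (hits + 0.5) / (hits + misses + 1)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1)
    return hit_rate, fa_rate


def d_prime(hits, misses, false_alarms, correct_rejections):
    """Sensitivity: d' = z(hit rate) - z(false-alarm rate)."""
    hit_rate, fa_rate = _rates(hits, misses, false_alarms, correct_rejections)
    return _z(hit_rate) - _z(fa_rate)


def criterion(hits, misses, false_alarms, correct_rejections):
    """Response criterion: c = -0.5 * (z(hit rate) + z(false-alarm rate))."""
    hit_rate, fa_rate = _rates(hits, misses, false_alarms, correct_rejections)
    return -0.5 * (_z(hit_rate) + _z(fa_rate))


# Hypothetical operator: 100 bags at 30% target prevalence
# (30 target-present, 70 target-absent). Counts are invented.
sensitivity = d_prime(hits=24, misses=6, false_alarms=7, correct_rejections=63)
bias = criterion(hits=24, misses=6, false_alarms=7, correct_rejections=63)
print(f"d' = {sensitivity:.2f}, c = {bias:.2f}")
```

A larger d′ means the operator separates threat bags from harmless ones more cleanly, independent of how liberal or conservative the response criterion c is.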
RT analyses did not yield reliable omnibus effects, though patterns were consistent with the interpretation that richer advice requires additional decisional processing, especially for target-absent judgments.
Subjective results complemented the objective pattern. At 70% reliability, participants preferred likelihood alarms, rating them higher on perceived competence, faith, and reliability, consistent with the idea that greater transparency and nuance are valuable when automation is imperfect. At 90% reliability, participants expressed the highest trust in, and willingness to rely on, the automated-decision mode, reflecting comfort with delegating routine “safe” triage to a highly reliable AI and reserving human involvement for flagged “danger” cases. Across conditions, trust calibration tracked reliability but depended critically on the alarm form through which the AI conveyed its assessment.
Contributions. (1) We provide an empirical mapping between alarm granularity and automation reliability, demonstrating that the optimal alarm type depends on the AI’s operating performance. (2) We show that graded likelihood cues can enhance sensitivity at lower automation reliability by supporting informed human override, but they can reduce performance at high reliability by adding decisional friction. (3) We integrate SDT-based sensitivity with multi-dimensional trust to articulate actionable design guidance for human–AI teaming in safety-critical screening.
Implications for design. To maximize human–AI system performance, alarm transparency should be matched to system reliability. Likelihood-based alarms are preferable when reliability is modest, as they support human verification and facilitate appropriate criterion setting. When reliability is high, binary or automated-decision modes are recommended to minimize cognitive load and enable efficient triage.
Practically, an adaptive alarm policy that switches alarm type as real-time reliability estimates change may best sustain calibrated trust, operator efficiency, and system-level sensitivity in high-throughput subway screening.
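The adaptive alarm policy suggested above could look roughly like the following Python sketch. It is a hypothetical illustration, not the authors' implementation: the class name, the rolling-window size, the minimum sample count, and the 0.85 switching threshold are all assumptions. The study tested only 70%, 80%, and 90% reliability, so the true crossover point between alarm types is not known from the abstract.

```python
from collections import deque
from typing import Optional


class AdaptiveAlarmPolicy:
    """Choose an alarm type from a rolling estimate of aid reliability.

    All parameters are hypothetical placeholders for illustration.
    """

    def __init__(self, window: int = 200,
                 high_reliability: float = 0.85,
                 min_samples: int = 50):
        # Keep only the most recent `window` verified outcomes.
        self.outcomes = deque(maxlen=window)
        self.high_reliability = high_reliability
        self.min_samples = min_samples

    def record(self, aid_was_correct: bool) -> None:
        """Log whether the aid's advice matched the verified ground truth."""
        self.outcomes.append(bool(aid_was_correct))

    def reliability(self) -> Optional[float]:
        """Return the rolling proportion correct, or None if data are too sparse."""
        if len(self.outcomes) < self.min_samples:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def alarm_type(self) -> str:
        """Graded advice when reliability is modest or unknown, binary when high."""
        estimate = self.reliability()
        if estimate is None or estimate < self.high_reliability:
            return "likelihood"
        return "binary"


# Usage with invented outcomes: 7 correct and 3 incorrect aid decisions.
policy = AdaptiveAlarmPolicy(window=10, min_samples=5)
for correct in [True] * 7 + [False] * 3:
    policy.record(correct)
print(policy.reliability(), policy.alarm_type())
```

The sketch defaults to likelihood alarms whenever the estimate is missing or below the threshold, on the reasoning that graded advice supports human verification when automation quality is uncertain. A deployed policy would also need a source of verified ground truth and some hysteresis to avoid switching alarm types too often; the automated-decision mode could be added as a third tier under similar assumptions.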
Keywords: human–AI collaboration, X-ray screening, alarm design
DOI: 10.54941/ahfe1007456


AHFE Open Access