Exploring User Behavior and Validation Proficiency in Assessing Responses from a Conversational Agent

Open Access
Article
Conference Proceedings
Authors: Jiayin Huang, Jonggi Hong

Abstract: Large language model (LLM) chatbots are rapidly becoming everyday information sources, yet little is known about how ordinary users verify their accuracy, especially in high-stakes domains such as health. This study investigates how, and how well, people validate ChatGPT's answers when they can, or cannot, consult complementary web search results. Understanding these behaviors is essential for designing conversational systems that actively support responsible use rather than amplify misinformation.

We conducted a within-subjects study with fifteen participants (7 women, 8 men, aged 22–27) recruited on a U.S. university campus. The topic space was deliberately unfamiliar but consequential: the 30-item Alzheimer's Disease Knowledge Scale (ADKS). For each item, GPT-3.5-turbo produced a true/false response (93.33% correct, 6.67% intentionally incorrect). Participants completed two phases over Zoom (mean duration ≈ 67 min). Phase 1 displayed only the ChatGPT answer; Phase 2 added ten pre-collected, fully clickable Google snippets beside the same answer. Snippets were retrieved with the Google Custom Search API using (a) the full question and (b) automatically extracted keywords, and BERT-based cosine similarity scores were shown to signal textual overlap (a retrieval-and-scoring sketch appears below).

Behavioral data (selection of "Correct", "Incorrect", or "I'm not sure"), click logs, and per-item decision times were recorded. Validation proficiency was quantified with precision, recall, F1, and the underlying counts of true/false positives and negatives (an analysis sketch appears below). Normality was checked with Shapiro–Wilk tests; paired t-tests or Wilcoxon signed-rank tests were applied accordingly. Semi-structured questionnaires before and after the task captured self-reported search habits and perceived usefulness of the two snippet types; open responses were thematically coded.

Access to search results significantly improved recall, from 0.70 (SD 0.10) in Phase 1 to 0.77 (SD 0.14) in Phase 2 (t(14) = -2.35, p = .034, d = 0.60), so participants overlooked fewer correct answers (false negatives decreased from 8.40 to 6.33). F1 rose modestly from 0.80 to 0.84 (n.s.), while precision showed a non-significant downward trend because false positives increased (1.00 → 1.47, p = .052). Mean validation time per item more than doubled (38.7 s → 82.0 s), indicating higher cognitive effort. Link-click analysis revealed that deeper information gathering correlated positively with precision (r = .53) and with true-negative detections, whereas superficial inspection fostered over-acceptance of incorrect answers. Qualitative feedback confirmed that participants prized authoritative domains (e.g., NIH, Mayo Clinic) and preferred question-based queries over keyword queries (86% vs. 20% rated "very useful"). Nevertheless, two intentionally erroneous ChatGPT statements, about rare recovery and tremor symptoms, remained widely believed, showing that additional context does not automatically resolve misconceptions.

Our findings highlight the limitations of simply adding external information sources without guidance. While users benefited from authoritative links, they still struggled with vague expressions and misunderstood incorrect answers. Future work may explore design solutions such as more structured presentation of search results, interactive validation support, or automated detection of vague or misleading language in LLM output. Additionally, some participants found keyword queries unhelpful, suggesting that query design and information-literacy training may play an important role.
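As a rough illustration of the retrieval-and-scoring pipeline described in the abstract, the sketch below fetches result snippets from the Google Custom Search JSON API and ranks them by cosine similarity between BERT-style sentence embeddings of the ChatGPT answer and each snippet. The API key, search-engine ID, embedding model, helper names, and the sample item are illustrative assumptions; the paper does not disclose its exact implementation.

```python
# Hedged sketch: retrieve Google snippets for an ADKS-style item and score their
# overlap with the ChatGPT answer via cosine similarity of sentence embeddings.
# API_KEY, CX_ID, the model name, and the sample texts are placeholders.
import requests
from sentence_transformers import SentenceTransformer, util

API_KEY = "YOUR_GOOGLE_API_KEY"   # Google Custom Search JSON API key (placeholder)
CX_ID = "YOUR_SEARCH_ENGINE_ID"   # Programmable Search Engine ID (placeholder)


def fetch_snippets(query: str, num: int = 10) -> list[str]:
    """Return up to `num` result snippets from the Custom Search JSON API."""
    resp = requests.get(
        "https://www.googleapis.com/customsearch/v1",
        params={"key": API_KEY, "cx": CX_ID, "q": query, "num": num},
        timeout=10,
    )
    resp.raise_for_status()
    return [item.get("snippet", "") for item in resp.json().get("items", [])]


def score_snippets(answer: str, snippets: list[str]) -> list[tuple[str, float]]:
    """Pair each snippet with its cosine similarity to the ChatGPT answer."""
    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed BERT-style encoder
    answer_emb = model.encode(answer, convert_to_tensor=True)
    snippet_embs = model.encode(snippets, convert_to_tensor=True)
    sims = util.cos_sim(answer_emb, snippet_embs)[0]
    return sorted(zip(snippets, sims.tolist()), key=lambda p: p[1], reverse=True)


if __name__ == "__main__":
    question = "People in their 30s can have Alzheimer's disease."  # illustrative item
    answer = "True. Early-onset Alzheimer's disease can occur in people in their 30s."
    for snippet, sim in score_snippets(answer, fetch_snippets(question)):
        print(f"{sim:.2f}  {snippet[:80]}")
```

In the study, queries were issued both as the full question and as extracted keywords; the same scoring step would apply to either query form.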
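The analysis described in the abstract (per-participant precision, recall, and F1 from confusion counts, Shapiro–Wilk normality checks, and paired t or Wilcoxon signed-rank tests) could be reproduced roughly as follows. The function names, confusion-count conventions, and example arrays are illustrative assumptions, not the study's data or code.

```python
# Hedged sketch of the validation metrics and the Phase 1 vs. Phase 2 comparison.
# All counts and arrays below are illustrative placeholders, not study data.
import numpy as np
from scipy import stats


def validation_metrics(tp: int, fp: int, fn: int) -> dict:
    """Precision, recall, and F1 from one participant's confusion counts.

    Here a 'Correct' judgment on a truly correct ChatGPT answer counts as TP,
    'Correct' on an incorrect answer as FP, and a missed correct answer as FN.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def compare_phases(phase1: np.ndarray, phase2: np.ndarray):
    """Paired comparison: t-test if the paired differences look normal, else Wilcoxon."""
    diffs = phase2 - phase1
    if stats.shapiro(diffs).pvalue > 0.05:   # Shapiro-Wilk on the paired differences
        stat, p = stats.ttest_rel(phase1, phase2)
        return "paired t-test", stat, p
    stat, p = stats.wilcoxon(phase1, phase2)
    return "Wilcoxon signed-rank", stat, p


if __name__ == "__main__":
    # Illustrative recall values for 15 participants in each phase (not real data).
    rng = np.random.default_rng(0)
    recall_p1 = np.clip(rng.normal(0.70, 0.10, 15), 0, 1)
    recall_p2 = np.clip(rng.normal(0.77, 0.14, 15), 0, 1)
    test, stat, p = compare_phases(recall_p1, recall_p2)
    print(f"{test}: statistic={stat:.2f}, p={p:.3f}")
    print(validation_metrics(tp=20, fp=1, fn=7))
```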

Keywords: Human-centered computing, Empirical studies in HCI, Human-computer interaction (HCI), Computing methodologies

DOI: 10.54941/ahfe1006707
