Is LLM a reliable risk detector? An evaluation of large language models in EMR-related medical incident detection
Open Access | Article | Conference Proceedings
Authors: Siyuan Zhang, Xiuzhu Gu
Abstract: Medical institutions typically rely on manual analysis of adverse medical events, which demands significant human resources, time, and specialized expertise. These demands limit how effectively potential risks can be identified. Can large language models (LLMs) leverage their strong natural language processing capabilities to serve as reliable risk detectors? In this pilot study, we evaluate the effectiveness of LLMs in identifying electronic medical record (EMR) system-related medical incident risks. We first curated a dataset of 573 medical incident reports that had been manually analyzed. Then, using a few-shot prompting approach, we designed instructions to evaluate five LLMs: GPT-4o, Claude 3.5 Sonnet, DeepSeek V3, Nova Pro, and Llama 3.1 405B. The results indicate that the best-performing LLMs accurately extracted more than half of the risk factors and generated reasonable explanations grounded in real-world case contexts. While general-purpose LLMs can provide some assistance, further optimization tailored to specific medical scenarios is needed to improve their handling of complex cases.
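For readers unfamiliar with the method named in the abstract, the following is a minimal sketch of what few-shot prompting for risk-factor extraction can look like. The prompt wording, example cases, risk-factor labels, and helper function are illustrative assumptions, not the authors' actual instructions or dataset; the model call uses the standard OpenAI Python client as one example of the five evaluated models.

```python
# Minimal sketch: few-shot risk-factor extraction from an incident report.
# The prompt text, example cases, and labels below are hypothetical; the
# paper's actual instructions and 573-report dataset are not reproduced here.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

FEW_SHOT_EXAMPLES = """\
Report: Nurse selected the wrong patient record in the EMR and charted vitals
under the wrong name.
Risk factors: patient misidentification; EMR interface design

Report: A medication order was duplicated because the system did not flag the
earlier entry.
Risk factors: duplicate order entry; missing decision-support alert
"""

def extract_risk_factors(report: str, model: str = "gpt-4o") -> str:
    """Ask an LLM to list EMR-related risk factors for one incident report."""
    prompt = (
        "You are a patient-safety analyst. For the incident report below, "
        "list the EMR-related risk factors and briefly explain each one.\n\n"
        f"{FEW_SHOT_EXAMPLES}\n"
        f"Report: {report}\n"
        "Risk factors:"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic output eases comparison across models
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(extract_risk_factors(
        "Lab results were entered into the EMR but the ordering physician "
        "was never notified, delaying treatment."
    ))
```

In an evaluation like the one described, the same prompt template would be run against each of the five models and the extracted risk factors compared with the manual analysis.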
Keywords: Healthcare safety, Large Language Models, Medical incidents, Prompt engineering, Risk factors
DOI: 10.54941/ahfe1006630