Is LLM a reliable risk detector? An evaluation of large language models in EMR-related medical incident detection
Open Access | Article | Conference Proceedings
Authors: Siyuan Zhang, Xiuzhu Gu
Abstract: Medical institutions typically rely on manual analysis of adverse medical events, which demands significant human resources, time, and specialized expertise. These demands limit how effectively potential risks can be identified. Can large language models (LLMs) leverage their strong natural language processing capabilities to serve as reliable risk detectors? In this pilot study, we evaluate the effectiveness of LLMs in identifying electronic medical record (EMR) system-related medical incident risks. We first curated a dataset of 573 medical incident reports that had been manually analyzed. Then, using a few-shot prompting approach, we designed instructions to evaluate five LLMs: GPT-4o, Claude 3.5 Sonnet, DeepSeek V3, Nova Pro, and Llama 3.1 405B. The results indicate that the best-performing LLMs accurately extracted more than half of the risk factors and generated reasonable explanations grounded in real-world case contexts. While general-purpose LLMs can provide some assistance, further optimization tailored to specific medical scenarios is needed to improve their handling of complex cases.
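For readers unfamiliar with the method named in the abstract, the following is a minimal sketch of what few-shot prompting for risk-factor extraction can look like. The prompt wording, example cases, risk-factor labels, and helper function are illustrative assumptions, not the authors' actual instructions or dataset; the model call uses the standard OpenAI Python client as one example of the five evaluated models.

```python
# Minimal sketch: few-shot risk-factor extraction from an incident report.
# The prompt text, example cases, and labels below are hypothetical; the
# paper's actual instructions and 573-report dataset are not reproduced here.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

FEW_SHOT_EXAMPLES = """\
Report: Nurse selected the wrong patient record in the EMR and charted vitals
under the wrong name.
Risk factors: patient misidentification; EMR interface design

Report: A medication order was duplicated because the system did not flag the
earlier entry.
Risk factors: duplicate order entry; missing decision-support alert
"""

def extract_risk_factors(report: str, model: str = "gpt-4o") -> str:
    """Ask an LLM to list EMR-related risk factors for one incident report."""
    prompt = (
        "You are a patient-safety analyst. For the incident report below, "
        "list the EMR-related risk factors and briefly explain each one.\n\n"
        f"{FEW_SHOT_EXAMPLES}\n"
        f"Report: {report}\n"
        "Risk factors:"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic output eases comparison across models
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(extract_risk_factors(
        "Lab results were entered into the EMR but the ordering physician "
        "was never notified, delaying treatment."
    ))
```

In an evaluation like the one described, the same prompt template would be run against each of the five models and the extracted risk factors compared with the manual analysis.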
Keywords: Healthcare safety, Large Language Models, Medical incidents, Prompt engineering, Risk factors
DOI: 10.54941/ahfe1006630