Assessment of the Capabilities of Multimodal Large Language Models in Locating and Resolving Ambiguities during Human-Robot Teaming

Open Access
Article
Conference Proceedings
Authors: William Valentine, Michael Wollowski

Abstract: Human-robot teaming is bound by the quality of communication, which includes maintaining a shared context among the team. Our work studies how well Multimodal Large Language Models (MMLLMs) identify and resolve ambiguities in order to create a clear context for teams. We developed a benchmark of images paired with ambiguous queries to replicate a teaming context with a human collaborator, and we evaluated several MMLLMs on this benchmark to assess their ability to identify and resolve ambiguities. We created a testing framework in which an MMLLM processes commands accompanied by an image; the framework then evaluates the model's performance in detecting and resolving ambiguities. To create a shared context between our human and robot collaborators, our system provides a picture that captures the viewpoint of the robot along with a query from the human collaborator. The chosen MMLLM processes this information and outputs both the portions of the query that are ambiguous and suggestions for clarification. A corrected version of the prompt may then be sent to a planner or a system that produces actionable commands. To evaluate each MMLLM's performance, we compare the ambiguities identified by the model with the expected ambiguities from the dataset. The top-performing MMLLM achieved 81% accuracy.
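
For illustration, the evaluation loop described in the abstract could be sketched roughly as follows. This is a minimal, hypothetical Python sketch, not the authors' implementation: the BenchmarkItem record, the evaluate function, and the dummy_model stand-in are assumptions introduced here. It assumes a model callable that, given an image and a query, returns the query fragments it judges ambiguous, and it scores accuracy as the fraction of benchmark items whose flagged fragments match the expected ambiguities.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class BenchmarkItem:
    """One benchmark entry: a robot's-eye-view image, a possibly ambiguous
    query, and the ambiguities the model is expected to flag (assumed schema)."""
    image_path: str
    query: str
    expected_ambiguities: List[str]


def evaluate(model: Callable[[str, str], List[str]],
             dataset: List[BenchmarkItem]) -> float:
    """Score an MMLLM on ambiguity identification.

    `model(image_path, query)` is assumed to return the list of query
    fragments the model judged ambiguous. An item counts as correct when
    the flagged fragments match the expected ambiguities for that item.
    """
    if not dataset:
        return 0.0
    correct = 0
    for item in dataset:
        flagged = model(item.image_path, item.query)
        if {f.lower() for f in flagged} == {e.lower() for e in item.expected_ambiguities}:
            correct += 1
    return correct / len(dataset)


if __name__ == "__main__":
    # Toy stand-in for an MMLLM call: flags the unresolved referent "it".
    dummy_model = lambda image, query: ["it"] if "it" in query.lower().split() else []
    data = [
        BenchmarkItem("scene_01.png", "Pick it up", ["it"]),
        BenchmarkItem("scene_02.png", "Move the red block left", []),
    ]
    print(f"Accuracy: {evaluate(dummy_model, data):.0%}")
```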

Keywords: Human-AI Collaboration, Ambiguity Resolution, Multimodal Large Language Models

DOI: 10.54941/ahfe1006048

