Magenta: Metrics and Evaluation Framework for Generative Agents based on LLMs
Abstract
Large Language Models (LLMs) have emerged as a driving force in Natural Language Processing (NLP), with applications spanning various domains, including the development of autonomous generative agents. Generative agents are software programs designed to believably simulate human behavior by harnessing the capabilities of LLMs. By repeatedly prompting the underlying LLM, these agents operate on a system architecture consisting of memory streams, reflection, and planning, which allows them to store experiences, learn from them, and translate insights into high-level action plans for interacting with their environment. This paper discusses the current landscape of language models and autonomous agents, their advantages and challenges, and the current state of evaluation, and proposes a new evaluation benchmark designed to provide a holistic perspective on their performance. Additionally, we examine the impact of fine-tuning such an LLM, evaluate it using our benchmark, and then propose a framework for evaluating both the agents and their underlying LLMs. Existing frameworks for evaluating LLMs and autonomous agents focus on single tasks and are limited in capturing their capabilities. We outline a methodology for evaluating autonomous agents' performance in responding to single-step and multi-step prompts. The process consists of three key stages: preparation of the data, preparation of the gold answers, and evaluation. We use 20 carefully crafted, unique prompts to challenge the agents, covering simple and complex questions. Using GPT-4, a state-of-the-art model, we generate initial responses, which are then verified to produce gold answers that indicate correctness and reveal the minimum number of steps required for task completion. 
Our evaluation framework relies on two metrics: the effort metric, which quantifies the steps taken by an autonomous agent, and the success rate, which measures its accuracy in achieving task objectives while also tracking the model's hallucinations. We conduct experiments with ten different models, representing the current landscape of NLP models, presenting each with the 20 unique prompts. Their responses are compared to our gold answers and gold steps (the optimal number of steps) to generate the evaluation metrics. Similarly, a fine-tuned model is evaluated with ten different questions, which test the agent's decision-making in selecting the correct tool and the model's ability to reach the correct answer to the user's question. This comprehensive approach ensures a thorough assessment of autonomous agents' capabilities and demonstrates the utility of these metrics, revealing how they can shed light on the strengths and weaknesses of various autonomous agents. As a step toward standardization, we propose transforming the evaluation of LLMs into an automated framework that accommodates all types of language models, agents, and LLM-based applications. Such an approach promises to establish a unified and comprehensive evaluation methodology, empowering users to make informed decisions when selecting, fine-tuning, and assessing the accuracy of underlying language models and their applications for different domains. In summary, this paper contributes to ongoing research on evaluating LLMs and autonomous agents by introducing a novel benchmark and proposing a framework that focuses on evaluating language models while keeping different knowledge domains in mind. 
Our framework will enhance our understanding of these technologies and serve as a valuable resource for researchers, engineers, and practitioners working in the ever-evolving landscape of NLP and autonomous systems.
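As a rough illustration of the two metrics described above, the following Python sketch computes a success rate, a hallucination rate, and an effort ratio from a handful of made-up runs. The abstract does not give the exact formulas, so everything here is an assumption: the `AgentRun` record, the function names, the definition of effort as steps taken divided by gold steps, and the sample data are all hypothetical and not taken from the paper.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class AgentRun:
    """One agent response to one prompt (hypothetical record layout)."""
    prompt_id: str
    steps_taken: int    # steps the agent actually used
    gold_steps: int     # minimum steps recorded with the gold answer
    correct: bool       # response matches the gold answer
    hallucinated: bool  # response contains unsupported content


def _require_runs(runs):
    if not runs:
        raise ValueError("at least one run is required")


def success_rate(runs):
    """Fraction of prompts answered correctly."""
    _require_runs(runs)
    return sum(r.correct for r in runs) / len(runs)


def hallucination_rate(runs):
    """Fraction of prompts whose response was flagged as hallucinated."""
    _require_runs(runs)
    return sum(r.hallucinated for r in runs) / len(runs)


def mean_effort_ratio(runs):
    """Assumed effort metric: mean of steps_taken / gold_steps.

    A value of 1.0 means the agent matched the gold step count;
    larger values mean extra steps were taken.
    """
    _require_runs(runs)
    return mean(r.steps_taken / r.gold_steps for r in runs)


# Invented sample data, for illustration only.
runs = [
    AgentRun("p1", steps_taken=2, gold_steps=2, correct=True, hallucinated=False),
    AgentRun("p2", steps_taken=6, gold_steps=3, correct=True, hallucinated=False),
    AgentRun("p3", steps_taken=4, gold_steps=4, correct=False, hallucinated=True),
    AgentRun("p4", steps_taken=3, gold_steps=2, correct=True, hallucinated=False),
]

print(success_rate(runs), hallucination_rate(runs), mean_effort_ratio(runs))
```

Normalizing by the gold step count is one possible design choice; it makes effort comparable across simple and complex prompts, whereas a raw step count would not be.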
Keywords: Large Language Models, Autonomous Agents, Evaluation, LLaMa, Generative Agents, LLMs, GPT
DOI: 10.54941/ahfe1004478


AHFE Open Access