Magenta: Metrics and Evaluation Framework for Generative Agents based on LLMs
Open Access
Article
Conference Proceedings
Authors: Sudarshan Kamath Barkur, Pratik Sitapara, Sven Leuschner, Sigurd Schacht
Abstract: Large Language Models (LLMs) have emerged as a driving force in Natural Language Processing (NLP), with applications spanning various domains, including the development of autonomous generative agents. Generative agents are software programs designed to simulate human behavior believably by harnessing the capabilities of LLMs. By repeatedly prompting the LLM, these agents operate on a system architecture consisting of memory streams, reflection, and planning, which allows them to store experiences, learn from them, and translate insights into high-level action plans for interacting with their environment. This paper discusses the current landscape of language models and autonomous agents, their advantages and challenges, and the current state of evaluation, and proposes an evaluation benchmark designed to provide a holistic perspective on their performance. Additionally, we examine the impact of fine-tuning such an LLM, evaluate it using our benchmark, and then propose a framework for evaluating both the agents and their underlying LLMs. Existing frameworks for evaluating LLMs and autonomous agents focus on single tasks and capture their capabilities only to a limited extent. We outline the methodology for evaluating autonomous agents' performance in responding to single-step and multi-step prompts. The process consists of three key stages: preparation of the data, preparation of the gold answers, and evaluation. We use 20 carefully crafted unique prompts, covering simple and complex questions, to challenge the agents. Using GPT-4, a state-of-the-art model, we generate the initial responses, which undergo rigorous verification to produce gold answers that indicate correctness and reveal the minimum number of steps required for task completion.
Our evaluation framework relies on two critical metrics: the effort metric, which quantifies the steps taken by autonomous agents, and the success rate, which measures their accuracy in achieving task objectives while also keeping track of the model's hallucinations. We conduct experiments with ten different models, representing the current landscape of natural language processing models, and present each with the 20 unique prompts. Their responses are compared to our gold answers and gold steps (the optimal number of steps) to generate the evaluation metrics. Similarly, a fine-tuned model is evaluated with ten different questions, which test the agent's decision-making in selecting the correct tool and the model's ability to reach the correct conclusion to the user's question. This comprehensive approach ensures a thorough assessment of autonomous agents' capabilities. It demonstrates the utility of these metrics, revealing how they can shed light on the strengths and weaknesses of various autonomous agents. As a step toward standardization, we propose transforming the evaluation process of LLMs into an automated framework that accommodates all types of language models, agents, and LLM-based applications. Such an approach promises to establish a unified and comprehensive evaluation methodology, empowering users to make informed decisions when selecting, fine-tuning, and assessing the accuracy of underlying language models and their applications for different domains. In summary, this paper contributes to the ongoing research on evaluating LLMs and autonomous agents by introducing a novel benchmark and proposing a framework, focusing on evaluating the language models while keeping different knowledge domains in mind.
Our framework will enhance our understanding of these technologies and serve as a valuable resource for researchers, engineers, and practitioners working in the ever-evolving landscape of NLP and autonomous systems.
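The two metrics described in the abstract can be illustrated with a minimal sketch. The abstract does not give exact formulas, so everything below is an assumption made for illustration: the `Run` record, the field names, and in particular the definition of effort as the ratio of steps taken to gold steps are hypothetical and are not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Run:
    """One agent response to one prompt (hypothetical record layout)."""
    prompt_id: str
    steps_taken: int    # steps the agent actually used
    gold_steps: int     # minimum steps from the gold answer, must be > 0
    correct: bool       # response matches the gold answer
    hallucinated: bool  # response contains a hallucination


def summarize(runs: List[Run]) -> Dict[str, float]:
    """Aggregate success rate, hallucination rate, and mean effort ratio.

    Assumption: effort is expressed as steps_taken / gold_steps, so 1.0
    means the agent needed exactly the optimal number of steps.
    """
    if not runs:
        raise ValueError("at least one run is required")
    n = len(runs)
    return {
        "success_rate": sum(r.correct for r in runs) / n,
        "hallucination_rate": sum(r.hallucinated for r in runs) / n,
        "mean_effort_ratio": sum(r.steps_taken / r.gold_steps for r in runs) / n,
    }


if __name__ == "__main__":
    # Made-up example values, not results from the paper.
    example = [
        Run("p1", steps_taken=4, gold_steps=2, correct=True, hallucinated=False),
        Run("p2", steps_taken=3, gold_steps=3, correct=False, hallucinated=True),
    ]
    print(summarize(example))
```

A ratio is used here only because it makes prompts with different gold step counts comparable; a difference (steps taken minus gold steps) would be an equally plausible reading of the abstract.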
Keywords: Large Language Models, Autonomous Agents, Evaluation, LLaMa, Generative Agents, LLMs, GPT
DOI: 10.54941/ahfe1004478