Towards a Proper Evaluation of Automated Conversational Systems

Open Access
Conference Proceedings
Authors: Abraham Sanders, Mara Schwartz, Albert Chang, Shannon Briggs, Jonas Braasch, Dakuo Wang, Mei Si, Tomek Strzalkowski

Abstract: Efficient evaluation of dialogue agents is a major problem in conversational AI, with current research still relying largely on human studies for method validation. Recently, there has been a trend toward the use of automatic self-play and bot-bot evaluation as an approximation for human ratings of conversational systems. Such methods promise to alleviate the time and financial costs associated with human evaluation, and currently proposed methods show moderate to strong correlation with human judgements. In this study, we further investigate the fitness of end-to-end self-play and bot-bot interaction for dialogue system evaluation. Specifically, we perform a human study to confirm self-play evaluations of a recently proposed agent that implements a GPT-2-based response generator on the Persuasion For Good charity solicitation task. This agent leverages Progression Function (PF) models to predict the evolving acceptability of an ongoing dialogue and uses dialogue rollouts to proactively simulate how candidate responses may impact the future success of the conversation. The agent was evaluated in an automatic self-play setting, using automatic metrics to estimate sentiment and intent to donate in each simulated dialogue. This evaluation indicated that sentiment and intent to donate were higher (p < 0.05) across dialogues involving the progression-aware agents with rollouts, compared to a baseline agent with no rollout-based planning mechanism. To validate the use of self-play in this setting, we follow up by conducting a human evaluation of this same agent on a range of factors including convincingness, aggression, competence, confidence, friendliness, and task utility on the same Persuasion For Good solicitation task.
Results show that human users agree with previously reported automatic self-play results with respect to agent sentiment, specifically showing improvement in friendliness and confidence in the experimental condition; however, we also discover that for the same agent, humans reported a lower desire to use it in the future compared to the baseline. We perform a qualitative sentiment analysis of participant feedback to explore possible reasons for this, and discuss implications for self-play and bot-bot interaction as a general framework for evaluating conversational systems.

Keywords: Dialogue System Evaluation, Dialogue Agent, Dialogue Planning, Conversational Artificial Intelligence, Natural Language Processing

DOI: 10.54941/ahfe1003276
