Towards a Proper Evaluation of Automated Conversational Systems
Abstract
Efficient evaluation of dialogue agents is a major problem in conversational AI, with current research still relying largely on human studies for method validation. Recently, there has been a trend toward the use of automatic self-play and bot-bot evaluation as an approximation for human ratings of conversational systems. Such methods promise to alleviate the time and financial costs associated with human evaluation, and currently proposed methods show moderate to strong correlation with human judgements. In this study, we further investigate the fitness of end-to-end self-play and bot-bot interaction for dialogue system evaluation. Specifically, we perform a human study to confirm self-play evaluations of a recently proposed agent that implements a GPT-2-based response generator on the Persuasion For Good charity solicitation task. This agent leverages Progression Function (PF) models to predict the evolving acceptability of an ongoing dialogue and uses dialogue rollouts to proactively simulate how candidate responses may impact the future success of the conversation. The agent was evaluated in an automatic self-play setting, using automatic metrics to estimate sentiment and intent to donate in each simulated dialogue. This evaluation indicated that sentiment and intent to donate were higher (p < 0.05) across dialogues involving the progression-aware agent with rollouts, compared to a baseline agent with no rollout-based planning mechanism. To validate the use of self-play in this setting, we follow up by conducting a human evaluation of this same agent on a range of factors including convincingness, aggression, competence, confidence, friendliness, and task utility on the same Persuasion For Good solicitation task.
Results show that human users agree with previously reported automatic self-play results with respect to agent sentiment, specifically showing improvement in friendliness and confidence in the experimental condition; however, we also find that participants reported a lower desire to use this same agent in the future than the baseline. We perform a qualitative sentiment analysis of participant feedback to explore possible reasons for this, and discuss implications for self-play and bot-bot interaction as a general framework for evaluating conversational systems.
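The rollout-based planning described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the parameters n_rollouts and depth, and the toy stand-ins for the response simulator and the Progression Function are all assumptions made for illustration. The idea shown is only the general one: for each candidate response, simulate a few future turns, score the resulting dialogue, and pick the candidate with the best average score.

```python
from statistics import mean
from typing import Callable, List, Sequence

Dialogue = List[str]


def select_response(
    history: Sequence[str],
    candidates: Sequence[str],
    simulate_turn: Callable[[Dialogue], str],
    progression: Callable[[Dialogue], float],
    n_rollouts: int = 3,
    depth: int = 2,
) -> str:
    """Return the candidate whose simulated continuations score highest on average.

    simulate_turn and progression are hypothetical placeholders for a response
    generator and a Progression Function model, respectively.
    """

    def rollout_score(candidate: str) -> float:
        scores = []
        for _ in range(n_rollouts):
            # Start each rollout from the real history plus the candidate.
            dialogue = list(history) + [candidate]
            # Simulate a fixed number of future turns.
            for _ in range(depth):
                dialogue.append(simulate_turn(dialogue))
            # Score the simulated dialogue as a whole.
            scores.append(progression(dialogue))
        return mean(scores)

    return max(candidates, key=rollout_score)


# Toy stand-ins (assumptions, not real models): the simulated partner reacts
# warmly only to polite requests, and the score is the share of warm turns.
def toy_simulate(dialogue: Dialogue) -> str:
    return "That sounds nice." if "please" in dialogue[-1].lower() else "I am not sure."


def toy_progression(dialogue: Dialogue) -> float:
    return sum("nice" in turn.lower() for turn in dialogue) / len(dialogue)


best = select_response(
    history=["Hi"],
    candidates=["Donate now.", "Could you please consider donating?"],
    simulate_turn=toy_simulate,
    progression=toy_progression,
)
print(best)
```

In a real system the toy functions would be replaced by a learned generator and a learned progression model, and the rollouts would typically be stochastic, which is why the sketch averages over several of them.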
Keywords: Dialogue System Evaluation, Dialogue Agent, Dialogue Planning, Conversational Artificial Intelligence, Natural Language Processing
DOI: 10.54941/ahfe1003276


AHFE Open Access