The Consistency between Popular Generative Artificial Intelligence (AI) Robots in Evaluating the User Experience of Mobile Device Operating Systems

Open Access
Article
Conference Proceedings
Authors: Victor K Y Chan

Abstract: This article studies the consistency, along with other auxiliary comparisons, between popular generative artificial intelligence (AI) robots in evaluating various perceived user experience dimensions of mobile device operating system versions, specifically iOS and Android versions. Several robots were tried, of which only Dragonfly and GPT-4 proved eligible for in-depth investigation. Each of the two was individually asked to assign rating scores to six major dimensions of the operating system versions, namely (1) efficiency, (2) effectiveness, (3) learnability, (4) satisfaction, (5) accessibility, and (6) security. These dimensions are considered from the standpoint of perceived user experience rather than that of any “physical” technology. For each of the two robots, the minimum, the maximum, the range, and the standard deviation of the rating scores for each of the six dimensions were computed across all versions. The rating score difference between the two robots was then calculated for each dimension and each version, and the mean of the absolute values, the minimum, the maximum, the range, and the standard deviation of these differences were computed for each dimension across all versions. A paired-sample t-test was applied, for each dimension, to the rating score differences between the two robots over all versions. Finally, a correlation coefficient of the rating scores between the two robots was computed for each dimension across all versions. These computations were intended to establish whether the two robots discriminated between versions in evaluating each dimension, whether either robot systematically underrated or overrated any dimension relative to the other, and whether the two robots were consistent in evaluating each dimension across the versions.
The results showed that discrimination was apparent in the evaluation of all dimensions; that GPT-4 systematically underrated satisfaction (p = 0.002 < 0.05) and security (p = 0.008 < 0.05) compared with Dragonfly; and that the evaluations by the two robots were strongly consistent for the six dimensions, with correlation coefficients ranging from 0.679 to 0.892 (p from 0.000 to 0.003 < 0.05). By analogy with the concept of convergent validity, this consistency implies that the evaluation of these mobile device operating system versions by either of these two popular generative AI robots is at least partially trustworthy.
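
The comparison procedure described in the abstract can be illustrated for a single dimension with a minimal Python sketch. It is not the author's code: the function names (`describe`, `paired_t`, `pearson_r`) and the two score lists are hypothetical, the scores are made up purely for illustration and are not data from the paper, and the correlation is assumed to be Pearson's, which the abstract does not specify. Only the standard library is used, so the sketch returns the t statistic and its degrees of freedom; turning these into a p-value needs a t-distribution CDF from a statistics package.

```python
import math
import statistics


def describe(values):
    """Minimum, maximum, range, and sample standard deviation of a list."""
    lo, hi = min(values), max(values)
    return {"min": lo, "max": hi, "range": hi - lo, "sd": statistics.stdev(values)}


def paired_t(a, b):
    """Paired-sample t statistic and degrees of freedom for scores a vs. b."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    standard_error = statistics.stdev(diffs) / math.sqrt(n)
    return statistics.mean(diffs) / standard_error, n - 1


def pearson_r(a, b):
    """Pearson correlation coefficient between two equally long score lists."""
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Hypothetical ratings of ONE dimension for five OS versions (illustration only).
robot_a = [7, 8, 6, 9, 8]
robot_b = [6, 8, 5, 8, 8]

# Per-robot spread across versions: does each robot discriminate between versions?
spread_a = describe(robot_a)
spread_b = describe(robot_b)

# Per-version differences between the robots and their summary statistics.
diffs = [x - y for x, y in zip(robot_a, robot_b)]
diff_summary = describe(diffs)
mean_abs_diff = statistics.mean(abs(d) for d in diffs)

# Systematic over- or underrating (paired t-test) and consistency (correlation).
t_stat, dof = paired_t(robot_a, robot_b)
r = pearson_r(robot_a, robot_b)

print(spread_a, spread_b)
print(diff_summary, mean_abs_diff)
print(t_stat, dof, r)
```

In this sketch a t statistic far from zero would suggest that one robot rates the dimension systematically higher than the other, while a correlation coefficient close to 1 would suggest that both robots rank the versions similarly even if their absolute scores differ. Repeating the same steps for each of the six dimensions reproduces the structure of the analysis.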

Keywords: Artificial intelligence, robots, perceived user experience, mobile device operating system versions

DOI: 10.54941/ahfe1004193
