Parallelising 2D-CNNs and transformers: A Cognitive-based approach for Automatic Recognition of Learners’ English Proficiency

Open Access
Conference Proceedings
Authors: Meishu Song, Emilia Parada-Cabaleiro, Zijiang Yang, Xin Jing, Kazumasa Togami, Kun Qian, Björn Schuller, Yamamoto Yoshiharu

Abstract: Learning English as a foreign language requires extensive use of cognitive capacity, memory, and motor skills in order to express one’s thoughts orally in a clear manner. Current speech recognition intelligence focuses on recognising learners’ oral proficiency from the perspectives of fluency, prosody, pronunciation, and grammar. However, the capacity to express an idea clearly and naturally is a high-level cognitive behaviour that can hardly be represented by these detailed, segmental dimensions, which indeed do not fulfil the requirements of English learners and teachers. This work aims to use state-of-the-art deep learning techniques to recognise English speaking proficiency at a cognitive level, i.e., a learner’s ability to clearly organise their own thoughts when expressing an idea in English as a foreign language. For this, we collected the “Oral English for Japanese Learners” Dataset (OEJL-DB), a corpus of recordings by 82 students of a Japanese high school expressing their ideas in English on 5 different topics. Annotations concerning the clarity of learners’ thoughts are given by 5 English teachers according to 2 classes: clear and unclear. In total, the dataset includes 7.6 hours of audio data, with an average length of 66 seconds for each oral English presentation. As an initial cognitive-based method to identify learners’ speaking proficiency, we propose an architecture based on the parallelisation of CNNs and Transformers. Combining the strength of CNNs in spatial feature representation with that of the Transformer in sequence encoding, we achieve 89.4 % accuracy and 87.6 % Unweighted Average Recall (UAR), results which outperform those from the ResNet architectures (89.2 % accuracy and 86.3 % UAR). Our promising outcomes reveal that speech intelligence can be efficiently applied to “grasp” high-level cognitive behaviours, a new area of research which seems to have great potential for further investigation.
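
The abstract reports both accuracy and Unweighted Average Recall (UAR). As a minimal sketch of how the two metrics differ on a two-class task, the Python snippet below computes UAR as the mean of the per-class recalls. The function names and the toy labels are hypothetical and are not taken from OEJL-DB or from the authors' code.

```python
from collections import defaultdict


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls; every class counts equally, whatever its size."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("label lists must be non-empty and of equal length")
    total = defaultdict(int)    # samples per true class
    correct = defaultdict(int)  # correctly recognised samples per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)


# Hypothetical, deliberately imbalanced toy labels (assumption, not real data):
# 8 "clear" presentations and 2 "unclear" ones, with one "unclear" missed.
y_true = ["clear"] * 8 + ["unclear"] * 2
y_pred = ["clear"] * 8 + ["clear", "unclear"]

print(accuracy(y_true, y_pred))                   # 9 of 10 correct
print(unweighted_average_recall(y_true, y_pred))  # mean of recalls 1.0 and 0.5
```

On imbalanced data the two metrics diverge: a missed minority-class sample lowers UAR more than accuracy, which is one reason both are commonly reported together. The same quantity is often called macro-averaged recall.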

Keywords: Speech Intelligence, Transformer, English Proficiency

DOI: 10.54941/ahfe1001000
