A Bilingual Study of Multi-Word Expressions in Journalistic Texts: Fine-Tuning BERT with a Head-Based Masking Technique

Open Access Article | Conference Proceedings
Authors: Christina Valavani, Stavros Giannakis

Abstract: Machine Translation systems combine the advantages of Artificial Intelligence and Human Intelligence, yet achieving "human parity" requires overcoming persistent challenges in modeling complex linguistic structures. This study investigates the representation of financial Multi-Word Expressions (MWEs) for the German-Greek language pair, with the primary goal of improving modeling accuracy for Neural Machine Translation (NMT) systems. Errors observed in the translation of financial terminology motivate the different steps of the numerical-representation (vectorization) process. Special emphasis is placed on the computational modeling of specialized and general language in order to address, inter alia, issues specific to financial language and terminology. The study focuses on optimizing the numerical vectorization of multi-word terms to solve the "Distributed Semantic Problem" frequently posed by German separable verbs. By introducing a novel Head-Based Masking technique, we demonstrate a 56% improvement in semantic clustering over standard baselines. These results confirm that enhanced vector handling of MWEs provides a superior linguistic foundation, directly addressing a significant challenge for the next generation of precision-oriented Artificial Intelligence applications. The Head-Based Masking technique and a 4-Component Embedding Architecture (E_token + P_intra + E_phrase + P_inter) improve the numerical representation of MWEs in German financial and journalistic texts while also enhancing overall language representation. The main goal is to resolve the challenges of distributed semantic representation (e.g., separable verbs such as "brach ... ein") by forcing the model to treat distant components as a single semantic unit, creating tighter vector clusters for domain-specific terminology. The evaluation script tests the model on a curated test set (defined in evaluation.py) containing 14 MWE pairs across four categories: Financial Causality, Functional Verbs, Separable Verbs, and Journalistic Phrasing.
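
To make the masking idea concrete, the following is a minimal sketch of how joint masking of MWE components could look during masked-language-model fine-tuning. The model name (bert-base-german-cased), the head_based_mask helper, and the component-matching logic are illustrative assumptions, not the authors' actual implementation.

```python
# Minimal sketch: every component of a known MWE (e.g. the separable verb
# "brach ... ein") is replaced by [MASK] in the same training example, so
# the MLM objective must recover the expression as a single semantic unit.
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-german-cased")

def head_based_mask(sentence, mwe_components):
    """Mask all wordpieces of one MWE, however far apart they sit."""
    enc = tokenizer(sentence, return_tensors="pt")
    input_ids = enc["input_ids"].clone()
    labels = torch.full_like(input_ids, -100)  # -100 is ignored by the MLM loss

    # Collect the wordpiece ids of every MWE component (simplified: this
    # masks all occurrences of those pieces, which suffices for a sketch).
    component_ids = {
        tid
        for comp in mwe_components
        for tid in tokenizer(comp, add_special_tokens=False)["input_ids"]
    }
    for pos, tid in enumerate(input_ids[0].tolist()):
        if tid in component_ids:
            labels[0, pos] = tid                         # target: original token
            input_ids[0, pos] = tokenizer.mask_token_id  # input: [MASK]
    return input_ids, labels

# Both parts of the separable verb are masked in one example:
ids, labels = head_based_mask("Die Aktienkurse brachen gestern ein.",
                              ["brachen", "ein"])
```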
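
The 4-component formula E_token + P_intra + E_phrase + P_inter can be read as four embedding tables whose outputs are summed per token. The sketch below is one plausible reading under assumed dimensions and inputs (intra-phrase position, a shared phrase id per MWE, and sentence position); it is not the authors' published architecture.

```python
# One plausible reading of E_token + P_intra + E_phrase + P_inter:
# four embedding tables summed per token. All sizes are assumptions.
import torch.nn as nn

class FourComponentEmbedding(nn.Module):
    def __init__(self, vocab_size=30000, n_phrases=512,
                 max_intra=16, max_inter=512, dim=768):
        super().__init__()
        self.e_token = nn.Embedding(vocab_size, dim)  # E_token: wordpiece identity
        self.p_intra = nn.Embedding(max_intra, dim)   # P_intra: position inside the MWE
        self.e_phrase = nn.Embedding(n_phrases, dim)  # E_phrase: shared id per MWE/phrase
        self.p_inter = nn.Embedding(max_inter, dim)   # P_inter: position in the sentence

    def forward(self, token_ids, intra_pos, phrase_ids, inter_pos):
        # Components of the same MWE share e_phrase(phrase_ids), which is
        # what pulls their vectors toward a single cluster.
        return (self.e_token(token_ids) + self.p_intra(intra_pos)
                + self.e_phrase(phrase_ids) + self.p_inter(inter_pos))
```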
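
The reported 56% clustering gain is measured against standard baselines on the 14 curated MWE pairs. A hedged sketch of how such pairwise similarity might be scored follows; the phrase_vector helper, the mean-pooling choice, and the example pair are hypothetical, and the actual test set from evaluation.py is not reproduced here.

```python
# Hedged sketch of a pairwise-similarity evaluation in the spirit of the
# abstract's evaluation.py: compare mean-pooled BERT vectors of two MWEs.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-german-cased")
model = AutoModel.from_pretrained("bert-base-german-cased")
model.eval()

def phrase_vector(text):
    """Mean-pool the last hidden layer over all tokens of `text`."""
    enc = tok(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state  # shape (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0)

# Hypothetical pair in the "Separable Verbs" category:
a = phrase_vector("Die Kurse brachen ein.")
b = phrase_vector("Die Kurse stürzten ab.")
print(f"cosine similarity: {torch.cosine_similarity(a, b, dim=0).item():.3f}")
```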

Keywords: Word embedding, Neural machine translation, BERT, Double masking technique, AI

DOI: 10.54941/ahfe1007204
