A Bilingual Study of Multi-Word Expressions in Journalistic Texts: Fine-Tuning BERT with a Head-Based Masking Technique

Open Access Article | Conference Proceedings
Authors: Christina Valavani, Stavros Giannakis

Abstract: Machine Translation systems combine the advantages of Artificial Intelligence and Human Intelligence, yet achieving "human parity" requires overcoming persistent challenges in modeling complex linguistic structures. This study investigates the representation of financial Multi-Word Expressions (MWEs) for the German-Greek language pair, with the primary goal of improving modeling accuracy for Neural Machine Translation (NMT) systems. Errors observed in the translation of financial terminology motivate the different steps of the numerical-representation (vectorization) process. Special emphasis is placed on the computational modeling of specialized and general language in order to address, inter alia, issues specific to financial language and terminology. The study focuses on optimizing the numerical vectorization of multi-word terms to solve the "Distributed Semantic Problem" frequently posed by German separable verbs. By introducing a novel Head-Based Masking technique, we demonstrate a 56% improvement in semantic clustering over standard baselines. These results confirm that enhanced vector handling of MWEs provides a superior linguistic foundation, directly addressing a significant challenge for the next generation of precision-oriented Artificial Intelligence applications. The Head-Based Masking technique and a 4-Component Embedding Architecture (E_token + P_intra + E_phrase + P_inter) improve the numerical representation of MWEs in German financial and journalistic texts while also enhancing overall language representation. The main goal is to resolve the challenges of distributed semantic representation (e.g., separable verbs such as "brach ... ein") by forcing the model to treat distant components as a single semantic unit, creating tighter vector clusters for domain-specific terminology. The evaluation script tests the model on a curated test set (defined in evaluation.py) containing 14 MWE pairs across four categories: Financial Causality, Functional Verbs, Separable Verbs, and Journalistic Phrasing.
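
To make the masking idea concrete, the following is a minimal sketch of how joint masking of MWE components could look during masked-language-model fine-tuning. The model name (bert-base-german-cased), the head_based_mask helper, and the component-matching logic are illustrative assumptions, not the authors' actual implementation.

```python
# Minimal sketch: every component of a known MWE (e.g. the separable verb
# "brach ... ein") is replaced by [MASK] in the same training example, so
# the MLM objective must recover the expression as a single semantic unit.
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-german-cased")

def head_based_mask(sentence, mwe_components):
    """Mask all wordpieces of one MWE, however far apart they sit."""
    enc = tokenizer(sentence, return_tensors="pt")
    input_ids = enc["input_ids"].clone()
    labels = torch.full_like(input_ids, -100)  # -100 is ignored by the MLM loss

    # Collect the wordpiece ids of every MWE component (simplified: this
    # masks all occurrences of those pieces, which suffices for a sketch).
    component_ids = {
        tid
        for comp in mwe_components
        for tid in tokenizer(comp, add_special_tokens=False)["input_ids"]
    }
    for pos, tid in enumerate(input_ids[0].tolist()):
        if tid in component_ids:
            labels[0, pos] = tid                         # target: original token
            input_ids[0, pos] = tokenizer.mask_token_id  # input: [MASK]
    return input_ids, labels

# Both parts of the separable verb are masked in one example:
ids, labels = head_based_mask("Die Aktienkurse brachen gestern ein.",
                              ["brachen", "ein"])
```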
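
The 4-component formula E_token + P_intra + E_phrase + P_inter can be read as four embedding tables whose outputs are summed per token. The sketch below is one plausible reading under assumed dimensions and inputs (intra-phrase position, a shared phrase id per MWE, and sentence position); it is not the authors' published architecture.

```python
# One plausible reading of E_token + P_intra + E_phrase + P_inter:
# four embedding tables summed per token. All sizes are assumptions.
import torch.nn as nn

class FourComponentEmbedding(nn.Module):
    def __init__(self, vocab_size=30000, n_phrases=512,
                 max_intra=16, max_inter=512, dim=768):
        super().__init__()
        self.e_token = nn.Embedding(vocab_size, dim)  # E_token: wordpiece identity
        self.p_intra = nn.Embedding(max_intra, dim)   # P_intra: position inside the MWE
        self.e_phrase = nn.Embedding(n_phrases, dim)  # E_phrase: shared id per MWE/phrase
        self.p_inter = nn.Embedding(max_inter, dim)   # P_inter: position in the sentence

    def forward(self, token_ids, intra_pos, phrase_ids, inter_pos):
        # Components of the same MWE share e_phrase(phrase_ids), which is
        # what pulls their vectors toward a single cluster.
        return (self.e_token(token_ids) + self.p_intra(intra_pos)
                + self.e_phrase(phrase_ids) + self.p_inter(inter_pos))
```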
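
The reported 56% clustering gain is measured against standard baselines on the 14 curated MWE pairs. A hedged sketch of how such pairwise similarity might be scored follows; the phrase_vector helper, the mean-pooling choice, and the example pair are hypothetical, and the actual test set from evaluation.py is not reproduced here.

```python
# Hedged sketch of a pairwise-similarity evaluation in the spirit of the
# abstract's evaluation.py: compare mean-pooled BERT vectors of two MWEs.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-german-cased")
model = AutoModel.from_pretrained("bert-base-german-cased")
model.eval()

def phrase_vector(text):
    """Mean-pool the last hidden layer over all tokens of `text`."""
    enc = tok(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state  # shape (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0)

# Hypothetical pair in the "Separable Verbs" category:
a = phrase_vector("Die Kurse brachen ein.")
b = phrase_vector("Die Kurse stürzten ab.")
print(f"cosine similarity: {torch.cosine_similarity(a, b, dim=0).item():.3f}")
```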

Keywords: Word embedding, Neural machine translation, BERT, Double masking technique, AI

DOI: 10.54941/ahfe1007204
