A bilingual study of Multi-Word Expressions in Journalistic Texts: Fine-tune BERT with Head-Based Masking Technique
Abstract
Machine Translation systems combine the advantages of Artificial Intelligence and Human Intelligence, yet achieving "human parity" requires overcoming persistent challenges in modeling complex linguistic structures. This study investigates the representation of financial Multi-Word Expressions (MWEs) for the German-Greek language pair, with the primary goal of improving modeling accuracy for Neural Machine Translation (NMT) systems. Errors observed in the translation of financial terminology motivate the individual steps of the numerical representation (vectorization) process. Special emphasis is placed on the computational modeling of special and general language in order to deal, inter alia, with financial language issues and terms. The study focuses on optimizing the vectorization of multi-word terms to address the "Distributed Semantic Problem" often found in German separable verbs (e.g., brach ... ein), where the components of a single semantic unit are separated by other words. We introduce a novel Head-Based Masking technique and a 4-Component Embedding Architecture (E_token + P_intra + E_phrase + P_inter), which improve the numerical representation of MWEs in German financial and journalistic texts while also enhancing overall language representation. The approach forces the model to treat distant components as a single semantic unit, creating tighter vector clusters for domain-specific terminology, and yields a 56% improvement in semantic clustering compared to standard baselines. These results confirm that enhanced vector handling of MWEs provides a superior linguistic foundation, directly addressing a significant challenge for the next generation of precision-oriented Artificial Intelligence applications.
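To make the four-component sum E_token + P_intra + E_phrase + P_inter concrete, the following is a minimal sketch of one possible reading of it. The abstract does not define the components, so the interpretation is an assumption: E_token is a per-token vector, P_intra encodes a token's position inside its MWE, E_phrase identifies the MWE it belongs to, and P_inter encodes the position of the whole unit in the sentence. The helper names (`make_table`, `compose_embeddings`), the toy dimension, and the example sentence are hypothetical, and random vectors stand in for learned weights.

```python
import random

DIM = 8  # toy width chosen for readability; not taken from the paper


def make_table(num_rows, seed):
    """Return a toy lookup table of small random vectors (stand-in for learned weights)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(DIM)] for _ in range(num_rows)]


def compose_embeddings(token_ids, intra_pos, phrase_ids, inter_pos, tables):
    """Sum the four component vectors for every token position."""
    vectors = []
    for tok, p_in, phr, p_out in zip(token_ids, intra_pos, phrase_ids, inter_pos):
        parts = (
            tables["E_token"][tok],
            tables["P_intra"][p_in],
            tables["E_phrase"][phr],
            tables["P_inter"][p_out],
        )
        # Element-wise sum across the four component vectors.
        vectors.append([sum(column) for column in zip(*parts)])
    return vectors


# Hypothetical example: "Der Kurs brach gestern ein",
# where "brach ... ein" is treated as one separable-verb MWE.
tokens = ["Der", "Kurs", "brach", "gestern", "ein"]
token_ids = [0, 1, 2, 3, 4]
phrase_ids = [0, 0, 1, 0, 1]  # assumed: 0 = not in an MWE, 1 = "brach ... ein"
intra_pos = [0, 0, 1, 0, 2]   # assumed: position inside the MWE, 0 = outside
inter_pos = [0, 1, 2, 3, 2]   # assumed: both MWE parts share sentence slot 2

tables = {
    "E_token": make_table(5, seed=1),
    "P_intra": make_table(3, seed=2),
    "E_phrase": make_table(2, seed=3),
    "P_inter": make_table(4, seed=4),
}
vectors = compose_embeddings(token_ids, intra_pos, phrase_ids, inter_pos, tables)
```

Under these assumptions, "brach" and "ein" receive identical E_phrase and P_inter contributions even though another word stands between them, which is one way distant components could be pulled toward a shared representation.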
The evaluation script tests the model on a curated test set (defined in evaluation.py) containing 14 MWE pairs across four categories: Financial Causality, Functional Verbs, Separable Verbs, and Journalistic Phrasing.
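A pair-based evaluation of this kind could be sketched as follows. This is not the content of evaluation.py, which is not shown here. It assumes that each test item is a (category, expression, expression) triple and that clustering quality is approximated by the mean cosine similarity of the paired vectors per category. The function names, the toy vectors, and the placeholder items "a", "b", "c" are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_similarity_by_category(pairs, embed):
    """Average the pairwise cosine similarity within each category.

    pairs: iterable of (category, left, right) triples.
    embed: callable mapping an expression to its vector (assumed interface).
    """
    scores = {}
    for category, left, right in pairs:
        scores.setdefault(category, []).append(cosine(embed(left), embed(right)))
    return {category: sum(vals) / len(vals) for category, vals in scores.items()}


# Placeholder vectors and pairs for illustration only; not the paper's test set.
toy_vectors = {
    "a": [1.0, 0.0],
    "b": [1.0, 0.0],
    "c": [0.0, 1.0],
}
toy_pairs = [
    ("Separable Verbs", "a", "b"),
    ("Separable Verbs", "a", "c"),
    ("Functional Verbs", "a", "b"),
]
result = mean_similarity_by_category(toy_pairs, toy_vectors.__getitem__)
```

Running the same function with a baseline encoder and a fine-tuned encoder as `embed` would give per-category scores that can be compared directly.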
Keywords: Word embedding, Neural machine translation, BERT, Double masking technique, AI
DOI: 10.54941/ahfe1007204


AHFE Open Access