A Unified Multimodal Pipeline for Luxembourgish Language Learning: Curriculum-Grounded Retrieval and LAM-Driven Interaction

AHFE International

Accelerating Open Access Science in Human Factors Engineering and Human-Centered Computing

Open Access

Article

Conference Proceedings

Authors: Hedi Tebourbi, Sana Nouzri, Piotr Kluczynski, Yazan Mualla, Abdeljalil Abbas-Turki

Abstract: We introduce a unified multimodal system that transforms the official INL Luxembourgish textbook into an interactive, AI-driven tutoring environment. The work combines two complementary components: (1) a full pipeline for digitizing, structuring, and retrieving textbook exercises, and (2) a Large Action Model (LAM)-based interaction layer enabling an LLM tutor to surface relevant visuals dynamically during conversation.

The pipeline begins with a semi-automated exercise extraction stage, where textbook pages are processed through a computer vision and GPT-4.1 vision–assisted workflow to isolate individual exercises, followed by targeted manual correction when needed. These extracted images are then passed through an OCR and cleanup stage, segmented into coherent units, enriched with metadata (chapter, theme, exercise type), and embedded into a vector store. The result is a structured and searchable curriculum-grounded knowledge base.

On top of this foundation, the Image Retrieval Tool, implemented through LangGraph and OpenAI function calling, enables the tutor agent to fetch pedagogically aligned visuals based on ongoing dialogue. The system fuses text generation with image retrieval, allowing the tutor to present exercises, illustrations, and contextually relevant content directly within the learner interface. This architecture ensures that visuals appear naturally during tutoring sessions without manual selection or preloading.

Evaluation demonstrates that the retrieval mechanism consistently returns accurate, relevant images aligned with the student’s topic of study. Latency analysis indicates that performance bottlenecks arise mainly from LLM generation rather than retrieval or rendering. User perception studies confirm improved clarity, engagement, and trust when multimodal elements are integrated into the learning interaction.

Together, the merged system offers a practical blueprint for curriculum-grounded, multimodal language learning and highlights future directions such as automated dataset scaling, richer metadata structuring, and interactive exercise formats built on top of the existing pipeline.
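
The retrieval idea described in the abstract (exercise images enriched with chapter, theme, and exercise-type metadata, embedded, and exposed to the tutor as a callable tool) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration and not the authors' implementation: the `Exercise` record, the file names, the `retrieve_images` function, and the toy bag-of-words `embed` function, which stands in for a real embedding model and vector store. The `IMAGE_TOOL` dictionary follows the general JSON-schema shape used for tool definitions in OpenAI function calling, but its name and parameters are hypothetical.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass
class Exercise:
    """One extracted textbook exercise (hypothetical record layout)."""
    image_path: str     # cropped exercise image (made-up file name)
    text: str           # OCR text after cleanup
    chapter: int
    theme: str
    exercise_type: str


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve_images(store, query, theme=None, top_k=1):
    """Filter by metadata, then rank the remaining exercises by similarity."""
    candidates = [e for e in store if theme is None or e.theme == theme]
    q = embed(query)
    ranked = sorted(candidates, key=lambda e: cosine(q, embed(e.text)), reverse=True)
    return [e.image_path for e in ranked[:top_k]]


# Hypothetical tool description the tutor agent could be given so that the
# LLM can request images during dialogue (names and fields are assumptions).
IMAGE_TOOL = {
    "type": "function",
    "function": {
        "name": "retrieve_images",
        "description": "Fetch textbook exercise images relevant to the current topic.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string"},
                "theme": {"type": "string"},
            },
            "required": ["query"],
        },
    },
}

# Made-up sample entries, not data from the paper.
STORE = [
    Exercise("ch1_ex3.png", "greetings moien wéi geet et", 1, "greetings", "dialogue"),
    Exercise("ch4_ex2.png", "food order at the restaurant menu", 4, "food", "fill-in"),
]
```

In this sketch, a call such as `retrieve_images(STORE, "restaurant menu")` ranks the stored exercises against the query and returns the path of the best match, while the optional `theme` argument narrows the candidates by metadata before ranking.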

Keywords: Multimodal AI tutoring, LAM interaction, RAG retrieval, human-centered design, Luxembourgish language learning

DOI: 10.54941/ahfe1007128
