Reliable Information Retrieval with LLMs: Automated Analysis and Comparison of Large PDF Documents
Open Access Article, Conference Proceedings
Authors: Lara Noe, Sebastian Breier, Ruben Nuredini
Abstract: In high-stakes professional settings such as the insurance industry, extracting information from complex PDF documents is critical yet challenging due to document length and technical language. While large language models (LLMs) offer new opportunities for automating document understanding, their usefulness depends on output accuracy, transparency, and verifiability. In critical contexts, users must be able to trace information back to credible sources to reduce risks from hallucinations or misinterpretation. This research explores how LLMs can support reliable, transparent information retrieval (IR) from complex documents. We introduce five IR system variants designed to iteratively improve LLM outputs through better context preservation and fine-grained source attribution. These systems incorporate enhancements such as Markdown-based parsing, retrieval-augmented generation (RAG), and advanced document preprocessing. The final iteration integrates Multi-view Content-aware (MC) indexing, which supports semantically targeted retrieval using keyword, summary, and raw-text views. To evaluate performance, we develop a domain-specific benchmark with curated insurance documents, ground-truth answers, and performance metrics assessing answer accuracy, hallucination rate, and source attribution precision. Results show that systems with content-aware chunking or MC-Indexing outperform earlier versions in accuracy and attribution, though at the cost of added complexity. Our findings highlight the value of structure-preserving preprocessing, targeted retrieval, and source transparency in developing trustworthy AI tools for document analysis. Future work may explore automated verification loops and user-guided retrieval to further improve interpretability and reliability.
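The multi-view retrieval idea described in the abstract, scoring a query against keyword, summary, and raw-text views of each chunk while keeping a pointer back to the source for attribution, can be sketched roughly as follows. This is a toy illustration with simple lexical-overlap scoring; the corpus, names, and scoring function are illustrative assumptions, not the paper's actual implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    """One document chunk with three retrieval views (MC-Indexing style)."""
    source: str     # attribution target, e.g. "policy.pdf, section 2"
    raw_text: str   # raw-text view
    summary: str    # summary view
    keywords: set = field(default_factory=set)  # keyword view

def tokenize(text: str) -> set:
    return {t.strip(".,!?;:").lower() for t in text.split()}

def score(query_tokens: set, chunk: Chunk) -> int:
    # Score the query against each view independently and keep the best
    # overlap, so a chunk can match via keywords, summary, or raw text.
    views = (
        tokenize(chunk.raw_text),
        tokenize(chunk.summary),
        {k.lower() for k in chunk.keywords},
    )
    return max(len(query_tokens & view) for view in views)

def retrieve(query: str, chunks: list, top_k: int = 1) -> list:
    """Return (source, raw_text) pairs for the top-scoring chunks."""
    q = tokenize(query)
    ranked = sorted(chunks, key=lambda c: score(q, c), reverse=True)
    return [(c.source, c.raw_text) for c in ranked[:top_k]]

# Hypothetical toy corpus for illustration only.
chunks = [
    Chunk("policy.pdf, section 2",
          "The deductible is 500 euros per claim.",
          "Deductible amount", {"deductible"}),
    Chunk("policy.pdf, section 7",
          "Coverage excludes flood damage.",
          "Coverage exclusions", {"flood", "exclusion"}),
]
best = retrieve("What is the deductible?", chunks)
```

Returning the chunk's `source` alongside its text is what makes the answer verifiable: a downstream LLM can quote the passage and cite the section it came from, which is the fine-grained attribution the abstract emphasizes.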
Keywords: Large language models, Information retrieval, Document understanding, Retrieval-augmented generation
DOI: 10.54941/ahfe1006772