Knowledge Graph-Enhanced Large Language Model Framework for Privacy-Preserving Document Processing in the AEC Domain

AHFE International

Open Access

Article

Conference Proceedings

Authors: Fan Yang, Hazar Nicholas Dib, Jiansong Zhang

Abstract: Data privacy and safety are critical concerns for companies in the Architecture, Engineering, and Construction (AEC) domain, which routinely handle sensitive textual data such as design criteria, project specifications, and compliance records. Protecting this information is vital for maintaining competitive advantage, meeting legal requirements, and ensuring safety and accountability. However, processing such domain-specific data is challenging: rule-based systems require extensive manual rule sets, while supervised machine learning models need large annotated datasets, and both requirements limit scalability and applicability in AEC contexts. Recent advances in large language models (LLMs) offer a promising alternative because LLMs can perform natural language tasks with minimal supervision. Yet general-purpose LLMs raise two major concerns: they may generate inaccurate or irrelevant outputs on technical content, and their reliance on online services introduces significant privacy risks. To address these issues, this paper proposes a knowledge graph-enhanced LLM framework designed for local, privacy-preserving processing of sensitive AEC documents. Using the 2015 International Building Code (IBC) as an example, the framework operates in two stages. First, an LLM converts selected IBC chapters into a structured knowledge graph with 234 entities, 131 relationships, and 8 communities. Second, another LLM retrieves relevant context from the graph to generate accurate query responses. The system employs open-source models: nomic-embed-text for text embeddings and deepseek-r1 for context retrieval and generation. Evaluation on 661 query-answer-context records showed an average semantic similarity score of 0.83 and an average answer relevancy score of 0.71, indicating high accuracy and contextual alignment. The system runs entirely on a standalone machine, preserving full data privacy and incurring no cost. This work demonstrates a secure and effective approach for using LLMs in privacy-sensitive, domain-specific applications and lays the foundation for broader adoption in similar fields.
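
The retrieval stage and the semantic similarity metric described in the abstract can be illustrated with a small sketch. It is not the authors' implementation. The triples are hypothetical placeholders rather than entries from the paper's graph, `embed` is a bag-of-words stand-in for a real embedding model such as nomic-embed-text, and the function names (`retrieve`, `build_prompt`) are assumptions chosen for illustration. The sketch ranks graph triples by cosine similarity to a query and assembles the prompt that a locally hosted LLM would receive.

```python
import math
from collections import Counter

# Hypothetical triples for illustration only; not taken from the paper's graph.
TRIPLES = [
    ("exit access", "part_of", "means of egress"),
    ("exit", "part_of", "means of egress"),
    ("exit discharge", "part_of", "means of egress"),
    ("fire wall", "is_a", "fire-resistance-rated assembly"),
]


def embed(text):
    """Stand-in embedding: lowercase bag-of-words counts.

    Assumption: a real system would call an embedding model here.
    """
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def triple_to_text(triple):
    """Render a (subject, relation, object) triple as a short sentence."""
    subject, relation, obj = triple
    return f"{subject} {relation.replace('_', ' ')} {obj}"


def retrieve(query, triples, k=2):
    """Return the k triples whose text is most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(
        triples,
        key=lambda t: cosine_similarity(query_vec, embed(triple_to_text(t))),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query, context_triples):
    """Assemble the text a locally hosted LLM would be asked to complete."""
    context = "\n".join(f"- {triple_to_text(t)}" for t in context_triples)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


if __name__ == "__main__":
    question = "what is a fire wall"
    top = retrieve(question, TRIPLES, k=1)
    print(build_prompt(question, top))
```

The same `cosine_similarity` function, applied to embeddings of a generated answer and a reference answer, gives one common way to compute a semantic similarity score. Whether the paper computes its 0.83 average this way is not stated in the abstract.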

Keywords: Knowledge Graph, Building Code Interpretation, Large Language Models, Retrieval Augmented Generation, Data Privacy

DOI: 10.54941/ahfe1006921
