Self-Hosted RAG for Institutional FAQs: Two-Level Abstention, Reliability, and CPU-Only Performance
Large Language Model (LLM)-based chatbots are used to streamline the handling of frequently asked questions in higher education; however, their institutional adoption is constrained by the risk of ungrounded responses (hallucinations) and by the need for traceability and quality control in sensitive domains. This study presents a domain-restricted chatbot prototype built on a Retrieval-Augmented Generation (RAG) architecture. The prototype combines semantic retrieval over publicly available institutional documentation with response generation by a self-hosted LLM, and adds control mechanisms that make the system abstain when sufficient evidence is unavailable. The evaluation comprises (i) quantitative testing on 451 queries (351 in-scope and 100 out-of-scope), (ii) end-to-end latency measurement by percentiles together with load testing, and (iii) a user acceptance assessment (n = 80). The system achieves Macro-F1 = 0.8212; correctness on answered in-scope queries of 89.81\% (291/324, excluding the 27 of 351 in-scope queries on which the system abstained); a hallucination rate of 4.17\% among evaluable generated responses (n = 336); abstention recall of 88\% on out-of-scope queries; an in-scope non-rejection rate of 92.31\% (324/351); and P95 latency of 4.67 s. Overall, the results support the feasibility of a RAG-based approach with abstention control for automating institutional queries in a higher education context, with the operational metrics reported above.
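
The reported rates can be cross-checked from the counts given above. The following is a minimal Python sketch of that arithmetic, not the authors' evaluation code. The variable names and the `pct` helper are illustrative. Two quantities are inferred rather than reported: the number of out-of-scope queries that received a generated answer (taken as 336 minus 324, which assumes the evaluable generated responses are exactly the answered in-scope queries plus the answered out-of-scope queries) and the hallucination count (back-computed from 4.17\% of 336).

```python
# Counts as reported in the abstract.
IN_SCOPE = 351
OUT_OF_SCOPE = 100
IN_SCOPE_ABSTAINED = 27
CORRECT_ANSWERED_IN_SCOPE = 291
EVALUABLE_GENERATED = 336
REPORTED_HALLUCINATION_PCT = 4.17


def pct(numerator: int, denominator: int) -> float:
    """Return numerator/denominator as a percentage rounded to two decimals."""
    return round(100.0 * numerator / denominator, 2)


# In-scope queries that received an answer (351 - 27 = 324).
answered_in_scope = IN_SCOPE - IN_SCOPE_ABSTAINED

# Correctness is measured over answered in-scope queries only.
correctness = pct(CORRECT_ANSWERED_IN_SCOPE, answered_in_scope)

# Share of in-scope queries that were not rejected.
non_rejection = pct(answered_in_scope, IN_SCOPE)

# ASSUMPTION: evaluable generated responses = answered in-scope queries
# plus out-of-scope queries that were answered instead of rejected.
answered_out_of_scope = EVALUABLE_GENERATED - answered_in_scope

# Abstention recall: share of out-of-scope queries that were rejected.
abstention_recall = pct(OUT_OF_SCOPE - answered_out_of_scope, OUT_OF_SCOPE)

# ASSUMPTION: the hallucination count is not reported; it is inferred
# from the reported rate and the number of evaluable responses.
implied_hallucinations = round(REPORTED_HALLUCINATION_PCT / 100 * EVALUABLE_GENERATED)
hallucination_rate = pct(implied_hallucinations, EVALUABLE_GENERATED)

if __name__ == "__main__":
    print(f"correctness (answered in-scope): {correctness}%")
    print(f"in-scope non-rejection rate:     {non_rejection}%")
    print(f"abstention recall:               {abstention_recall}%")
    print(f"hallucination rate (inferred n): {hallucination_rate}%")
```

Under these assumptions the recomputed values agree with the reported figures, which suggests the denominators are used consistently: correctness over answered in-scope queries, non-rejection over all in-scope queries, and hallucinations over all evaluable generated responses.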
