Impact of Retrievers on RAG System Performance for Question Answering

Large Language Models (LLMs) have achieved remarkable success across a wide range of natural language processing tasks, yet their deployment in production systems remains constrained by hallucination, knowledge cutoffs, and limited source attribution. Retrieval-Augmented Generation (RAG) addresses these issues by grounding model outputs in external evidence. However, RAG effectiveness depends critically on retriever quality, a component that remains insufficiently explored in controlled comparative studies. This work presents a systematic experimental evaluation of four representative retriever architectures (BM25, DPR, BGE, and ColBERT) across multiple RAG configurations. Using three modern open-source LLMs and standardized RAGAS metrics, we analyze performance, robustness, and computational trade-offs over 307,373 question-answer pairs from the Natural Questions dataset. Results demonstrate that retriever selection shifts RAG performance by up to ±20%, while the choice of LLM backbone yields only marginal variation. ColBERT achieves the highest overall accuracy, while BGE offers the best accuracy–latency trade-off. We conclude with practical, evidence-based guidelines for retriever selection under different application constraints, emphasizing that, within this experimental setting, retrieval is a first-class design consideration in RAG systems.

Carlos Augusto Oeiras
UFPA
Brazil

Rafhael da Silva Monteiro
UFPA
Brazil

Tiago Davi Oliveira de Araújo
ESAN-IEETA-UA
Portugal

Jefferson Magalhães De Morais
UFPA
Brazil