
Quantifying Reference Hallucinations in LLM-Generated Systematic-Review Writing: Effects of Retrieval and Hard Grounding in the SYNERGY Benchmark

Large language models (LLMs) are increasingly used to draft scientific text for evidence synthesis, yet hallucinated bibliographic references remain a major threat to verifiability and reproducibility. We quantify reference hallucinations in LLM-generated Background/Related Work prose for systematic reviews (SRs) by measuring whether generated citations are grounded in a review-specific closed evidence set, whether the bibliographic identifiers provided are valid, and whether the model reliably produces valid structured outputs. Using 10 topics from the SYNERGY benchmark, we evaluated two Gemini models under a controlled multi-run design and three conditions: (A) no retrieval, (B) managed retrieval via Gemini File Search over a topic-specific corpus containing only included studies, and (C) retrieval with a hard "cite-only-from-corpus" constraint requiring an internal document identifier and an evidence snippet for each reference. We score (i) the in-corpus citation rate against the gold included set, (ii) the DOI/PMID invalidation rate via Crossref and PubMed, (iii) the identifier abstention (UNKNOWN) rate, (iv) constraint violations under strict grounding, and (v) structured-output validity. Across both models, retrieval sharply improves citation grounding relative to no retrieval, and strict grounding yields perfect in-corpus citation behavior among valid outputs, with zero constraint violations. However, end-to-end reliability depends not only on grounding but also on structural output compliance: Gemini 3.1 Pro Preview produces valid structured outputs more consistently and maintains a stronger reliability profile than Gemini 3 Flash Preview under the same controlled setup. These findings show that managed retrieval substantially mitigates reference hallucinations, while strict grounding and robust structured-output compliance are both necessary for dependable scientific writing systems that support systematic reviews.
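As an illustration of metrics (i) and (iii), the following is a minimal sketch of how an in-corpus citation rate and an identifier abstention rate might be computed for a single generated output. It is not the authors' scoring code. The function names `normalize_doi` and `score_run`, the choice of DOI as the matching key, and the use of all generated references as the denominator are assumptions made for this example; the abstract does not specify them.

```python
from typing import Dict, Iterable, List

# Sentinel a model emits when it declines to supply an identifier
# (the abstract names this the UNKNOWN rate; the exact string is assumed).
UNKNOWN = "UNKNOWN"


def normalize_doi(doi: str) -> str:
    """Lowercase a DOI and strip common URL/scheme prefixes (hypothetical helper)."""
    doi = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi


def score_run(cited_dois: List[str], gold_dois: Iterable[str]) -> Dict[str, float]:
    """Score one output against the gold included set.

    Assumption: both rates use the total number of generated
    references as the denominator.
    """
    gold = {normalize_doi(d) for d in gold_dois}
    total = len(cited_dois)
    if total == 0:
        # No references generated: report zero rates instead of dividing by zero.
        return {"in_corpus_rate": 0.0, "abstention_rate": 0.0}
    abstained = sum(1 for d in cited_dois if d.strip().upper() == UNKNOWN)
    in_corpus = sum(1 for d in cited_dois if normalize_doi(d) in gold)
    return {
        "in_corpus_rate": in_corpus / total,
        "abstention_rate": abstained / total,
    }


if __name__ == "__main__":
    # Made-up identifiers for illustration only; not data from the study.
    gold_set = ["10.1000/a", "10.1000/b"]
    cited = ["10.1000/a", "https://doi.org/10.1000/B", "UNKNOWN", "10.9999/zzz"]
    print(score_run(cited, gold_set))
```

Checking identifier validity against Crossref or PubMed (metric ii) would need network lookups and is deliberately left out of this sketch.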

Carlos Zapata
Center for Research in Mathematics
Mexico

Jezreel Mejía-Miranda
Center for Research in Mathematics
Mexico

Mirna Muñoz
Center for Research in Mathematics
Mexico