What AI brings to research

RAG … grounds every answer in verifiable sources and mitigates the factual errors and hallucinations seen in standalone LLMs. Gavrilova & Galli (2026), Evidence-Based Dentistry
13,061 papers · continuously updated · last export: 4 Jul 2026 · livingmeta.ai
Living Reviews

Continuously updated evidence syntheses. Each review auto-updates as new papers are extracted.

1 living review · 1,041 papers covered · updates weekly · last updated 28 June 2026

Hallucination Detection and Mitigation in AI-Generated Scientific Content

1,041 papers covered · 6 contradictions · updated 28 June 2026

The corpus is large but weighted heavily toward narrative and conceptual reviews and single-domain empirical audits; rigorous, comparable experiments are relatively few. A substantial cluster of empirical studies establishes citation/reference fabrication as the most frequently measured and best-documented form of hallucination in AI-generated scientific content, with reported rates ranging from roughly 11% to 59% depending on model, prompt, and domain [W7156094403][W7138917524][W7134016756][W7140737014][W7159637053][W7138876490][W7163598450]. Large-scale bibliometric evidence indicates that these fabrications have entered the published literature at scale, with a conservative estimate of ~146,932 hallucinated citations in 2025 alone [W7160968136]. A parallel conceptual literature converges on taxonomies of hallucination types (factual, citation, interpretive, contextual) and on recurrent root causes: next-token probabilistic generation, training-data gaps, and a lack of grounded reasoning [W7115903291][W7125472384][W7128549412][W7125480319][W7130643271][W7164570225-N/A]. On mitigation, retrieval-augmented generation (RAG) is the most frequently endorsed strategy, often combined with post-generation verification, knowledge graphs, uncertainty/calibration methods, and human oversight [W7141298873][W7163002159][W7134893933][W4393160124][W7152624040], but multiple sources stress that RAG is not a complete solution [W7163002159][W4409716856][W4399316968]. A newer, more technical strand proposes and benchmarks concrete detection systems: retrieval-grounded citation verifiers, graph-consistency checks, neuron-level localization, rejection sampling, and model-agnostic risk scoring [W7162817986][W7160847460][W7163598450][W7155452254][W7164090112][W7163595596][W7165485900][W7138917524]. Cross-cutting work extends the framing to governance, research-misconduct classification, and human-in-the-loop verification labor [W7136066094][W4390357060][W7165663373][W7164915559-N/A].
Overall, the field is descriptively rich and growing rapidly, but it suffers from benchmark fragmentation, inconsistent definitions, and a scarcity of head-to-head mitigation comparisons [W7141298873][W7164458772][W7160181021].
