Living review · live

Hallucination Detection and Mitigation in AI-Generated Scientific Content

1,041 papers covered · updated 28 June 2026 · updates weekly

AI-generated living evidence synthesis — not a peer-reviewed systematic review. Methods below.

What changed

Baseline established — 1,041 papers covered, 6 open contradictions identified.

State of the field

The corpus is large but heavily weighted toward narrative and conceptual reviews plus single-domain empirical audits, with relatively few rigorous, comparable experiments. A substantial cluster of empirical studies establishes that citation/reference fabrication is the most measured and best-documented form of hallucination in AI-generated scientific content, with measured rates spanning roughly 11% to 59% depending on model, prompt, and domain [W7156094403][W7138917524][W7134016756][W7140737014][W7159637053][W7138876490][W7163598450]. Large-scale bibliometric evidence indicates these fabrications have entered the published literature at scale, with a conservative estimate of ~146,932 hallucinated citations in 2025 alone [W7160968136]. A parallel conceptual literature converges on taxonomies of hallucination types (factual, citation, interpretive, contextual) and recurrent root causes in next-token probabilistic generation, training-data gaps, and lack of grounded reasoning [W7115903291][W7125472384][W7128549412][W7125480319][W7130643271][W7164570225]. On mitigation, retrieval-augmented generation (RAG) is the single most frequently endorsed strategy, often combined with post-generation verification, knowledge graphs, uncertainty/calibration methods, and human oversight [W7141298873][W7163002159][W7134893933][W4393160124][W7152624040], but multiple sources stress RAG is not a complete solution [W7163002159][W4409716856][W4399316968]. A newer, more technical strand proposes and benchmarks concrete detection systems—retrieval-grounded citation verifiers, graph-consistency checks, neuron-level localization, rejection sampling, and model-agnostic risk scoring [W7162817986][W7160847460][W7163598450][W7155452254][W7164090112][W7163595596][W7165485900][W7138917524]. Cross-cutting framing extends to governance, research-misconduct classification, and human-in-the-loop verification labor [W7136066094][W4390357060][W7165663373][W7164915559]. 
Overall, the field is descriptively rich and growing rapidly, but it suffers from benchmark fragmentation, inconsistent definitions, and few head-to-head mitigation comparisons [W7141298873][W7164458772][W7160181021].

Key findings

Citation fabrication is pervasive, and its rate varies enormously by model and prompt: empirical audits report ~11-22% [W7156094403], ~12-21% across GPT-4o/Claude/LLaMA [W7138917524], and an up-to-fivefold cross-model spread, from 11.4% (GPT-5-mini) to 56.8% (haiku-4.5) [W7134016756]; one medical study found 58.5% of 1,100 ChatGPT citations fabricated [W7140737014].
Hallucinated citations are not random but structured: they are patterned recombinations of real authors, journals, and keywords, with duplication in nearly 30% of cases [W7155247453]; in one sample of NeurIPS fabrications, 100% exhibited compound deception and 66% were total fabrications invented wholesale [W7128096715].
Citation generation may be an induced rather than intrinsic behavior: no model spontaneously produced formal citations when unprompted (0 of 3,030 responses), and prompts for 'recent and influential' references yielded higher fabrication (74.1%) than 'seminal' ones (55.0%) [W7134016756].
Multi-model consensus is a useful signal: agreement among three or more LLMs sharply raises the probability that a citation is real [W7134016756].
RAG reduces but does not eliminate hallucination: commercial RAG-based legal tools still hallucinate at rates of 17-33% despite 'hallucination-free' marketing claims [W4409716856][W4399316968], and RAG failures arise across query formulation, retrieval, evidence aggregation, and grounding [W7163002159].
Hybrid and retrieval-grounded detection pipelines perform strongly: a database+fuzzy+LLM pipeline reached ~80% precision (15-20% over database-only) [W7138917524]; CiteCheck reached 88.7 macro-F1 outperforming GPT/Claude/Gemini baselines [W7162817986]; a Citation-Grounding legal graph fine-tune reached 98.5% validation accuracy [W7163598450].
Hybrid mitigation stacks outperform single methods: an RLHF+RAG hybrid achieved 91.5% accuracy, exceeding either method used alone, against benchmark hallucination rates of 18.7-34.2% [W7134893933]; benchmarking shows that retrieval methods improve grounding at a latency cost, while prompt-based methods are lightweight but less robust [W7160181021].
Hallucination is increasingly framed as a family of failures across model, pipeline, and human-governance levels rather than a single phenomenon, requiring layered mitigation [W7141298873][W7152624040][W7164450790].
Hallucination signals are field-specific: author-name fields fail most, and probes trained on one bibliographic field transfer near-chance to others, yet targeted neuron suppression reduces fabrication without external retrieval [W7155452254].
Human oversight is repeatedly identified as indispensable, with citations potentially constituting research misconduct when they function as data [W7136066094], human-based factuality assessment recommended before publication [W4388642569], and 'Verify-and-Validate' / hybrid human-AI workflows proposed [W7163028587][W7162543742][W7165663373].
Model rankings on citation accuracy differ markedly across domains: DeepSeek achieved 92.0% accuracy versus ChatGPT's 19.4% in glaucoma [W7140602646], while ChatGPT and Perplexity showed significantly lower hallucination severity than Gemini and DeepSeek in dental trauma [W7164915559].
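
Two of the signals reported above, fuzzy matching of a generated citation against a bibliographic index and agreement among several models, can be illustrated with a minimal sketch. It is not the pipeline evaluated in any of the cited studies: the function names, the in-memory `index` list, and the 0.9 similarity threshold are all assumptions made for illustration, and a real system would query a bibliographic database and add an LLM adjudication stage.

```python
from difflib import SequenceMatcher


def normalize(title: str) -> str:
    """Lowercase and replace punctuation with spaces so trivial differences do not block a match."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in title)
    return " ".join(cleaned.split())


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] between two normalized titles."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def classify(title: str, index: list[str], threshold: float = 0.9) -> str:
    """Label a generated title against a hypothetical local index of known titles.

    Returns "exact" if the normalized title is in the index, "fuzzy" if the
    closest entry reaches the (assumed) threshold, and "unverified" otherwise.
    """
    if normalize(title) in {normalize(known) for known in index}:
        return "exact"
    best = max((similarity(title, known) for known in index), default=0.0)
    return "fuzzy" if best >= threshold else "unverified"


def consensus_count(title: str, model_outputs: dict[str, list[str]],
                    threshold: float = 0.9) -> int:
    """Count how many models produced at least one citation matching this title."""
    return sum(
        any(similarity(title, candidate) >= threshold for candidate in titles)
        for titles in model_outputs.values()
    )
```

Under these assumptions, a title labelled "unverified" and cited by only one model would be a candidate for manual checking, while a title with a database match or agreement from three or more models would be lower priority. Neither signal checks whether the source supports the claim it is cited for.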

Contradictions

Whether citation hallucination is intrinsic versus induced: one large audit argues fabrication is prompt-induced (zero spontaneous citations unprompted) [W7134016756], whereas medical and review sources treat hallucination as a structural/intrinsic property of autoregressive generation [W7152624040][W7140737014].
Whether domain-specialized fine-tuning helps or hurts: a medical-imaging review finds general-purpose models outperform medical-specialized models due to overfitting-induced confabulation [W7164700945], while a glaucoma study found the 'biomedically enriched' DeepSeek most accurate [W7140602646].
Inconsistent model performance rankings across studies: DeepSeek best and ChatGPT worst in one domain [W7140602646], but ChatGPT/Perplexity superior to DeepSeek/Gemini in another [W7164915559], and yet another reports Grok and DeepSeek outperforming ChatGPT [W7159637053]—no stable cross-domain ordering.
Whether structural/graph-topology signals reliably detect hallucination: graph-based detectors are presented as effective [W7160847460][W7163598450][W4393160124], but evidence-graph analysis shows the approach reverses for stronger models (GPT-4 hallucinations score higher than grounded answers), making detection effectiveness model-dependent [W7164090112].
Whether RAG is an adequate remedy: it is widely endorsed as the primary mitigation [W7141298873][W7130643271][W7134893933], yet this is directly contradicted by evidence that RAG-based commercial tools still hallucinate at rates of 17-33% and should not be treated as a complete solution [W4409716856][W7163002159].
Whether AI-generated scholarly outputs can rival human quality: an automated systematic-review study found expert reviewers preferred semi-automated and fully-automated reviews over the human review [W7131073722], conflicting with the dominant narrative of AI unreliability for scholarly synthesis [W7140737014][W4417503485].

Open gaps

The most prevalent and harmful variant—real citations deployed to support claims the source does not actually make (semantic/contextual hallucination)—remains largely undetected by current title-matching pipelines [W7160968136][W7128096715][W4417503485].
Benchmark fragmentation and inconsistent definitions/metrics prevent reliable cross-study comparison of mitigation effectiveness [W7141298873][W7164458772][W7160181021][W7115903291].
Lack of domain-specific datasets and shared benchmarks for scholarly hallucination beyond initiatives like SciHal25 [W7115903291][W7125472384].
Distinguishing genuine fabrication from contamination inheritance (reproduction of pre-existing erroneous citations in training data) and tracing provenance [W7128096715].
Detection methods generalize poorly across bibliographic fields, model architectures, and domains, with most validation on narrow samples (e.g., physics, single model families) [W7155452254][W7162817986][W7164090112].
Verification methods are weak or absent for less-structured domains lacking citation indices (government, legal, clinical documentation) [W7160968136].
Human factors are under-studied: user skepticism and verification behavior are rarely measured separately from reliance, and the most deployable interventions (hallucination warnings) show weak/mixed effects [W7165663373].
Few longitudinal or naturalistic studies track how hallucination patterns evolve across model generations or how repeated exposure affects detection [W7128096715][W7141298873].
Need for harm-weighted/consequence-adjusted evaluation, calibrated uncertainty, multilingual generalisability, and operational governance/regulatory standards [W7152624040][W7141298873][W7164700945][W4414644854].

Methods & scope

Inclusion: cosine similarity ≥ 0.5 to the agenda-item embedding and relevance score ≥ 6; synthesised from the top 50 best-evidenced papers. Membership is frozen (same query each cycle); the synthesis is regenerated incrementally and fully rebaselined periodically.
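
The inclusion rule can be sketched as a two-threshold filter followed by a top-k cut. This is an illustrative reading of the criteria, not the review's actual code: the `Paper` fields, the `evidence_score` used to rank "best-evidenced" papers, and the function names are hypothetical, and the scale of the relevance score is not stated in the methods.

```python
import math
from dataclasses import dataclass


@dataclass
class Paper:
    paper_id: str
    embedding: list[float]   # assumed: same dimensionality as the agenda-item embedding
    relevance: int           # assumed: integer relevance score; scale not stated in the source
    evidence_score: float    # hypothetical ranking key for "best-evidenced"


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_members(papers: list[Paper], agenda_embedding: list[float],
                   min_cosine: float = 0.5, min_relevance: int = 6) -> list[Paper]:
    """Keep papers that pass both the similarity and the relevance threshold."""
    return [
        p for p in papers
        if cosine(p.embedding, agenda_embedding) >= min_cosine
        and p.relevance >= min_relevance
    ]


def synthesis_set(members: list[Paper], top_k: int = 50) -> list[Paper]:
    """Take the top_k members by the (hypothetical) evidence score."""
    return sorted(members, key=lambda p: p.evidence_score, reverse=True)[:top_k]
```

With the defaults above, a paper must clear both thresholds to count toward membership, and only the 50 highest-ranked members feed the synthesis.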

Synthesis by Claude Opus 4.8; membership gating by Claude Haiku.

Revision history

v1 · 28 June 2026 · baseline (1,041 papers)
