Responsible AI in Systematic Reviews: What the 2025 Cochrane Position and RAISE Actually Require

A working guide to the 2025 Cochrane AI position statement and the RAISE recommendations: what they require for transparency, validation, and reproducibility in evidence synthesis — and how to operationalize them in a 2026 review.

Mapped Methodology Team

The conversation about AI in systematic reviews has moved past whether to use it. By 2026, the question is what counts as defensible use. Two documents now anchor that answer: Cochrane's 2025 position on generative AI in evidence synthesis, and RAISE — the Responsible AI in Evidence Synthesis recommendations originating from methodologists at Bond University and collaborating institutions.

This is a working guide to what they require, where they overlap, where they diverge, and how to operationalize them inside a live review.

Why two frameworks, and why now?

Three factors forced the issue.

The first is volume. Marshall and Wallace (2019) estimated that the median systematic review screens roughly 1,800 records, with a long tail above 10,000. Living reviews and broad scoping reviews routinely cross 50,000. AI-assisted screening promises 30–70% workload reduction depending on tool and topic, and that promise is enough to make adoption inevitable.

The second is variance. The performance of an AI screening classifier is not a single number: it varies by topic, study type, language, and search composition. Marshall et al. (2018, 2023) and Khalil et al. (2022) repeatedly showed sensitivity dropping by 10–30 percentage points when tools trained on one corpus were applied to another. A tool that hits 99% recall on a cardiovascular disease (CVD) intervention review may miss one in five relevant records (80% recall) on a rare-disease prognostic review.

The third is the new generation of generative models. Where pre-2023 automation focused on classifiers trained per-review, GPT-class models can now follow natural-language inclusion criteria zero-shot. That is genuinely useful and genuinely dangerous: zero-shot models do not advertise their failure modes, and reviewers without methodological training may treat their output as authoritative.

Cochrane and RAISE exist because none of the prior standards and tools (PRISMA 2020 for reporting, ROBINS-I for risk of bias, GRADE for certainty of evidence) was designed for any of this.

What the 2025 Cochrane position actually says

Cochrane's position evolved across two documents. The September 2023 statement set a moratorium on generative AI for any task that produced text appearing in a Cochrane review. The 2025 update, which followed broad community consultation, replaced the moratorium with a conditional framework.

The conditions cluster into three:

| Requirement | What it means in practice |
| --- | --- |
| Independent validation | The AI tool's performance must be evidenced for the specific task and population. Cross-topic validation does not transfer. |
| Human decisional authority | A human reviewer makes every final inclusion/exclusion, extraction, and risk-of-bias judgment. AI may pre-screen, suggest, or rank, but never decide. |
| Full methodological reporting | Model name, version, prompt or configuration, validation evidence, and the human review applied to outputs must appear in the methods section. |

The 2025 position also expanded what is now permissible. Title/abstract pre-screening with a documented sensitivity threshold, structured data extraction for pre-defined fields (with human verification), search strategy peer review, and methodological writing assistance are all explicitly allowed under the conditions above.

What remains prohibited has narrowed but not vanished: generating verbatim conclusions, fabricating citations, replacing dual-reviewer screening without validation evidence demonstrating non-inferiority, and any AI use that the protocol does not declare in advance.

What RAISE adds

RAISE was drafted by Scott et al. through a consensus process running through 2024 and into 2025. It is not a permissions framework — it is a reporting framework, and that distinction matters.

Where Cochrane tells you what you may do inside a Cochrane review, RAISE tells any review (Cochrane or not) what to report so a reader can judge defensibility. Its 17 items are organized into five domains:

  1. Purpose — why AI was used and what it replaced or augmented
  2. Tool — model identity, version, vendor, training data lineage where known
  3. Task design — prompt or configuration, threshold settings, batch size, randomization
  4. Validation — sensitivity and specificity in the review's specific context, with sample size and ground-truth definition
  5. Human oversight — who reviewed AI output, how disagreements were resolved, and what proportion of AI judgments were overridden

The pragmatic value of RAISE is that it converts "we used AI" — a sentence that currently appears in roughly 14% of 2025 reviews per recent meta-research — into a structured statement that can be peer-reviewed, replicated, or contested.

Where they overlap and where they diverge

Reading the two side by side, the convergence is striking.

| Dimension | Cochrane 2025 | RAISE |
| --- | --- | --- |
| Validation required? | Yes, in domain | Yes, with reportable sample size |
| Human authority required? | Yes, on every decision | Yes, with override rate reported |
| Prompt/config disclosure? | Yes | Yes (item 7) |
| Pre-registration of AI plan? | Yes, in protocol | Recommended (item 2) |
| Applies to non-Cochrane reviews? | No (binding only on Cochrane) | Yes (domain-agnostic) |
| Enforcement | Cochrane editorial process | Voluntary, but adopted by JCE, BMJ EBM, others |

The substantive divergence is narrow and lives in two places. First, RAISE is silent on permissibility — it does not tell you that generating conclusions is prohibited; it tells you that if you do it, you must report it. Second, Cochrane is silent on quantitative thresholds — it does not specify what sensitivity counts as "validated" — while RAISE pushes reviewers to publish their numbers and let readers decide.

In practice, methodologically rigorous teams treat them as a single requirement: validate, oversee, report.

How to operationalize this in a live review

The framework is clear. The hard part is making it routine. Five practices, in priority order, cover the realistic obligations of a 2026 review.

1. Declare AI tasks in the protocol, not after the fact

Add a section to your PROSPERO or OSF protocol titled "Planned AI assistance." Specify (a) the tasks AI will perform, (b) the tool you intend to use, and (c) the validation evidence supporting that intended use. If the validation is missing, plan a pilot. The Cochrane position is explicit that retroactive AI disclosure does not satisfy the requirement.
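
One way to keep the "Planned AI assistance" section honest is to hold it as structured data and check it before registration. The sketch below is only an illustration: the `PlannedAITask` class, its field names, and the tool names are hypothetical placeholders, not part of the Cochrane position, RAISE, PROSPERO, or OSF.

```python
from dataclasses import dataclass


@dataclass
class PlannedAITask:
    """One entry in a hypothetical "Planned AI assistance" protocol section."""
    task: str                      # (a) what the AI will do
    tool: str                      # (b) the tool you intend to use
    validation_evidence: str = ""  # (c) citation or pilot report; empty if none yet


def tasks_needing_pilot(plan):
    """Return the tasks whose validation evidence is still missing."""
    return [entry.task for entry in plan if not entry.validation_evidence.strip()]


plan = [
    PlannedAITask(
        task="title/abstract pre-screening",
        tool="example-screener v1.0",  # placeholder name, not a real product
        validation_evidence="pilot on hand-labelled sample (see protocol appendix)",
    ),
    PlannedAITask(
        task="structured data extraction",
        tool="example-extractor v0.4",  # placeholder name, not a real product
    ),
]

needs_pilot = tasks_needing_pilot(plan)
print("Plan a pilot for:", needs_pilot)
```

Any task the check returns has no supporting evidence yet, which is the cue to schedule a pilot before the protocol is registered.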

2. Pilot validate on a domain-matched sample

Before turning AI loose on the full record set, label 200–500 records from your search by hand (or pull a labeled sample from a published review on a sufficiently similar topic). Run the AI on that sample. Report sensitivity at the threshold you plan to use. If sensitivity is below your protocol-specified floor (most reviews use 95–99%), the tool fails the validation gate and you fall back to manual screening or a different tool.
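
The validation gate can be sketched in a few lines. The example below assumes the pilot has already been tallied into true positives (relevant records the tool flagged) and false negatives (relevant records it missed). The Wilson score interval is our addition for illustration; neither framework is quoted here as requiring a particular interval method, and the counts are made up.

```python
import math


def sensitivity_with_wilson(true_pos, false_neg, z=1.96):
    """Return (sensitivity, lower, upper) using a Wilson score interval.

    true_pos:  relevant records the tool flagged for inclusion
    false_neg: relevant records the tool missed
    z:         normal quantile (1.96 gives an approximate 95% interval)
    """
    n = true_pos + false_neg
    if n == 0:
        raise ValueError("pilot sample contains no relevant records")
    p = true_pos / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return p, max(0.0, centre - half), min(1.0, centre + half)


def passes_gate(true_pos, false_neg, floor):
    """Point-estimate gate: sensitivity must meet the protocol-specified floor."""
    sens, _, _ = sensitivity_with_wilson(true_pos, false_neg)
    return sens >= floor


# Illustrative pilot: 40 relevant records in the labelled sample, 39 found.
sens, low, high = sensitivity_with_wilson(true_pos=39, false_neg=1)
print(f"sensitivity={sens:.3f}, 95% interval {low:.3f} to {high:.3f}")
print("passes 95% floor:", passes_gate(39, 1, floor=0.95))
print("passes 99% floor:", passes_gate(39, 1, floor=0.99))
```

In this illustration the point estimate (97.5%) clears a 95% floor, but the lower bound of the interval sits near 87%. The precision of a sensitivity estimate depends on how many relevant records the pilot contains, not on the total number labelled, so it is worth reporting that count alongside the estimate.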

For background on the metric choice itself, see Why 99% Recall Is the Floor for Screening Automation.

3. Preserve human authority at every decision point

The Cochrane and RAISE language on this is identical: a human makes the call. AI may rank, suggest, or pre-filter, but the inclusion/exclusion decision sits with a reviewer. In a workflow this looks like AI marking records as "likely include / likely exclude / uncertain," with a human reviewer confirming each inclusion and reviewing every "uncertain" record. Records the AI marks "likely exclude" are still subject to spot-check sampling at a documented rate (commonly 5–10%).

This is also where the choice between single- and dual-reviewer screening interacts with AI. AI-assisted single-reviewer screening with human spot-checks is methodologically defensible under RAISE; AI as a sole reviewer is not.
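
The routing described in this section can be sketched as follows. The label strings mirror the three categories above; the function name, the queue names, the 10% rate, and the fixed seed are assumptions chosen for the example, not values prescribed by either framework.

```python
import math
import random

VALID_LABELS = {"likely include", "likely exclude", "uncertain"}


def route_records(ai_labels, spot_check_rate=0.10, seed=2026):
    """Split records into human-review queues based on the AI's label.

    ai_labels maps record id -> one of VALID_LABELS. Every "likely include"
    and every "uncertain" record goes to a reviewer; a random sample of the
    "likely exclude" pile is spot-checked at the documented rate.
    """
    unknown = set(ai_labels.values()) - VALID_LABELS
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")

    confirm = [r for r, lab in ai_labels.items() if lab == "likely include"]
    review = [r for r, lab in ai_labels.items() if lab == "uncertain"]
    excluded = sorted(r for r, lab in ai_labels.items() if lab == "likely exclude")

    # Round before ceil so floating-point noise cannot add an extra record.
    n_check = math.ceil(round(len(excluded) * spot_check_rate, 9))
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    spot_check = rng.sample(excluded, n_check)
    return {"confirm": confirm, "review": review, "spot_check": spot_check}


# Illustrative input: 5 likely includes, 5 uncertain, 90 likely excludes.
ai_labels = {}
for i in range(100):
    if i < 5:
        label = "likely include"
    elif i < 10:
        label = "uncertain"
    else:
        label = "likely exclude"
    ai_labels[f"rec-{i:03d}"] = label

queues = route_records(ai_labels)
print({name: len(ids) for name, ids in queues.items()})
```

Recording the seed and the rate alongside the queues lets a peer reviewer reproduce exactly which excluded records were spot-checked.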

4. Track and report override rates

The override rate, the proportion of AI judgments a human reviewer reversed, is the single most informative metric for a peer reviewer trying to judge whether your AI use was responsible. Report it per task. A 1% override rate on 50,000 records means reviewers reversed about 500 AI judgments, so the tool and the humans largely agreed. A 25% override rate means the tool was wrong often enough that a reviewer should reconsider whether AI added value here at all.
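
Computing the rate per task is simple once each AI judgment is logged next to the human decision. The sketch below assumes a log of `(task, ai_judgment, human_judgment)` tuples; that log format and the sample counts are hypothetical. Only judgments a human actually reviewed can be counted, so the denominator should be stated when the rate is reported.

```python
from collections import defaultdict


def override_rates(decisions):
    """Per-task share of AI judgments that the human reviewer reversed.

    decisions: iterable of (task, ai_judgment, human_judgment) tuples,
    one per AI judgment that a human reviewed.
    """
    totals = defaultdict(int)
    overridden = defaultdict(int)
    for task, ai_judgment, human_judgment in decisions:
        totals[task] += 1
        if ai_judgment != human_judgment:
            overridden[task] += 1
    return {task: overridden[task] / totals[task] for task in totals}


# Illustrative log: 2 of 200 screening calls and 5 of 20 extractions reversed.
log = (
    [("screening", "exclude", "exclude")] * 198
    + [("screening", "exclude", "include")] * 2
    + [("extraction", "n=120", "n=120")] * 15
    + [("extraction", "n=120", "n=210")] * 5
)

rates = override_rates(log)
for task, rate in rates.items():
    print(f"{task}: {rate:.1%} of reviewed AI judgments overridden")
```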

5. Report everything in the methods, not just "AI was used"

Use the RAISE 17-item structure as a methods-section checklist. The minimum viable disclosure includes model and version (e.g., "Mapped's screening classifier v3.2, deployed 2026-04-15"), prompt or threshold (e.g., "operating at 0.65 cutoff, calibrated for ≥99% recall"), validation evidence (sample size, sensitivity, specificity, source), and human oversight protocol (who reviewed what, override rate, dispute resolution).
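
A small completeness check can stop an under-specified disclosure from reaching the manuscript. The sketch below covers only the four minimum elements named above, not the full RAISE item list; the `AIDisclosure` class and its field names are our own illustration, and the bracketed strings are placeholders to be replaced with real values.

```python
from dataclasses import dataclass, fields


@dataclass
class AIDisclosure:
    """The four minimum disclosure elements for one AI-assisted task."""
    model_and_version: str
    prompt_or_threshold: str
    validation_evidence: str
    human_oversight: str


def missing_fields(disclosure):
    """Names of disclosure elements that are still empty."""
    return [f.name for f in fields(disclosure)
            if not getattr(disclosure, f.name).strip()]


def methods_paragraph(disclosure):
    """Render a methods-section paragraph, refusing incomplete disclosures."""
    missing = missing_fields(disclosure)
    if missing:
        raise ValueError("incomplete disclosure: " + ", ".join(missing))
    return (
        f"AI tool: {disclosure.model_and_version}. "
        f"Configuration: {disclosure.prompt_or_threshold}. "
        f"Validation: {disclosure.validation_evidence}. "
        f"Human oversight: {disclosure.human_oversight}."
    )


draft = AIDisclosure(
    model_and_version="Mapped's screening classifier v3.2, deployed 2026-04-15",
    prompt_or_threshold="operating at 0.65 cutoff, calibrated for ≥99% recall",
    validation_evidence="[sample size, sensitivity, specificity, source]",
    human_oversight="",  # not yet written: who reviewed what, override rate, disputes
)

print("Still missing:", missing_fields(draft))
```

Calling `methods_paragraph(draft)` at this point raises a `ValueError`, which is the intended behavior: the paragraph is only generated once every element has been filled in.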

Where this leaves industry tools

The 2025 Cochrane position effectively split the AI-assisted review tooling market into two camps.

The first camp ships AI features without validation evidence specific to the task or domain. Several mainstream platforms — Covidence, Rayyan, DistillerSR — have published or referenced general performance numbers, but the Cochrane position now requires per-review validation regardless of the vendor's marketing claims. That requirement is on the reviewer, not the vendor.

The second camp publishes per-task validation methodology and exposes the validation interface to users — meaning the tool runs a pilot on your labeled sample and reports sensitivity before it screens at scale. This is the model mapped's screening pipeline implements: every project runs a calibration pass on a user-labeled or imported sample and surfaces the operating characteristic before the AI screens any unseen records.

Neither camp removes the reviewer's responsibility. The position requires the human to validate the validation — that is, to confirm that the labeled sample was representative, the sensitivity threshold was protocol-defined, and the override rate was acceptable. Vendor evidence is necessary but not sufficient.

What this means for your next review

If your protocol is in design phase right now, three concrete steps:

  1. Add a "Planned AI assistance" section to the protocol and specify validation evidence per task. Use the RAISE 17-item checklist as the table of contents.
  2. Set the sensitivity floor explicitly. Most reviewers in 2026 use 99% for primary screening, 95% for exploratory or scoping, and 99.5% for high-stakes clinical guidelines.
  3. Decide your override-rate red line in advance. Above some threshold (commonly 15–20%), the team agrees to abandon AI-assisted screening for that task and revert to manual.

If your protocol is already locked and the AI plan is missing, you have two options. The cleaner one is a documented protocol amendment, with date and rationale, made before AI screening begins. The less clean but defensible one is a methods-section disclosure that names the deviation, explains why, and presents the validation evidence retrospectively. General peer reviewers will usually accept the latter; peer reviewers who specialize in AI methods increasingly will not.

Further reading

  • Scott AM, et al. RAISE: Responsible AI in Evidence Synthesis recommendations. Preprint and JCE peer-reviewed version, 2024–2025.
  • Cochrane. Position statement on the use of artificial intelligence in evidence synthesis (2025 update). cochrane.org/about-us/our-policies.
  • Marshall IJ, Wallace BC. Toward systematic review automation: a practical guide to using machine learning tools in research synthesis. Systematic Reviews, 2019.
  • Khalil H, et al. Tools to support the automation of systematic reviews: a scoping review. Journal of Clinical Epidemiology, 2022.
  • Page MJ, et al. The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. BMJ, 2021.
  • Affengruber L, et al. Selecting the best evidence: a comparison of search filters for rapid reviews. JCE, 2024.

For the practical decision of which AI-assisted tasks to enable in your specific review, see the three-axis decision framework. For the underlying screening metrics, see Why 99% Recall Is the Floor. For the broader trajectory, see How AI is Transforming Systematic Reviews.
