Does the 2025 Cochrane position statement allow generative AI in systematic reviews?

Yes, but conditionally. Cochrane permits AI-assisted tasks where (a) the tool's performance has been independently validated against the use case, (b) human reviewers retain decisional authority, and (c) the AI's role is fully reported in the methods. Generating verbatim conclusions, fabricating citations, or replacing dual-reviewer screening without validation evidence remains prohibited.

What is RAISE and how is it different from the Cochrane position?

RAISE (Responsible AI in Evidence Synthesis) is a multi-stakeholder reporting framework drafted by methodologists at Bond University and collaborators. Where the Cochrane position sets boundaries for Cochrane reviews, RAISE provides a domain-agnostic checklist of items that any review using AI should report. The two are complementary: Cochrane defines what is permitted; RAISE defines what is reportable.

Do I need to validate an AI screening tool before using it in my review?

If the tool's published validation does not match your domain, study type, and language, then yes — at minimum, perform a pilot validation on a labeled sample of 200–500 records from your search and report sensitivity (recall) at the threshold you intend to use. The 2025 Cochrane position is explicit that 'validation transferability cannot be assumed across topics.'

Can I use ChatGPT to screen titles and abstracts?

General-purpose LLMs without domain fine-tuning have not been independently validated for screening at the recall thresholds (≥95–99%) most reviews require. They can be used as a second look, a triage layer, or to draft inclusion/exclusion justifications, but neither RAISE nor the 2025 Cochrane position currently supports them as a sole or replacement reviewer for primary screening.

How should I report AI use in my methods section?

Report the model name and version, the task it performed, the prompt or configuration used, the validation evidence supporting that task, the human review applied to its outputs, and any errors or disagreements. RAISE provides a structured 17-item checklist. PRISMA 2020 also asks authors to describe any automation tools used, under its "selection process" and "data collection process" items.

Responsible AI in Systematic Reviews: What the 2025 Cochrane Position and RAISE Actually Require

The conversation about AI in systematic reviews has moved past whether to use it. By 2026, the question is what counts as defensible use. Two documents now anchor that answer: Cochrane's 2025 position on generative AI in evidence synthesis, and RAISE — the Responsible AI in Evidence Synthesis recommendations originating from methodologists at Bond University and collaborating institutions.

This is a working guide to what they require, where they overlap, where they diverge, and how to operationalize them inside a live review.

Why two frameworks, and why now?

Three factors forced the issue.

The first is volume. Marshall and Wallace (2019) estimated that the median systematic review screens roughly 1,800 records, with a long tail above 10,000. Living reviews and broad scoping reviews routinely cross 50,000. AI-assisted screening promises 30–70% workload reduction depending on tool and topic, and that promise is enough to make adoption inevitable.

The second is variance. The performance of an AI screening classifier is not a single number — it varies by topic, study type, language, and search composition. Marshall et al. (2018, 2023) and Khalil et al. (2022) repeatedly showed sensitivity dropping by 10–30 percentage points when tools trained on one corpus were applied to another. A tool that hits 99% recall on a CVD intervention review may miss one in five relevant records on a rare-disease prognostic review.

The third is the new generation of generative models. Where pre-2023 automation focused on classifiers trained per-review, GPT-class models can now follow natural-language inclusion criteria zero-shot. That is genuinely useful and genuinely dangerous: zero-shot models do not advertise their failure modes, and reviewers without methodological training may treat their output as authoritative.

Cochrane and RAISE exist because none of the prior standards and tools (PRISMA 2020 for reporting, ROBINS-I for risk of bias, GRADE for certainty of evidence) was designed for any of this.

What the 2025 Cochrane position actually says

Cochrane's position evolved across two documents. The September 2023 statement set a moratorium on generative AI for any task that produced text appearing in a Cochrane review. The 2025 update, which followed broad community consultation, replaced the moratorium with a conditional framework.

The conditions cluster into three:

Requirement	What it means in practice
Independent validation	The AI tool's performance must be evidenced for the specific task and population. Cross-topic validation does not transfer.
Human decisional authority	A human reviewer makes every final inclusion/exclusion, extraction, and risk-of-bias judgment. AI may pre-screen, suggest, or rank — never decide.
Full methodological reporting	Model name, version, prompt or configuration, validation evidence, and the human review applied to outputs must appear in the methods section.

The 2025 position also expanded what is now permissible. Title/abstract pre-screening with a documented sensitivity threshold, structured data extraction for pre-defined fields (with human verification), search strategy peer review, and methodological writing assistance are all explicitly allowed under the conditions above.

What remains prohibited has narrowed but not vanished: generating verbatim conclusions, fabricating citations, replacing dual-reviewer screening without validation evidence demonstrating non-inferiority, and any AI use that the protocol does not declare in advance.

What RAISE adds

RAISE was drafted by Scott et al. through a consensus process running through 2024 and into 2025. It is not a permissions framework — it is a reporting framework, and that distinction matters.

Where Cochrane tells you what you may do inside a Cochrane review, RAISE tells any review (Cochrane or not) what to report so a reader can judge defensibility. Its 17 items are organized into five domains:

Purpose — why AI was used and what it replaced or augmented
Tool — model identity, version, vendor, training data lineage where known
Task design — prompt or configuration, threshold settings, batch size, randomization
Validation — sensitivity and specificity in the review's specific context, with sample size and ground-truth definition
Human oversight — who reviewed AI output, how disagreements were resolved, and what proportion of AI judgments were overridden

The pragmatic value of RAISE is that it converts "we used AI" — a sentence that currently appears in roughly 14% of 2025 reviews per recent meta-research — into a structured statement that can be peer-reviewed, replicated, or contested.

Where they overlap and where they diverge

Reading the two side by side, the convergence is striking.

Dimension	Cochrane 2025	RAISE
Validation required?	Yes, in domain	Yes, with reportable sample size
Human authority required?	Yes, on every decision	Yes, with override rate reported
Prompt/config disclosure?	Yes	Yes (item 7)
Pre-registration of AI plan?	Yes, in protocol	Recommended (item 2)
Applies to non-Cochrane reviews?	No (binding only on Cochrane)	Yes (domain-agnostic)
Enforcement	Cochrane editorial process	Voluntary, but adopted by JCE, BMJ EBM, others

The substantive divergence is narrow and lives in two places. First, RAISE is silent on permissibility — it does not tell you that generating conclusions is prohibited; it tells you that if you do it, you must report it. Second, Cochrane is silent on quantitative thresholds — it does not specify what sensitivity counts as "validated" — while RAISE pushes reviewers to publish their numbers and let readers decide.

In practice, methodologically rigorous teams treat them as a single requirement: validate, oversee, report.

How to operationalize this in a live review

The requirements are clear. The hard part is making them routine. Five practices, in priority order, cover the realistic obligations of a 2026 review.

1. Declare AI tasks in the protocol, not after the fact

Add a section to your PROSPERO or OSF protocol titled "Planned AI assistance." Specify (a) the tasks AI will perform, (b) the tool you intend to use, and (c) the validation evidence supporting that intended use. If the validation is missing, plan a pilot. The Cochrane position is explicit that retroactive AI disclosure does not satisfy the requirement.
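As a rough illustration of that protocol section, the three declared elements can be kept as structured data so that any task lacking validation evidence is flagged for a pilot. This is a minimal sketch: the class name, field names, and tool names are hypothetical and are not taken from the Cochrane position, PROSPERO, or OSF.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PlannedAITask:
    """One row of a hypothetical "Planned AI assistance" protocol section."""
    task: str                                   # (a) what the AI will do
    tool: str                                   # (b) intended tool and version
    validation_evidence: Optional[str] = None   # (c) supporting evidence, if any


def tasks_needing_pilot(plan: list[PlannedAITask]) -> list[str]:
    """Return the tasks declared without validation evidence."""
    return [t.task for t in plan if not t.validation_evidence]


# Illustrative plan; the tool names are placeholders, not real products.
plan = [
    PlannedAITask(
        task="title/abstract pre-screening",
        tool="ExampleScreener v1.0",
        validation_evidence="published validation, same domain and study type",
    ),
    PlannedAITask(
        task="structured data extraction",
        tool="ExampleExtractor v0.9",
    ),
]

print("Plan a pilot for:", tasks_needing_pilot(plan))
```

Writing the plan down in this form before screening starts also gives the protocol a dated record of what was declared in advance.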

2. Pilot validate on a domain-matched sample

Before turning AI loose on the full record set, label 200–500 records from your search by hand (or pull a labeled sample from a published review on a sufficiently similar topic). Run the AI on that sample. Report sensitivity at the threshold you plan to use. If sensitivity is below your protocol-specified floor (most reviews use 95–99%), the tool fails the validation gate and you fall back to manual screening or a different tool.

For background on the metric choice itself, see Why 99% Recall Is the Floor for Screening Automation.
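The validation gate described in this step can be sketched in a few lines. The example below assumes the tool returns a relevance score per record and that the protocol fixes both an operating threshold and a sensitivity floor; the function names and the sample data are illustrative. The Wilson confidence interval is an addition for context, not something either framework is stated here to require.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 is roughly 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def pilot_validation(human_labels, ai_scores, threshold, floor):
    """Compare AI scores against hand labels at the planned threshold."""
    pairs = list(zip(human_labels, ai_scores))
    tp = sum(1 for relevant, s in pairs if relevant and s >= threshold)
    fn = sum(1 for relevant, s in pairs if relevant and s < threshold)
    if tp + fn == 0:
        raise ValueError("pilot sample contains no relevant records")
    sensitivity = tp / (tp + fn)
    ci_low, ci_high = wilson_interval(tp, tp + fn)
    return {
        "true_pos": tp,
        "false_neg": fn,
        "sensitivity": sensitivity,
        "ci_low": ci_low,
        "ci_high": ci_high,
        "passes_gate": sensitivity >= floor,  # gate on the point estimate
    }


# Hypothetical pilot: 300 hand-labeled records, 40 relevant, AI misses one.
labels = [True] * 40 + [False] * 260
scores = [0.9] * 39 + [0.2] + [0.1] * 260
result = pilot_validation(labels, scores, threshold=0.65, floor=0.95)
print(result)
```

In this made-up sample the point estimate (39 of 40, or 97.5%) clears a 95% floor, but the lower bound of the interval sits well below it because only 40 of the 300 records are relevant. That is one reason to report the number of relevant records in the pilot, not just the size of the labeled sample.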

3. Preserve human authority at every decision point

The Cochrane and RAISE language on this is identical: a human makes the call. AI may rank, suggest, or pre-filter, but the inclusion/exclusion decision sits with a reviewer. In a workflow this looks like AI marking records as "likely include / likely exclude / uncertain," with a human reviewer confirming each inclusion and reviewing every "uncertain" record. Records the AI marks "likely exclude" are still subject to spot-check sampling at a documented rate (commonly 5–10%).

This is also where the choice between single- and dual-reviewer screening interacts with AI. AI-assisted single-reviewer screening with human spot-checks is methodologically defensible under RAISE; AI as a sole reviewer is not.
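One way to picture the three-way triage and the spot-check sample is the sketch below. It only builds work queues for human reviewers and makes no inclusion decision itself. The cutoffs, the record IDs, and the function names are assumptions for illustration; the 10% spot-check rate is taken from the upper end of the range mentioned above.

```python
import math
import random


def route(score: float, include_cut: float = 0.65, exclude_cut: float = 0.10) -> str:
    """Map an AI relevance score to a triage label (cutoffs are hypothetical)."""
    if score >= include_cut:
        return "likely include"
    if score < exclude_cut:
        return "likely exclude"
    return "uncertain"


def build_human_queue(scores: dict[str, float], spot_check_rate: float = 0.10,
                      seed: int = 2026) -> dict[str, list[str]]:
    """Split records into the three queues a human reviewer works through."""
    buckets = {"likely include": [], "uncertain": [], "likely exclude": []}
    for record_id, score in scores.items():
        buckets[route(score)].append(record_id)

    # Sample "likely exclude" records at the documented rate; a fixed seed
    # keeps the sample reproducible for the methods section.
    excluded = sorted(buckets["likely exclude"])
    k = math.ceil(spot_check_rate * len(excluded))
    spot_check = random.Random(seed).sample(excluded, k)

    return {
        "confirm": buckets["likely include"],   # human confirms each inclusion
        "review": buckets["uncertain"],         # human reviews every record
        "spot_check": sorted(spot_check),       # human checks a sample
    }


# Hypothetical scores for 30 records.
scores = {f"rec{i:02d}": s for i, s in enumerate([0.9] * 3 + [0.4] * 2 + [0.02] * 25)}
queue = build_human_queue(scores)
print({name: len(ids) for name, ids in queue.items()})
```

Recording the seed and the sampling rate alongside the queue makes the spot-check auditable later.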

4. Track and report override rates

The override rate, meaning the proportion of AI judgments a human reviewer reversed, is the single most informative metric for a peer reviewer trying to judge whether your AI use was responsible. Report it per task. A 1% override rate on 50,000 records (500 reversals) means reviewers rarely had to correct the tool. A 25% override rate means the tool was wrong often enough that a reviewer should reconsider whether AI added value here at all.
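Computing the rate per task is straightforward once AI and human judgments are logged side by side. The sketch below assumes a simple log of (task, AI judgment, human judgment) tuples covering only the AI judgments a human actually reviewed; the log format, the function names, and the 20% red line are illustrative choices, not part of either framework.

```python
from collections import defaultdict


def override_rates(decisions):
    """Proportion of human-reviewed AI judgments that were reversed, per task."""
    totals = defaultdict(int)
    overrides = defaultdict(int)
    for task, ai_judgment, human_judgment in decisions:
        totals[task] += 1
        if ai_judgment != human_judgment:
            overrides[task] += 1
    return {task: overrides[task] / totals[task] for task in totals}


def tasks_over_red_line(rates, red_line):
    """Tasks whose override rate exceeds the pre-agreed limit."""
    return sorted(task for task, rate in rates.items() if rate > red_line)


# Hypothetical log: 100 screening decisions (4 reversed),
# 20 extraction decisions (5 reversed).
log = (
    [("screening", "exclude", "exclude")] * 96
    + [("screening", "exclude", "include")] * 4
    + [("extraction", "value_a", "value_a")] * 15
    + [("extraction", "value_a", "value_b")] * 5
)

rates = override_rates(log)
print(rates)
print("Over red line:", tasks_over_red_line(rates, red_line=0.20))
```

Keeping the denominator explicit (how many AI judgments a human reviewed for each task) matters as much as the rate itself, since a low rate on a small spot-check sample says less than the same rate on full review.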

5. Report everything in the methods, not just "AI was used"

Use the RAISE 17-item structure as a methods-section checklist. The minimum viable disclosure includes model and version (e.g., "Mapped's screening classifier v3.2, deployed 2026-04-15"), prompt or threshold (e.g., "operating at 0.65 cutoff, calibrated for ≥99% recall"), validation evidence (sample size, sensitivity, specificity, source), and human oversight protocol (who reviewed what, override rate, dispute resolution).
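A small completeness check can catch a methods section that still says little more than "AI was used." The sketch below covers only the four minimum elements named in the paragraph above; the field names are hypothetical labels and do not reproduce the wording or numbering of the RAISE items, and the sample values are illustrative.

```python
# The four minimum disclosure elements named above, as hypothetical field names.
REQUIRED_FIELDS = (
    "model_and_version",
    "prompt_or_threshold",
    "validation_evidence",
    "human_oversight",
)


def missing_disclosures(disclosure: dict[str, str]) -> list[str]:
    """Return required fields that are absent or blank."""
    return [f for f in REQUIRED_FIELDS if not disclosure.get(f, "").strip()]


# Illustrative draft: the validation evidence has not been written up yet.
draft = {
    "model_and_version": "screening classifier v3.2, deployed 2026-04-15",
    "prompt_or_threshold": "0.65 cutoff, calibrated for >=99% recall",
    "validation_evidence": "",
    "human_oversight": "one reviewer confirmed inclusions; disputes to a third",
}

print("Still missing:", missing_disclosures(draft))
```

A check like this says nothing about whether the disclosed content is adequate; it only confirms that each element has been addressed.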

Where this leaves industry tools

The 2025 Cochrane position effectively split the AI-assisted review tooling market into two camps.

The first camp ships AI features without validation evidence specific to the task or domain. Several mainstream platforms — Covidence, Rayyan, DistillerSR — have published or referenced general performance numbers, but the Cochrane position now requires per-review validation regardless of the vendor's marketing claims. That requirement is on the reviewer, not the vendor.

The second camp publishes per-task validation methodology and exposes the validation interface to users, meaning the tool runs a pilot on your labeled sample and reports sensitivity before it screens at scale. This is the model that Mapped's screening pipeline implements: every project runs a calibration pass on a user-labeled or imported sample and surfaces the operating characteristic before the AI screens any unseen records.

Neither camp removes the reviewer's responsibility. The position requires the human to validate the validation — that is, to confirm that the labeled sample was representative, the sensitivity threshold was protocol-defined, and the override rate was acceptable. Vendor evidence is necessary but not sufficient.

What this means for your next review

If your protocol is in design phase right now, three concrete steps:

Add a "Planned AI assistance" section to the protocol and specify validation evidence per task. Use the RAISE 17-item checklist as the table of contents.
Set the sensitivity floor explicitly. Most reviewers in 2026 use 99% for primary screening, 95% for exploratory or scoping, and 99.5% for high-stakes clinical guidelines.
Decide your override-rate red line in advance. Above some threshold (commonly 15–20%), the team agrees to abandon AI-assisted screening for that task and revert to manual.

If your protocol is already locked and the AI plan is missing, you have two options. The cleaner one is a documented protocol amendment with date and rationale before AI screening begins. The less clean but defensible one is a methods-section disclosure that names the deviation, explains why, and presents the validation evidence retrospectively. Most general peer reviewers will accept the latter; peer reviewers who specialize in AI methods increasingly will not.