What is the simplest version of this framework?

Three questions: How damaging is one error? How closely will a human watch the AI? How accurate does the AI need to be in this exact context? If 'damaging × loose oversight × low validated accuracy,' do not delegate the task. If 'tolerable × tight oversight × validated accuracy,' delegation is defensible.

Why three axes instead of a binary use/don't-use rule?

Because the same AI tool can be appropriate for one task in your review and inappropriate for another. Title/abstract triage and final inclusion decisions sit at very different points on the impact-of-error axis, even when the underlying classifier is identical. A binary rule forces one answer onto both tasks, so it is wrong for at least one of them.

How is this different from the Covidence three-question framework?

Covidence's framework asks whether AI can do the task, whether it should, and whether the user is willing. The mapped framework keeps the spirit but replaces the user-willingness axis (which is subjective and not reproducible) with required performance, which is measurable and pre-registerable. The substantive overlap is intentional; the precision difference is what makes the matrix usable in protocols.

Where does the framework break down?

It breaks down on tasks where the impact of error is hard to estimate — most notably manuscript drafting, where one fabricated citation can collapse trust in the whole document but the average error is harmless. For drafting tasks, treat impact as 'high' even if the typical case looks low, because a single fabrication is catastrophic.

Can the framework justify removing humans entirely from a step?

No. The framework can justify shifting the human role from primary reviewer to spot-check sampler, but the 2025 Cochrane position and RAISE both require human decisional authority somewhere in the pipeline for every inclusion/exclusion. The matrix tells you where that human sits, not whether one is needed.

When Is AI Appropriate in a Systematic Review? A Three-Axis Decision Framework

The most common mistake teams make when they first introduce AI into a review is treating "use AI" as a single decision. It is not. A modern review has fifteen to twenty discrete tasks, and each one has a different risk profile. The 2025 Cochrane position and the RAISE framework agree on the boundaries, but neither of them tells you, task by task, where to draw the line.

This is the framework that mapped's methodology team uses internally and recommends to research teams designing protocols. It draws on Covidence's three-question heuristic (Scott et al., 2024) and replaces the subjective user-willingness axis with a measurable, pre-registerable axis: required performance.

The three axes

Axis	Question	Range
Impact of error	If the AI gets one decision wrong, how damaging is the consequence?	Low → Medium → High → Catastrophic
Oversight mode	What review does a human apply to AI output?	None → Spot-check → Full review of one class (e.g., inclusions only) → Full dual review
Required performance	What sensitivity/specificity is needed in your topic and study type, validated on a domain-matched sample?	Defined per task in the protocol

A task is appropriate for AI assistance when, taken together, the impact is bounded by the oversight, and the required performance is met by validation evidence specific to your context. If any one of the three fails, the task is not safe for AI delegation in your review — even if it is safe in someone else's.
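The gate described above can be sketched in a few lines. This is a minimal illustration, not mapped's implementation: the class names, the `MIN_OVERSIGHT` mapping from impact to weakest acceptable oversight, and the use of sensitivity alone as the performance measure are all assumptions made for the example. Set the real values in your own protocol.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List, Optional


class Impact(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CATASTROPHIC = 4


class Oversight(IntEnum):
    NONE = 0
    SPOT_CHECK = 1
    FULL_ONE_CLASS = 2
    FULL_DUAL = 3


# ASSUMPTION: weakest oversight mode accepted at each impact level.
# The article does not fix this mapping; it belongs in the protocol.
MIN_OVERSIGHT = {
    Impact.LOW: Oversight.SPOT_CHECK,
    Impact.MEDIUM: Oversight.FULL_ONE_CLASS,
    Impact.HIGH: Oversight.FULL_ONE_CLASS,
    Impact.CATASTROPHIC: Oversight.FULL_DUAL,
}


@dataclass
class TaskAssessment:
    name: str
    impact: Impact
    oversight: Oversight
    required_sensitivity: float             # floor set in the protocol
    validated_sensitivity: Optional[float]  # None means not yet validated


def delegation_failures(task: TaskAssessment) -> List[str]:
    """Return the reasons a task fails the gate; an empty list means it passes."""
    failures = []
    if task.oversight < MIN_OVERSIGHT[task.impact]:
        failures.append("oversight too loose for impact level")
    if task.validated_sensitivity is None:
        failures.append("no validation evidence in this context")
    elif task.validated_sensitivity < task.required_sensitivity:
        failures.append("validated sensitivity below protocol floor")
    return failures
```

Returning the list of failed conditions, rather than a single yes/no, keeps the per-axis reasoning visible when the result is copied into a protocol.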

Axis 1 — Impact of error

The first axis is the most underweighted in practice. Reviewers tend to imagine the average AI error and then decide that errors are tolerable. The relevant question is not the average error — it is the worst error. A title-screening tool that misses 1% of relevant records is forgivable if the missed records are "noise" duplicates of records already in the set, and unforgivable if the missed record is the largest RCT in the literature.

Operationally, classify impact by what a single error damages:

Low — adds a small amount of noise to a downstream step that is itself filtered (e.g., AI-suggested search terms feeding a search that is then peer-reviewed).
Medium — affects a single record's path through the review without affecting the final synthesis (e.g., a misclassified extraction that a second reviewer catches).
High — directly affects which records enter or leave the synthesis (e.g., a screening miss that excludes a relevant trial; a fabricated extracted value that enters meta-analysis).
Catastrophic — undermines the review's defensibility as a whole (e.g., a fabricated citation in the manuscript; a hallucinated quote attributed to a primary source).

The impact level sets the floor for oversight and required performance. High and catastrophic impact require either tight oversight or extremely well-validated performance — preferably both.

Axis 2 — Oversight mode

Oversight is the corrective for impact. The 2025 Cochrane position requires some human authority on every decision, but it does not specify the form. The framework treats oversight as a four-step ladder.

Mode	What the human does	When it's appropriate
None	AI output flows directly into the next step.	Almost never in a methods-grade review.
Spot-check	Human reviews a documented random sample of AI judgments at a defined rate.	Low-impact tasks where the validation evidence already establishes performance and the spot-check audits drift.
Full review of one class	Human reviews every AI judgment of one class (e.g., every AI "include"; every AI-extracted numerical value).	High-impact tasks where AI is acting as a triage filter.
Full dual review	Human and AI both judge every record; conflicts adjudicated by a third reviewer.	Catastrophic-impact tasks; tasks where validation evidence is borderline or missing.

The cleanest workflow for screening is "AI ranks → human reviews everything ranked above a threshold + a 5–10% random sample below it." This is the oversight mode mapped's screening pipeline defaults to.
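As a rough sketch of that workflow, the function below splits ranked records into a full-review set and a random audit sample. The function name, the score dictionary, and the fixed seed are hypothetical choices for illustration; they do not describe any particular platform's pipeline.

```python
import random


def select_for_human_review(scores, threshold, sample_rate=0.10, seed=0):
    """Split record IDs into a full-review set and a random audit sample.

    scores: dict mapping record ID -> AI relevance score (higher = more relevant).
    Records at or above the threshold are all reviewed; a random fraction
    of the records below it is drawn for auditing.
    """
    above = sorted(rid for rid, s in scores.items() if s >= threshold)
    below = sorted(rid for rid, s in scores.items() if s < threshold)
    # A fixed seed makes the audit sample reproducible, so it can be documented.
    rng = random.Random(seed)
    if below:
        k = min(len(below), max(1, round(sample_rate * len(below))))
    else:
        k = 0
    audit = sorted(rng.sample(below, k))
    return above, audit
```

Recording the seed and the sample rate alongside the results lets a reader of the methods section regenerate exactly which excluded records were audited.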

For deeper context on the metric thresholds underneath this ladder, see Why 99% Recall Is the Floor.

Axis 3 — Required performance

This is the axis the framework adds beyond Covidence's heuristic. "Required performance" is the sensitivity and specificity your task needs in your topic, validated on a sample drawn from your search, with the threshold set in your protocol.

Three rules govern this axis.

Required performance is task-specific. A 99% sensitivity floor for primary screening is conventional. A 95% floor is reasonable for scoping reviews where the universe is intentionally exploratory. A 99.5% floor is appropriate for high-stakes clinical guidelines. The number is set in the protocol; the AI either meets it or doesn't.
Required performance is domain-specific. The Marshall et al. (2018, 2023) and van Dinter et al. (2021) results are unambiguous: AI screening sensitivity drops 10–30 percentage points when a tool trained on one corpus runs against a different one. Industry-published benchmarks are not a substitute for validation on your own labeled sample.
Required performance must be validated before, not during. Validation during full-set screening introduces a selection confound: you are measuring on records the tool has already filtered. Validation must happen on a representative held-out sample before the AI runs at scale.

If validation evidence is below the required performance, the task fails the gate. The remediation is one of three: tighten the oversight (move from spot-check to full review of inclusions), lower the threshold (only acceptable if the protocol allows the lower floor), or fall back to manual.
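A minimal sketch of the validation check, assuming a held-out sample in which each record has a human reference label and an AI label (both booleans, True meaning "include"). The function names are hypothetical. The Wilson lower bound is an addition for this example, not part of the framework as stated: it is reported next to the point estimate because a small sample can meet the floor by chance.

```python
import math


def confusion_counts(human_labels, ai_labels):
    """Count true/false positives and negatives, treating human labels as truth."""
    tp = fp = tn = fn = 0
    for human, ai in zip(human_labels, ai_labels):
        if human and ai:
            tp += 1
        elif human and not ai:
            fn += 1
        elif not human and ai:
            fp += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def wilson_lower_bound(successes, n, z=1.96):
    """Lower end of the Wilson score interval for a proportion (z=1.96 ~ 95%)."""
    if n == 0:
        return 0.0
    p = successes / n
    denom = 1 + z * z / n
    centre = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - margin) / denom


def validation_report(human_labels, ai_labels, sensitivity_floor):
    """Compare validated sensitivity with the protocol floor."""
    tp, fp, tn, fn = confusion_counts(human_labels, ai_labels)
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "sensitivity_lower_95": wilson_lower_bound(tp, tp + fn),
        # Gate on the point estimate; the lower bound is reported for context.
        "passes_gate": sensitivity >= sensitivity_floor,
    }
```

With 100 relevant records and one miss, the point estimate meets a 99% floor while the lower bound sits near 94.6%, which is one reason to state the number of relevant records in the validation sample, not only its total size.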

Plotting six common SR tasks on the matrix

The framework gets concrete when you apply it. Here is how mapped scores the six most common AI candidate tasks in a 2026 review.

Task	Impact	Defensible oversight	Required performance	AI defensible today?
Search query generation	Low–Medium	Human reviews and edits; PRESS peer review	Not applicable (output is reviewed in full)	Yes, with PRESS peer review
Title/abstract screening	High	Full review of AI "include"; spot-check of AI "exclude"	≥99% sensitivity on domain-matched sample	Yes, with validation
Full-text screening	High	Full review of every record reaching this stage	≥99% sensitivity; specificity tracked	Yes, with validation
Structured data extraction	Medium–High	Full review of every extracted value before lock	Field-level accuracy ≥95% on domain sample	Yes, with full review
Risk-of-bias signaling-question scoring	High	Full review of every AI suggestion	Domain-validated agreement with human assessors	Cautiously, with full review
Manuscript drafting	Catastrophic	Sentence-level human review; citation verification	No validated AI passes a "no fabrication" test at scale	Drafting only; no claims or citations

The pattern is intentional. AI is most defensible where impact is bounded and oversight is full; least defensible where it is asked to produce text that will be cited.

Where the framework fails — and how to compensate

Three failure modes are worth flagging in the protocol so they do not surprise the team mid-review.

The impact axis is hard to estimate for fabrication. Generative models occasionally produce fluent text that is wrong in undetectable ways. The failure mode of a fabricated extracted value is not "extraction error" — it is "extraction error that looks correct." Compensate by treating any AI-generated value as untrusted until verified against the source PDF, and by recording the verification.

The oversight axis erodes under time pressure. Teams routinely set "full review of AI inclusions" in the protocol and slip to "spot-check" by month three when the review is behind schedule. Pre-register the override and dispute logs in the protocol; the audit trail is the actual oversight.

The required-performance axis is degraded by drift. A model deployed in March 2026 is not the same as the model deployed in November 2026 — vendors update, classifiers retrain, behavior changes. Pre-register a re-validation cadence (monthly is standard for living reviews; once-per-major-step is standard for traditional reviews).
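A re-validation cadence can be reduced to a small check, sketched below. The function name, its parameters, and the 30-day default are illustrative assumptions; the sketch also assumes the vendor exposes a model version string, which not every tool does.

```python
from datetime import date, timedelta


def revalidation_due(last_validated, validated_version, current_version,
                     today, max_age_days=30):
    """Return True if the tool should be re-validated before further use.

    Re-validation is triggered by a change in model version or by the
    last validation being older than the pre-registered cadence.
    """
    if current_version != validated_version:
        return True
    return today - last_validated > timedelta(days=max_age_days)
```

For example, `revalidation_due(date(2026, 3, 1), "v1", "v1", date(2026, 3, 20))` is False, while the same call with a current version of `"v2"` is True regardless of the dates.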

The shape of a defensible AI plan

A protocol section that satisfies the framework looks like this. The structure is portable across tools.

Planned AI assistance. For primary title/abstract screening, we will use [tool name and version], operating at a sensitivity threshold of 99%, validated on a held-out sample of 400 records labeled by two reviewers (Cohen's κ = 0.84). The AI will rank records; one reviewer will review in full every AI "include" and a 10% random sample of AI "exclude." Disagreements between reviewer and AI will be logged. Override rate will be reported in the methods section. Validation will be repeated if the model is updated by the vendor.

This is one paragraph. It satisfies the 2025 Cochrane position and the RAISE 17-item checklist for that single task. Replicate the paragraph for each AI-assisted task in the review.
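The example paragraph reports Cohen's κ for the two reviewers who labeled the validation sample. A minimal sketch of that calculation follows, using the standard definition κ = (observed agreement − chance agreement) / (1 − chance agreement). The function name and label strings are placeholders.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same records in the same order."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of records")
    n = len(rater_a)
    # Share of records on which the two raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[label] * freq_b[label] for label in labels) / (n * n)
    if expected == 1:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)
```

Because κ discounts agreement expected by chance, two reviewers who agree on 90% of records can still have a κ well below 0.9 when most records are excludes.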

Where this framework leaves industry tools

The framework is tool-agnostic by design. It tells you what evidence a tool must supply before you rely on it, not which tool to choose. The tools that fit cleanly into the framework are the ones that (a) expose validation interfaces, meaning you can run a calibration pass on your labeled sample and see the operating characteristics before the AI screens at scale, and (b) log overrides per task per reviewer for the methods section.
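To make "overrides per task per reviewer" concrete, here is a sketch of how an override log might be summarized. The log format (a list of dictionaries with `task`, `reviewer`, `ai_decision`, and `human_decision` keys) is an assumption for this example, not an export format of any named platform.

```python
from collections import defaultdict


def override_rates(log):
    """Return {(task, reviewer): share of AI decisions the human overrode}.

    log: iterable of dicts with keys 'task', 'reviewer',
    'ai_decision' and 'human_decision'.
    """
    totals = defaultdict(int)
    overrides = defaultdict(int)
    for entry in log:
        key = (entry["task"], entry["reviewer"])
        totals[key] += 1
        # An override is any record where the human decision differs from the AI's.
        if entry["ai_decision"] != entry["human_decision"]:
            overrides[key] += 1
    return {key: overrides[key] / totals[key] for key in totals}
```

Keeping the reviewer in the key makes it possible to notice when one reviewer never overrides the AI, which may indicate that review has quietly become a formality.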

Mapped's screening pipeline is built around the matrix: every project starts with a calibration step that surfaces sensitivity and specificity on a user-labeled sample, sets the operating threshold, and logs override rates from there. Other platforms (Covidence, DistillerSR, Rayyan) implement subsets of this; the framework helps reviewers ask vendors the right questions about which subset.

Putting it to work this week

Three things you can do today, before the next protocol meeting:

List every task in your draft review where AI is plausibly useful. Stop at the list — do not yet decide.
For each task, fill in the three axes: impact (low/medium/high/catastrophic), oversight (none/spot-check/full-class/full-dual), and required performance (the sensitivity/specificity floor in your topic).
Put the table in the protocol. The act of writing it changes the conversation from "is AI okay?" to "what evidence justifies this specific task?"

This is the framework's only job: to make the question per-task, the answer pre-registered, and the audit trail real. Everything that follows from there is execution.