The most common mistake teams make when they first introduce AI into a review is treating "use AI" as a single decision. It is not. A modern review has fifteen to twenty discrete tasks, and each one has a different risk profile. The 2025 Cochrane position and the RAISE framework agree on the boundaries, but neither of them tells you, task by task, where to draw the line.
This is the framework that mapped's methodology team uses internally and recommends to research teams designing protocols. It draws on Covidence's three-question heuristic (Scott et al., 2024) and replaces the subjective user-willingness axis with a measurable, pre-registerable axis: required performance.
The three axes
| Axis | Question | Range |
|---|---|---|
| Impact of error | If the AI gets one decision wrong, how damaging is the consequence? | Low → Medium → High → Catastrophic |
| Oversight mode | What review does a human apply to AI output? | None → Spot-check → Full review of one class → Full dual review |
| Required performance | What sensitivity/specificity is needed in your topic and study type, validated on a domain-matched sample? | Defined per task in the protocol |
A task is appropriate for AI assistance when, taken together, the impact is bounded by the oversight, and the required performance is met by validation evidence specific to your context. If any one of the three fails, the task is not safe for AI delegation in your review — even if it is safe in someone else's.
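The rule above can be sketched as a small gate function. This is an illustration only, not a published implementation: the names `Impact`, `Oversight`, `TaskPlan`, `MIN_OVERSIGHT`, and `is_delegable` are hypothetical, and the minimum oversight assigned to each impact level is an assumption that a team would set in its own protocol.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List, Optional, Tuple


class Impact(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CATASTROPHIC = 4


class Oversight(IntEnum):
    NONE = 0
    SPOT_CHECK = 1
    FULL_ONE_CLASS = 2
    FULL_DUAL = 3


# Assumption: the weakest oversight a protocol would accept per impact level.
MIN_OVERSIGHT = {
    Impact.LOW: Oversight.SPOT_CHECK,
    Impact.MEDIUM: Oversight.SPOT_CHECK,
    Impact.HIGH: Oversight.FULL_ONE_CLASS,
    Impact.CATASTROPHIC: Oversight.FULL_DUAL,
}


@dataclass
class TaskPlan:
    name: str
    impact: Impact
    oversight: Oversight
    required_sensitivity: float             # floor pre-registered in the protocol
    validated_sensitivity: Optional[float]  # measured on your own sample; None if never validated


def is_delegable(plan: TaskPlan) -> Tuple[bool, List[str]]:
    """Return (ok, reasons). The task passes only if every check passes."""
    reasons: List[str] = []
    if plan.oversight < MIN_OVERSIGHT[plan.impact]:
        reasons.append("oversight too weak for impact level")
    if plan.validated_sensitivity is None:
        reasons.append("no domain-matched validation evidence")
    elif plan.validated_sensitivity < plan.required_sensitivity:
        reasons.append("validated performance below protocol floor")
    return (not reasons, reasons)


# Usage: a high-impact task with only spot-check oversight fails the gate,
# even though its validated sensitivity clears the floor.
plan = TaskPlan("title/abstract screening", Impact.HIGH, Oversight.SPOT_CHECK, 0.99, 0.995)
ok, reasons = is_delegable(plan)
```

Returning the list of reasons rather than a bare boolean keeps the failed axis visible, which is what the protocol discussion needs.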
Axis 1 — Impact of error
The first axis is the most underweighted in practice. Reviewers tend to imagine the average AI error and then decide that errors are tolerable. The relevant question is not the average error — it is the worst error. A title-screening tool that misses 1% of relevant records is forgivable if the missed records are "noise" duplicates of records already in the set, and unforgivable if the missed record is the largest RCT in the literature.
Operationally, classify impact by what a single error damages:
- Low — adds a small amount of noise to a downstream step that is itself filtered (e.g., AI-suggested search terms feeding a search that is then peer-reviewed).
- Medium — affects a single record's path through the review without affecting the final synthesis (e.g., a misclassified extraction that a second reviewer catches).
- High — directly affects which records enter or leave the synthesis (e.g., a screening miss that excludes a relevant trial; a fabricated extracted value that enters meta-analysis).
- Catastrophic — undermines the review's defensibility as a whole (e.g., a fabricated citation in the manuscript; a hallucinated quote attributed to a primary source).
The impact level sets the floor for oversight and required performance. High and catastrophic impact require either tight oversight or extremely well-validated performance — preferably both.
Axis 2 — Oversight mode
Oversight is the corrective for impact. The 2025 Cochrane position requires some human authority on every decision, but it does not specify the form. The framework treats oversight as a four-step ladder.
| Mode | What the human does | When it's appropriate |
|---|---|---|
| None | AI output flows directly into the next step. | Almost never in a methods-grade review. |
| Spot-check | Human reviews a documented random sample of AI judgments at a defined rate. | Low-impact tasks where the validation evidence already establishes performance and the spot-check audits drift. |
| Full review of one class | Human reviews every AI judgment of one class (e.g., every AI "include"; every AI-extracted numerical value). | High-impact tasks where AI is acting as a triage filter. |
| Full dual review | Human and AI both judge every record; conflicts adjudicated by a third reviewer. | Catastrophic-impact tasks; tasks where validation evidence is borderline or missing. |
The cleanest workflow for screening is "AI ranks → human reviews everything ranked above a threshold + a 5–10% random sample below it." This is the oversight mode mapped's screening pipeline defaults to.
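A minimal sketch of that split, assuming the tool exposes a relevance score per record. The function name `build_review_queue`, the score dictionary, and the fixed seed are hypothetical choices for illustration and do not describe any vendor's API.

```python
import math
import random


def build_review_queue(scores, threshold, sample_rate=0.10, seed=2026):
    """Split ranked records into a full-review set and a random audit sample.

    scores: dict mapping record id -> AI relevance score (higher = more relevant).
    Returns (above, audit): every record at or above the threshold, ranked,
    plus a random sample of the records below it.
    """
    above = sorted(
        (rid for rid, s in scores.items() if s >= threshold),
        key=lambda rid: scores[rid],
        reverse=True,
    )
    # Sort ids so the sample does not depend on dictionary insertion order.
    below = sorted(rid for rid, s in scores.items() if s < threshold)
    # A fixed, documented seed makes the audit sample reproducible.
    rng = random.Random(seed)
    k = min(len(below), math.ceil(round(sample_rate * len(below), 6)))
    audit = rng.sample(below, k)
    return above, audit


# Usage with 100 synthetic records scored 0.00 to 0.99.
scores = {f"r{i}": i / 100 for i in range(100)}
above, audit = build_review_queue(scores, threshold=0.80, sample_rate=0.10)
```

Recording the seed and the sample rate in the protocol lets a reader regenerate exactly which below-threshold records were audited.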
For deeper context on the metric thresholds underneath this ladder, see Why 99% Recall Is the Floor.
Axis 3 — Required performance
This is the axis the framework adds beyond Covidence's heuristic. "Required performance" is the sensitivity and specificity your task needs in your topic, validated on a sample drawn from your search, with the threshold set in your protocol.
Three rules govern this axis.
- Required performance is task-specific. A 99% sensitivity floor for primary screening is conventional. A 95% floor is reasonable for scoping reviews where the universe is intentionally exploratory. A 99.5% floor is appropriate for high-stakes clinical guidelines. The number is set in the protocol; the AI either meets it or doesn't.
- Required performance is domain-specific. The Marshall et al. (2018, 2023) and van Dinter et al. (2021) results are unambiguous: AI screening sensitivity drops 10–30 percentage points when a tool trained on one corpus runs against a different one. Industry-published benchmarks are not a substitute for validation on your own labeled sample.
- Required performance must be validated before, not during. Validation during full-set screening is a sample-size confound: you are measuring on records the tool has already filtered. Validation must happen on a representative held-out sample before the AI runs at scale.
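The validation step can be sketched as follows, assuming a held-out sample with human reference labels and the AI's include/exclude calls on the same records. The function names are hypothetical. Reporting a Wilson lower bound alongside the point estimate is an addition for illustration, not something the framework prescribes.

```python
import math


def wilson_lower_bound(successes, n, z=1.96):
    """Lower end of the two-sided 95% Wilson score interval for a proportion."""
    if n == 0:
        return 0.0
    p = successes / n
    denom = 1 + z**2 / n
    centre = p + z**2 / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (centre - margin) / denom


def validate(human, ai, required_sensitivity):
    """Compare AI calls with human reference labels (True = include)."""
    if len(human) != len(ai):
        raise ValueError("label lists must have the same length")
    tp = fn = fp = tn = 0
    for h, a in zip(human, ai):
        if h and a:
            tp += 1
        elif h:
            fn += 1
        elif a:
            fp += 1
        else:
            tn += 1
    positives, negatives = tp + fn, tn + fp
    sensitivity = tp / positives if positives else None
    specificity = tn / negatives if negatives else None
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "sensitivity_lower_95": wilson_lower_bound(tp, positives) if positives else None,
        # The gate here uses the point estimate against the protocol floor.
        "meets_floor": sensitivity is not None and sensitivity >= required_sensitivity,
    }


# Usage: 400 labeled records, 40 relevant, all 40 caught, 60 false positives.
human = [True] * 40 + [False] * 360
ai = [True] * 40 + [True] * 60 + [False] * 300
result = validate(human, ai, required_sensitivity=0.99)
```

In this synthetic example the point estimate of sensitivity is 100%, but with only 40 relevant records the lower bound is roughly 0.91. A team may want to pre-register which of the two numbers the gate uses.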
If validation evidence is below the required performance, the task fails the gate. The remediation is one of three: tighten the oversight (move from spot-check to full review of inclusions), lower the threshold (only acceptable if the protocol allows the lower floor), or fall back to manual.
Plotting six common SR tasks on the matrix
The framework gets concrete when you apply it. Here is how mapped scores the six most common AI candidate tasks in a 2026 review.
| Task | Impact | Defensible oversight | Required performance | AI defensible today? |
|---|---|---|---|---|
| Search query generation | Low–Medium | Human reviews and edits; PRESS peer review | Not applicable (output is reviewed in full) | Yes, with PRESS peer review |
| Title/abstract screening | High | Full review of AI "include"; spot-check of AI "exclude" | ≥99% sensitivity on domain-matched sample | Yes, with validation |
| Full-text screening | High | Full review of every record reaching this stage | ≥99% sensitivity; specificity tracked | Yes, with validation |
| Structured data extraction | Medium–High | Full review of every extracted value before lock | Field-level accuracy ≥95% on domain sample | Yes, with full review |
| Risk-of-bias signaling-question scoring | High | Full review of every AI suggestion | Domain-validated agreement with human assessors | Cautiously, with full review |
| Manuscript drafting | Catastrophic | Sentence-level human review; citation verification | No validated AI passes a "no fabrication" test at scale | Drafting only; no claims or citations |
The pattern is intentional. AI is most defensible where impact is bounded and oversight is full; least defensible where it is asked to produce text that will be cited.
Where the framework fails — and how to compensate
Three failure modes are worth flagging in the protocol so they do not surprise the team mid-review.
The impact axis is hard to estimate for fabrication. Generative models occasionally produce fluent text that is wrong in undetectable ways. The failure mode of a fabricated extracted value is not "extraction error" — it is "extraction error that looks correct." Compensate by treating any AI-generated value as untrusted until verified against the source PDF, and by recording the verification.
The oversight axis erodes under time pressure. Teams routinely set "full review of AI inclusions" in the protocol and slip to "spot-check" by month three when the review is behind schedule. Pre-register the override and dispute logs in the protocol so that any slippage shows up in the record; the audit trail is the actual oversight.
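One way to make that slippage visible is to compute, per month, the share of AI "include" judgments that carry a logged human decision. The sketch below assumes a hypothetical log format (dictionaries with `month`, `ai_label`, and `human_label` keys); a real tool's export will differ.

```python
from collections import defaultdict


def include_review_coverage(log):
    """Share of AI 'include' judgments with a logged human decision, per month."""
    reviewed = defaultdict(int)
    total = defaultdict(int)
    for entry in log:
        if entry["ai_label"] != "include":
            continue
        total[entry["month"]] += 1
        if entry["human_label"] is not None:
            reviewed[entry["month"]] += 1
    return {month: reviewed[month] / total[month] for month in sorted(total)}


def slipped_months(coverage, required=1.0):
    """Months where coverage fell below what the protocol promised."""
    return [month for month, share in coverage.items() if share < required]


# Usage with a tiny hypothetical log; human_label is None when nobody reviewed.
log = [
    {"month": "2026-01", "ai_label": "include", "human_label": "include"},
    {"month": "2026-01", "ai_label": "include", "human_label": "exclude"},
    {"month": "2026-01", "ai_label": "exclude", "human_label": None},
    {"month": "2026-03", "ai_label": "include", "human_label": "include"},
    {"month": "2026-03", "ai_label": "include", "human_label": None},
]
coverage = include_review_coverage(log)
flagged = slipped_months(coverage)
```

With "full review of AI inclusions" in the protocol, any month with coverage below 1.0 is a deviation to report.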
The required-performance axis is degraded by drift. A model deployed in March 2026 is not the same as the model deployed in November 2026 — vendors update, classifiers retrain, behavior changes. Pre-register a re-validation cadence (monthly is standard for living reviews; once-per-major-step is standard for traditional reviews).
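A re-validation trigger can be as simple as the check below. It assumes the team records the model version and date of the last validation; the function name, the version strings, and the 30-day reading of "monthly" are all hypothetical.

```python
from datetime import date, timedelta


def revalidation_due(last_validated, validated_version, current_version,
                     today, max_age_days=30):
    """True if the model changed or the last validation is older than the cadence."""
    if current_version != validated_version:
        return True  # any vendor update invalidates the earlier evidence
    return (today - last_validated) > timedelta(days=max_age_days)


# Usage: same version, validated 19 days ago, so nothing is due yet.
due = revalidation_due(date(2026, 3, 1), "v1", "v1", today=date(2026, 3, 20))
```

This only works if the vendor exposes a version identifier. Where it does not, the elapsed-time rule is the only available trigger.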
The shape of a defensible AI plan
A protocol section that satisfies the framework looks like this. The structure is portable across tools.
Planned AI assistance. For primary title/abstract screening, we will use [tool name and version], operating at a sensitivity threshold of 99%, validated on a held-out sample of 400 records labeled by two reviewers (Cohen's κ = 0.84). The AI will rank records; one reviewer will review in full every AI "include" and a 10% random sample of AI "exclude." Disagreements between reviewer and AI will be logged. Override rate will be reported in the methods section. Validation will be repeated if the model is updated by the vendor.
This is one paragraph. It satisfies the 2025 Cochrane position and the RAISE 17-item checklist for that single task. Replicate the paragraph for each AI-assisted task in the review.
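Two numbers in a paragraph like this, inter-reviewer agreement and the override rate, are straightforward to compute from the logs. The sketch below uses the standard Cohen's kappa formula. The definition of override rate (the share of human-reviewed records where the human decision differs from the AI's) is an assumption, since the source does not define it, and the function names are hypothetical.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same records."""
    n = len(rater_a)
    if n == 0 or n != len(rater_b):
        raise ValueError("raters must label the same non-empty set of records")
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    categories = set(count_a) | set(count_b)
    # Agreement expected by chance from each rater's marginal frequencies.
    expected = sum(count_a[c] * count_b[c] for c in categories) / n**2
    if expected == 1:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


def override_rate(pairs):
    """pairs: (ai_label, human_label) for each record a human reviewed."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("no reviewed records")
    return sum(ai != human for ai, human in pairs) / len(pairs)


# Usage with synthetic labels: 100 records, reviewers agree on 90.
rater_a = ["inc"] * 20 + ["exc"] * 80
rater_b = ["inc"] * 15 + ["exc"] * 5 + ["inc"] * 5 + ["exc"] * 75
kappa = cohen_kappa(rater_a, rater_b)
rate = override_rate([("inc", "inc"), ("inc", "exc"), ("exc", "exc"), ("inc", "inc")])
```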
Where this framework leaves industry tools
The framework is tool-agnostic by design. It tells you what evidence a tool must provide, not which tool to choose. The tools that fit cleanly into the framework are the ones that (a) expose validation interfaces, meaning you can run a calibration pass on your labeled sample and see the operating characteristic before AI screens at scale, and (b) log overrides per task per reviewer for the methods section.
Mapped's screening pipeline is built around the matrix: every project starts with a calibration step that surfaces sensitivity and specificity on a user-labeled sample, sets the operating threshold, and logs override rates from there. Other platforms (Covidence, DistillerSR, Rayyan) implement subsets of this; the framework helps reviewers ask vendors the right questions about which subset.
Putting it to work this week
Three things you can do today, before the next protocol meeting:
- List every task in your draft review where AI is plausibly useful. Stop at the list — do not yet decide.
- For each task, fill in the three axes: impact (low/medium/high/catastrophic), oversight (none/spot-check/full-class/full-dual), and required performance (the sensitivity/specificity floor in your topic).
- Put the table in the protocol. The act of writing it changes the conversation from "is AI okay?" to "what evidence justifies this specific task?"
This is the framework's only job: to make the question per-task, the answer pre-registered, and the audit trail real. Everything that follows from there is execution.
Further reading
- Scott AM, et al. RAISE: Responsible AI in Evidence Synthesis recommendations. Bond University, 2024–2025.
- Cochrane. Position statement on the use of artificial intelligence in evidence synthesis (2025 update).
- Marshall IJ, Wallace BC. Toward systematic review automation: a practical guide to using machine learning tools in research synthesis. Systematic Reviews, 2019.
- van Dinter R, Tekinerdogan B, Catal C. Automation of systematic literature reviews: A systematic literature review. Information and Software Technology, 2021.
- Khalil H, et al. Tools to support the automation of systematic reviews: a scoping review. Journal of Clinical Epidemiology, 2022.
For the upstream policy context, see Responsible AI in Systematic Reviews. For the metric layer the framework relies on, see Why 99% Recall Is the Floor. For the broader trajectory in evidence synthesis, see How AI is Transforming Systematic Reviews.