Why is recall the dominant metric in systematic review screening?

Because the cost of a false negative — a missed relevant study — is asymmetric. A missed relevant study can change the conclusion of the entire review and is essentially undetectable downstream. A false positive (an irrelevant study marked relevant) costs reviewer time but is caught at the next stage. Recall measures the rate at which relevant studies are kept; everything else is secondary.

Where does the 99% number come from?

It is a convention drawn from method evaluations of supervised classifiers (Marshall et al. 2018; Khalil et al. 2022) and the 95% baseline in the Cochrane Handbook for human dual-reviewer agreement. The argument: AI-assisted screening should not be less sensitive than the human baseline it is replacing. 99% is the conservative interpretation of that constraint and is what most journals' peer reviewers expect to see for AI-assisted screening in 2026.

What is workload saving and how is it computed?

Workload saving is the proportion of records the AI excluded with confidence, measured against the total records screened. If your search returned 10,000 records and the AI confidently excluded 6,500 with 99% recall on validation, your workload saving is 65%. The remaining 35% (3,500 records) are reviewed manually or in a high-priority queue. WSS@95 (work saved over sampling at 95% recall) is the published variant for benchmark comparisons; it subtracts the 5% of work that random sampling would save at the same recall.

Why does precision matter at all if recall is the floor?

Precision is the proportion of AI-included records that are actually relevant. Low precision at high recall means a large queue of false positives a reviewer must clear. The economic value of AI screening depends on both: high recall keeps you safe, high precision makes it fast. A tool with 99% recall and 20% precision is safe but barely useful; a tool with 99% recall and 60% precision is the goal.

Should I use AUROC or sensitivity at threshold to evaluate a screening tool?

Sensitivity at a defined threshold. AUROC summarizes performance across all thresholds, which is useful for model comparison but irrelevant to a reviewer who must commit to a single operating point. Report sensitivity, specificity, and number-needed-to-screen at the threshold you intend to deploy. The threshold is the methods-section commitment; AUROC is a research artifact.

Why 99% Recall Is the Floor for Screening Automation (And What Precision Buys You)

The conversation about AI screening tools is often framed in marketing language: "70% workload reduction," "smart screening," "AI-assisted prioritization." None of these mean anything until the underlying metrics are pinned down. This is a working explainer on the four numbers that actually decide whether an AI screening tool is safe, useful, and reportable in your review.

The metrics, defined for screening

Screening is binary classification: each record is judged relevant (positive) or not (negative). Four numbers describe how a classifier performs on a labeled validation sample.

Metric	Definition	What it measures
Recall (sensitivity)	TP ÷ (TP + FN)	Of all truly relevant records, what fraction did the classifier keep?
Precision	TP ÷ (TP + FP)	Of all records the classifier said are relevant, what fraction actually are?
Specificity	TN ÷ (TN + FP)	Of all truly irrelevant records, what fraction did the classifier correctly exclude?
F1	2 × (precision × recall) ÷ (precision + recall)	Single-number summary, dominated by whichever of precision/recall is lower.
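The four definitions in the table translate directly into code. The following is a minimal sketch, not a reference implementation: the `Confusion` class, the function names, and the example counts are illustrative assumptions, and the functions assume non-empty denominators.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Counts from comparing classifier decisions with reviewer labels."""
    tp: int  # relevant records the classifier kept
    fp: int  # irrelevant records the classifier kept
    tn: int  # irrelevant records the classifier excluded
    fn: int  # relevant records the classifier excluded


def recall(c: Confusion) -> float:
    return c.tp / (c.tp + c.fn)


def precision(c: Confusion) -> float:
    return c.tp / (c.tp + c.fp)


def specificity(c: Confusion) -> float:
    return c.tn / (c.tn + c.fp)


def f1(c: Confusion) -> float:
    p, r = precision(c), recall(c)
    return 2 * p * r / (p + r)


# Illustrative counts only: 10,000 records, 500 of them relevant.
example = Confusion(tp=495, fp=743, tn=8757, fn=5)
print(f"recall={recall(example):.3f} precision={precision(example):.3f} "
      f"specificity={specificity(example):.3f} f1={f1(example):.3f}")
```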

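A fifth quantity is operationally important: workload saving, the proportion of total records the classifier confidently excluded at a given recall. The benchmark variant, WSS@N (work saved over sampling at recall N), subtracts the 1 − N share of work that random sampling would save at the same recall.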
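Raw workload saving and the benchmark WSS figure can be computed side by side. This sketch assumes the Cohen et al. (2006) definition of work saved over sampling, (TN + FN) ÷ N − (1 − recall); the function names and counts are illustrative.

```python
def proportion_excluded(tn: int, fn: int, total: int) -> float:
    """Share of all records the classifier excluded (raw workload saving)."""
    return (tn + fn) / total


def wss(tn: int, fn: int, total: int, recall: float) -> float:
    """Work saved over sampling: the excluded share minus the share that
    random sampling would leave unscreened at the same recall."""
    return proportion_excluded(tn, fn, total) - (1.0 - recall)


# Illustrative counts: 10,000 records, 6,500 excluded, 5 of them relevant.
saved = proportion_excluded(tn=6495, fn=5, total=10_000)
adjusted = wss(tn=6495, fn=5, total=10_000, recall=0.99)
print(f"excluded={saved:.1%}  WSS@99={adjusted:.1%}")
```

The gap between the two numbers is small at 99% recall and grows as the recall target is relaxed, which is why benchmark papers report the adjusted figure.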

Three points are easy to lose if you have not built classifiers before:

Recall and precision trade off as you move the operating threshold. Higher recall costs precision; higher precision costs recall.
Specificity and precision are different. Specificity is a property of the negative class; precision is a property of the positive class. In a screening set with 5% prevalence, 95% specificity is poor (false positives are roughly as numerous as true positives, so precision is only about 50%) while 95% precision is excellent.
None of these numbers are intrinsic to a model. They are properties of model + threshold + dataset. Reporting recall without specifying threshold and dataset is uninformative.
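How far specificity and precision diverge depends on prevalence. The sketch below derives the precision implied by a given prevalence, recall, and specificity; the function name and the 99% recall used in the loop are assumptions chosen for illustration.

```python
def precision_from_rates(prevalence: float, recall: float,
                         specificity: float) -> float:
    """Precision implied by prevalence, recall, and specificity."""
    tp = prevalence * recall                        # share of records that are true positives
    fp = (1.0 - prevalence) * (1.0 - specificity)   # share that are false positives
    return tp / (tp + fp)


# Same classifier rates, different prevalence.
for prev in (0.05, 0.20, 0.50):
    p = precision_from_rates(prev, recall=0.99, specificity=0.95)
    print(f"prevalence={prev:.0%}  precision={p:.1%}")
```

The same specificity yields very different precision as prevalence changes, which is one reason the numbers have to be reported together with the dataset they came from.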

Why recall is the floor

In screening, errors are asymmetric. A false negative removes a potentially relevant study from your synthesis. A false positive adds a record to a manual review queue that has to be cleared anyway. The first is essentially invisible — there is no downstream signal that a relevant study was missed unless you happen to find it in a hand-search. The second is loud: a reviewer sees the record, judges it, and excludes it.

This asymmetry is why every screening framework — from the Cochrane Handbook to PRISMA-S to the 2025 Cochrane AI position — anchors on recall.

The 99% threshold itself is a conservative reading of two facts.

The first is the Cochrane Handbook's expectation that dual-reviewer screening achieves ≥95% sensitivity (with the second reviewer catching what the first misses). The original empirical basis comes from Edwards et al. (2002) and has been reproduced in subsequent inter-rater reliability work, summarized in our explainer on inter-rater reliability for screening.

The second is the published evaluation literature on supervised screening classifiers. Marshall et al. (2018) and Khalil et al. (2022) showed that high-performing classifiers can sustain ≥99% sensitivity at meaningful precision in many topic areas — but that performance varies sharply by topic.

The argument is then conservative: an AI-assisted classifier should not perform below the human dual-reviewer baseline it replaces. 95% is the human floor; 99% is the safety margin most editorial review processes now expect for AI-assisted primary screening.

For broader-scope reviews where the universe is intentionally exploratory — scoping reviews, mapping reviews — 95% is defensible if the protocol commits to it in advance. For high-stakes clinical guideline reviews, 99.5% is increasingly common.

What precision actually buys you

Recall keeps you safe. Precision is what makes AI-assisted screening worth the setup cost. The arithmetic is simple but worth seeing.

Imagine a search returning 10,000 records with a true relevance rate of 5% (500 truly relevant records).

Recall	Precision	True positives kept	False positives in queue	Total queue to clear	Workload saved
99%	20%	495	1,980	2,475	75.3%
99%	40%	495	743	1,238	87.6%
99%	60%	495	330	825	91.8%
99%	80%	495	124	619	93.8%
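The rows above follow from three inputs: the number of relevant records, recall, and precision. This is a small sketch of that arithmetic, using exact fractions so that half-values round the same way as in the table; `queue_row` and `round_half_up` are illustrative names, not part of any tool.

```python
import math
from fractions import Fraction


def round_half_up(x: Fraction) -> int:
    """Round to the nearest integer, with halves going up."""
    return math.floor(x + Fraction(1, 2))


def queue_row(total: int, relevant: int, recall: Fraction,
              precision: Fraction) -> dict:
    tp = relevant * recall      # relevant records the classifier keeps
    queue = tp / precision      # everything the classifier keeps
    fp = queue - tp             # irrelevant records left in the queue
    return {
        "tp": round_half_up(tp),
        "fp": round_half_up(fp),
        "queue": round_half_up(queue),
        "saved_pct": float(100 * (1 - queue / total)),
    }


for pct in (20, 40, 60, 80):
    row = queue_row(10_000, 500, Fraction(99, 100), Fraction(pct, 100))
    print(f"precision={pct}%  TP={row['tp']}  FP={row['fp']}  "
          f"queue={row['queue']}  saved={row['saved_pct']:.2f}%")
```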

Each of these tools has identical safety properties. They differ entirely in how much queue the reviewer must clear. The "70% workload reduction" claim that vendors put on slide decks tends to assume a precision in the 25–35% range. The lived experience of a reviewer using such a tool is "I still screened a lot of records." Precision in the 50%+ range is what changes that experience.

This is also what number-needed-to-screen (NNS) measures: the average count of records a reviewer must screen to find one truly relevant record. NNS = 1 ÷ precision. At 20% precision, NNS = 5; at 60% precision, NNS = 1.67.

How to validate a screening tool on your topic

The Cochrane 2025 position requires per-topic validation. The mechanics are simple.

Build a labeled sample. 200 records is the minimum credible sample; 400–500 is preferable. Records should be sampled from your actual search, not from the general literature. Two reviewers label them blind to each other; disagreements are adjudicated.
Run the AI on the sample, blind to labels. The AI produces a relevance score for each record.
Calibrate the threshold. Sort records by score. Find the threshold at which sensitivity reaches your protocol-specified floor (typically 99%). Read off the corresponding precision and specificity at that threshold.
Decide. If precision at the chosen threshold is acceptable to you (most teams set 30%+ as the operational floor for value), deploy. If not, the tool fails the gate for this topic.
Re-validate periodically. Re-run validation if the model is updated, if the search is significantly extended, or quarterly in long-running living reviews.
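The calibration step can be sketched in a few lines. This is a minimal illustration under stated assumptions, not any tool's actual interface: it assumes each record has a numeric relevance score, that a record is kept when its score is at or above the threshold, and that labels are 1 for relevant and 0 for irrelevant. The function name and the toy data are hypothetical.

```python
from typing import Sequence


def calibrate_threshold(scores: Sequence[float], labels: Sequence[int],
                        recall_floor: float = 0.99) -> dict:
    """Return the strictest threshold whose recall meets the floor,
    with the precision and specificity observed at that threshold."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0:
        raise ValueError("validation sample contains no relevant records")

    # Walk candidate thresholds from strictest to loosest.
    # Quadratic in sample size, which is fine for a few hundred records.
    for threshold in sorted(set(scores), reverse=True):
        kept = [y for s, y in zip(scores, labels) if s >= threshold]
        tp = sum(kept)
        fp = len(kept) - tp
        if tp / positives >= recall_floor:
            return {
                "threshold": threshold,
                "recall": tp / positives,
                "precision": tp / (tp + fp),
                "specificity": (negatives - fp) / negatives if negatives else float("nan"),
            }
    raise ValueError("recall_floor must be at most 1.0")


# Toy validation sample: seven records, three of them relevant.
result = calibrate_threshold(
    scores=[0.95, 0.90, 0.80, 0.60, 0.40, 0.30, 0.10],
    labels=[1, 1, 0, 1, 0, 0, 0],
)
print(result)
```

The deploy-or-fail decision is then a comparison of `result["precision"]` against whatever operational floor the protocol sets.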

Validation must happen before the AI screens at scale. Validating during full screening introduces selection bias: you are measuring on records the tool has already filtered.

For the broader decision of whether to use AI for screening at all in your review, see the three-axis decision framework. For the policy framework that requires this validation, see Responsible AI in Systematic Reviews.

What 99% recall does not mean

The conventional 99% recall floor is widely repeated and frequently misread. Three clarifications are worth flagging in any methods-section discussion.

99% recall is not 99% accuracy. Accuracy weights both classes equally. In a 5%-prevalent screening set, a classifier that says "exclude" to everything achieves 95% accuracy and 0% recall. Accuracy is not a useful screening metric.
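The "exclude everything" case is quick to check. A minimal sketch with illustrative function names, using 10,000 records at 5% prevalence:

```python
def accuracy(tp: int, fp: int, tn: int, fn: int) -> float:
    return (tp + tn) / (tp + fp + tn + fn)


def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn)


# A classifier that excludes every record: no positives are ever predicted.
tp, fp, tn, fn = 0, 0, 9_500, 500
print(f"accuracy={accuracy(tp, fp, tn, fn):.0%}  recall={recall(tp, fn):.0%}")
```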

99% recall on validation is not 99% recall on the full set. Validation produces an estimate. The 95% confidence interval on a sensitivity estimate from 400 records with ~50 positives is roughly ±4 percentage points under the normal approximation, and an exact or Wilson interval reaches further on the low side when the estimate is close to 100%. The reported sensitivity is the point estimate; the protocol should commit to the lower bound of the interval as the operating expectation.
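Because sensitivity estimates near 100% sit at the edge of the scale, a Wilson score interval is a common alternative to the plain normal approximation. The sketch below assumes a hypothetical validation result of 49 relevant records kept out of 50; the function name is illustrative and the choice of the Wilson method is an assumption, not something the protocol literature cited here prescribes.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# Hypothetical: 49 of 50 relevant validation records were kept.
low, high = wilson_interval(49, 50)
print(f"point estimate={49 / 50:.1%}  95% CI=({low:.1%}, {high:.1%})")
```

The lower bound is the number to compare against the protocol's recall floor, and it moves up only as the count of relevant records in the validation sample grows.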

99% recall is not zero-loss. It is one-in-a-hundred loss. On a search of 10,000 records with 500 relevant, 99% recall means you accept missing roughly 5 relevant records. Whether that loss is acceptable depends on what those 5 records are likely to be. Spot-check sampling of AI-excluded records (5–10% of excludes, randomly drawn) is the standard mitigation.
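Both the expected loss and the spot-check draw are simple to script. This is a sketch under assumptions: record identifiers are plain integers, the sample fraction is 10%, and the fixed seed is there only so the draw can be reproduced and reported. The function names are hypothetical.

```python
import math
import random
from typing import Iterable, List


def expected_misses(relevant: int, recall: float) -> float:
    """Relevant records expected to be lost at a given recall."""
    return relevant * (1.0 - recall)


def spot_check_sample(excluded_ids: Iterable[int], fraction: float = 0.10,
                      seed: int = 0) -> List[int]:
    """Draw a reproducible random sample of AI-excluded records."""
    ids = list(excluded_ids)
    k = math.ceil(len(ids) * fraction)
    return random.Random(seed).sample(ids, k)


print(f"expected misses: {expected_misses(500, 0.99):.1f}")
sample = spot_check_sample(range(1000), fraction=0.10, seed=42)
print(f"records to spot-check: {len(sample)}")
```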

How tools differ on these numbers

Most published evaluations of mainstream screening tools sit in similar ranges, with substantial topic-by-topic variance. Cohen et al. (2006), Wallace et al. (2010), Marshall et al. (2018), Howard et al. (2020), and Khalil et al. (2022) collectively support a band of ~85–99% sensitivity at precision in the 20–60% range for general screening tasks, depending on tool, topic, and threshold.

The relevant comparison is not between tools at advertised performance; it is between the same tool's performance in your topic, on your validation sample. The Marshall et al. (2018) corollary is that "validation transferability cannot be assumed across topics." Two tools with identical published numbers can perform very differently on a cardiovascular disease (CVD) intervention review, a rare-disease prognostic review, and a non-English-language qualitative synthesis.

This is the operational implication of the validation requirement that the 2025 Cochrane position made binding. The tools that handle this well are the ones that ship a calibration interface — meaning the user can run a pilot on their labeled sample and read off the operating characteristic before deploying. Mapped's screening pipeline implements this as a required first step; calibration must complete and be approved before AI screens any unseen record.

The shape of a defensible screening methods paragraph

A reportable screening section that satisfies PRISMA 2020, the 2025 Cochrane position, and RAISE looks roughly like this.

Screening. Records were screened with [tool name and version] operating at a threshold calibrated to ≥99% recall on a held-out validation sample of 412 records labeled by two reviewers (Cohen's κ = 0.86). At the operating threshold, sensitivity was 99.2% (95% CI 96.8–99.9%), specificity was 71.3%, and precision was 41%. AI-included records and a 10% random sample of AI-excluded records were reviewed by one reviewer; full-text screening was dual-reviewer per Cochrane Handbook 6.5. Override rate during deployment was 4.7% across 1,238 AI inclusions. Validation was repeated when the model was updated on [date].

That is one paragraph. It contains every number a peer reviewer needs to judge defensibility.

Putting it to work this week

Three things to do before your next screening kickoff:

Set the recall floor in the protocol explicitly. 99% for primary; 95% for scoping; 99.5% for high-stakes guidelines. Do not let it be implicit.
Build a 400-record labeled validation sample from your actual search. Calibrate the AI threshold against it. Read off precision. Decide on that basis.
Pre-register the override-rate red line. If the override rate exceeds a pre-agreed limit (commonly 15–20%) during deployment, the team agrees to abandon AI-assisted screening for the task and revert to manual.
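The red-line check itself is a one-line comparison once the counts are logged. This sketch assumes that an override means a reviewer reversed the AI's decision on a record, and that the limit is the 15% end of the range above; both the definition and the function names are assumptions for illustration.

```python
def override_rate(overridden: int, reviewed: int) -> float:
    """Share of AI decisions that a reviewer reversed."""
    if reviewed <= 0:
        raise ValueError("reviewed must be positive")
    return overridden / reviewed


def breaches_red_line(overridden: int, reviewed: int, limit: float = 0.15) -> bool:
    """True when the observed override rate exceeds the pre-registered limit."""
    return override_rate(overridden, reviewed) > limit


# Illustrative counts: 58 overrides across 1,238 AI inclusions.
print(f"override rate: {override_rate(58, 1238):.1%}")
print(f"abandon AI screening: {breaches_red_line(58, 1238)}")
```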

The metrics are not complicated. The discipline is in setting them in advance and reporting them honestly. That is what makes AI-assisted screening defensible — not the tool, the numbers.