Why 99% Recall Is the Floor for Screening Automation (And What Precision Buys You)

A working explainer on the metrics that govern AI-assisted screening: why recall is the dominant constraint, why 99% has become the de facto floor, and what the precision/specificity numbers actually buy in workload reduction.

Mapped Methodology Team · Methodology Team

The conversation about AI screening tools is often framed in marketing language: "70% workload reduction," "smart screening," "AI-assisted prioritization." None of these mean anything until the underlying metrics are pinned down. This is a working explainer on the four numbers that actually decide whether an AI screening tool is safe, useful, and reportable in your review.

The metrics, defined for screening

Screening is binary classification: each record is judged relevant (positive) or not (negative). Four numbers describe how a classifier performs on a labeled validation sample.

| Metric | Definition | What it measures |
| --- | --- | --- |
| Recall (sensitivity) | TP ÷ (TP + FN) | Of all truly relevant records, what fraction did the classifier keep? |
| Precision | TP ÷ (TP + FP) | Of all records the classifier said are relevant, what fraction actually are? |
| Specificity | TN ÷ (TN + FP) | Of all truly irrelevant records, what fraction did the classifier correctly exclude? |
| F1 | 2 × (precision × recall) ÷ (precision + recall) | Single-number summary, dominated by whichever of precision/recall is lower. |

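A fifth quantity is operationally important: workload saving, the proportion of total records the classifier excludes at a threshold that achieves recall N. The closely related work saved over sampling (WSS@N) subtracts the (1 − N) share that random sampling would save at the same recall.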
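As a minimal sketch, the four definitions and the workload figures can be computed directly from confusion-matrix counts. The function name `screening_metrics` and its returned keys are illustrative and not part of any tool's API. The WSS line assumes the usual work-saved-over-sampling definition, (TN + FN) ÷ total − (1 − recall).

```python
def screening_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute screening metrics from confusion-matrix counts.

    Hypothetical helper for illustration. Assumes at least one relevant
    record, one irrelevant record, and one true positive.
    """
    total = tp + fp + tn + fn
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    specificity = tn / (tn + fp)
    f1 = 2 * precision * recall / (precision + recall)
    # Share of all records the classifier excluded, so no reviewer sees them.
    workload_saved = (tn + fn) / total
    # Work saved over sampling: saving beyond random sampling at equal recall.
    wss = workload_saved - (1 - recall)
    return {
        "recall": recall,
        "precision": precision,
        "specificity": specificity,
        "f1": f1,
        "workload_saved": workload_saved,
        "wss": wss,
    }


# Example counts: 10,000 records, 500 relevant, 495 of them kept.
metrics = screening_metrics(tp=495, fp=743, tn=8757, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

The counts in the example are made up for illustration; in practice they come from the labeled validation sample at a fixed threshold.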

Three points are easy to lose if you have not built classifiers before:

  1. Recall and precision trade off as you move the operating threshold. Higher recall costs precision; higher precision costs recall.
  2. Specificity and precision are different. Specificity is a property of the negative class; precision is a property of the positive class. In a screening set with 5% prevalence, 95% specificity is only moderate (at 99% recall it leaves roughly one false positive per true positive, so precision is about 51%) while 95% precision is excellent.
  3. None of these numbers are intrinsic to a model. They are properties of model + threshold + dataset. Reporting recall without specifying threshold and dataset is uninformative.
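The relationship between specificity and precision in point 2 can be checked with a few lines of arithmetic. This is a sketch under the assumption that prevalence, recall, and specificity are known rates; `implied_precision` is a hypothetical helper name.

```python
def implied_precision(prevalence: float, recall: float, specificity: float) -> float:
    """Precision implied by prevalence, recall, and specificity.

    Works on rates (fractions of the whole set) rather than counts.
    """
    true_positive_rate = prevalence * recall
    false_positive_rate = (1 - prevalence) * (1 - specificity)
    return true_positive_rate / (true_positive_rate + false_positive_rate)


# Same recall and prevalence, three different specificities.
for specificity in (0.80, 0.95, 0.99):
    precision = implied_precision(prevalence=0.05, recall=0.99, specificity=specificity)
    print(f"specificity {specificity:.0%} -> precision {precision:.1%}")
```

At 5% prevalence and 99% recall, 95% specificity works out to roughly 51% precision. The same specificity at a different prevalence gives a different precision, which is why point 3 insists on reporting the dataset alongside the number.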

Why recall is the floor

In screening, errors are asymmetric. A false negative removes a potentially relevant study from your synthesis. A false positive adds a record to a manual review queue that has to be cleared anyway. The first is essentially invisible — there is no downstream signal that a relevant study was missed unless you happen to find it in a hand-search. The second is loud: a reviewer sees the record, judges it, and excludes it.

This asymmetry is why every screening framework — from the Cochrane Handbook to PRISMA-S to the 2025 Cochrane AI position — anchors on recall.

The 99% threshold itself is a conservative reading of two facts.

The first is the Cochrane Handbook's expectation that dual-reviewer screening achieves ≥95% sensitivity (with the second reviewer catching what the first misses). The original empirical basis comes from Edwards et al. (2002) and has been reproduced in subsequent inter-rater reliability work, summarized in our explainer on inter-rater reliability for screening.

The second is the published evaluation literature on supervised screening classifiers. Marshall et al. (2018) and Khalil et al. (2022) showed that high-performing classifiers can sustain ≥99% sensitivity at meaningful precision in many topic areas — but that performance varies sharply by topic.

The argument is then conservative: an AI-assisted classifier should not perform below the human dual-reviewer baseline it replaces. 95% is the human floor; 99% is the safety margin most editorial review processes now expect for AI-assisted primary screening.

For broader-scope reviews where the universe is intentionally exploratory — scoping reviews, mapping reviews — 95% is defensible if the protocol commits to it in advance. For high-stakes clinical guideline reviews, 99.5% is increasingly common.

What precision actually buys you

Recall keeps you safe. Precision is what makes AI-assisted screening worth the setup cost. The arithmetic is simple but worth seeing.

Imagine a search returning 10,000 records with a true relevance rate of 5% (500 truly relevant records).

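| Recall | Precision | True positives kept | False positives in queue | Total queue to clear | Workload saved |
| --- | --- | --- | --- | --- | --- |
| 99% | 20% | 495 | 1,980 | 2,475 | 75.3% |
| 99% | 40% | 495 | 743 | 1,238 | 87.6% |
| 99% | 60% | 495 | 330 | 825 | 91.8% |
| 99% | 80% | 495 | 124 | 619 | 93.8% |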

Each of these tools has identical safety properties. They differ entirely in how much queue the reviewer must clear. The "70% workload reduction" claim that vendors put on slide decks tends to assume a precision in the 25–35% range. The lived experience of a reviewer using such a tool is "I still screened a lot of records." Precision in the 50%+ range is what changes that experience.

This is also what number-needed-to-screen (NNS) measures: the average count of records a reviewer must screen to find one truly relevant record. NNS = 1 ÷ precision. At 20% precision, NNS = 5; at 60% precision, NNS = 1.67.
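The queue sizes and NNS figures above can be reproduced with a short script. This is a sketch, not a vendor calculation: `queue_row` and `round_half_up` are hypothetical helpers, and percentages are passed as integers so that exact fractions avoid floating-point rounding surprises (Python's built-in `round` would turn 742.5 into 742).

```python
import math
from fractions import Fraction


def round_half_up(x: Fraction) -> int:
    """Round to the nearest integer, with .5 rounding up."""
    return math.floor(x + Fraction(1, 2))


def queue_row(total: int, relevant: int, recall_pct: int, precision_pct: int) -> dict:
    """Queue size and workload saving at a given recall and precision."""
    tp = round_half_up(Fraction(relevant * recall_pct, 100))
    # From precision = TP / (TP + FP): FP = TP * (1 - precision) / precision.
    fp = round_half_up(Fraction(tp * (100 - precision_pct), precision_pct))
    queue = tp + fp
    saved_pct = Fraction(100 * (total - queue), total)
    return {
        "tp": tp,
        "fp": fp,
        "queue": queue,
        "workload_saved_pct": round_half_up(saved_pct * 10) / 10,
        "nns": 100 / precision_pct,  # number needed to screen = 1 / precision
    }


for precision_pct in (20, 40, 60, 80):
    row = queue_row(total=10_000, relevant=500, recall_pct=99, precision_pct=precision_pct)
    print(precision_pct, row)
```

Changing `relevant` shows how strongly the workload figure depends on prevalence, which is one reason a headline workload-reduction number means little without the dataset behind it.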

How to validate a screening tool on your topic

The Cochrane 2025 position requires per-topic validation. The mechanics are simple.

  1. Build a labeled sample. 200 records is the minimum credible sample; 400–500 is preferable. Records should be sampled from your actual search, not from the general literature. Two reviewers label them blind to each other; disagreements are adjudicated.
  2. Run the AI on the sample, blind to labels. The AI produces a relevance score for each record.
  3. Calibrate the threshold. Sort records by score. Find the threshold at which sensitivity reaches your protocol-specified floor (typically 99%). Read off the corresponding precision and specificity at that threshold.
  4. Decide. If precision at the chosen threshold is acceptable to you (most teams set 30%+ as the operational floor for value), deploy. If not, the tool fails the gate for this topic.
  5. Re-validate periodically. Re-run validation if the model is updated, if the search is significantly extended, or quarterly in long-running living reviews.

Validation must happen before the AI screens at scale. Validating during full screening introduces selection bias: you are measuring on records the tool has already filtered.
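Step 3, threshold calibration, can be sketched in plain Python. This assumes the tool exposes one relevance score per record, where higher means more likely relevant, and that records scoring at or above the threshold are kept. `calibrate_threshold` is a hypothetical helper, not any tool's API.

```python
def calibrate_threshold(scores, labels, recall_floor=0.99):
    """Find the highest score threshold whose recall meets the floor.

    scores: relevance scores from the tool, one per validation record.
    labels: adjudicated reviewer labels (truthy = relevant).
    Returns the threshold with the recall, precision, and specificity
    observed on the validation sample at that threshold.
    """
    pairs = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total_pos = sum(1 for _, label in pairs if label)
    total_neg = len(pairs) - total_pos
    if total_pos == 0:
        raise ValueError("validation sample contains no relevant records")

    tp = fp = 0
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Records tied at the same score are kept or dropped together.
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        if tp / total_pos >= recall_floor:
            specificity = (total_neg - fp) / total_neg if total_neg else float("nan")
            return {
                "threshold": threshold,
                "recall": tp / total_pos,
                "precision": tp / (tp + fp),
                "specificity": specificity,
                "queue_size": tp + fp,
            }
    raise RuntimeError("unreachable: recall is 1.0 at the lowest score")


# Toy validation sample (made-up scores and labels).
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
toy_labels = [1, 1, 0, 1, 0, 0, 1, 0]
print(calibrate_threshold(toy_scores, toy_labels, recall_floor=0.99))
```

The precision returned here is what feeds the step 4 decision. With a small sample the recall figure is only a point estimate, so it should be read together with its confidence interval.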

For the broader decision of whether to use AI for screening at all in your review, see the three-axis decision framework. For the policy framework that requires this validation, see Responsible AI in Systematic Reviews.

What 99% recall does not mean

The conventional 99% recall floor is widely repeated and frequently misread. Three clarifications are worth flagging in any methods-section discussion.

99% recall is not 99% accuracy. Accuracy weights both classes equally. In a 5%-prevalent screening set, a classifier that says "exclude" to everything achieves 95% accuracy and 0% recall. Accuracy is not a useful screening metric.
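The "exclude everything" case is easy to verify. A minimal sketch with a hypothetical helper name:

```python
def accuracy_and_recall(tp: int, fp: int, tn: int, fn: int) -> tuple:
    """Return (accuracy, recall) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    recall = tp / (tp + fn)
    return accuracy, recall


# A classifier that excludes all 10,000 records, 500 of which are relevant.
accuracy, recall = accuracy_and_recall(tp=0, fp=0, tn=9_500, fn=500)
print(f"accuracy {accuracy:.0%}, recall {recall:.0%}")
```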

99% recall on validation is not 99% recall on the full set. Validation produces an estimate. The 95% confidence interval on a sensitivity estimate from 400 records with ~50 positives is roughly ±4 percentage points under a normal approximation, and an exact or Wilson interval is wider on the low side. The reported sensitivity is the point estimate; the protocol should commit to the lower bound of the interval as the operating expectation.
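One way to get the lower bound is a Wilson score interval on the number of relevant records the tool kept. This is a sketch of one common interval choice, not a method the cited guidance prescribes. `wilson_interval` is a hypothetical helper, and z = 1.96 assumes a 95% interval.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a proportion.

    For sensitivity: successes = relevant records kept, n = all relevant
    records in the validation sample.
    """
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p_hat = successes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


# 49 of 50 relevant records kept in validation (98% point estimate).
low, high = wilson_interval(49, 50)
print(f"sensitivity 95% CI: {low:.1%} to {high:.1%}")
```

With 49 of 50 positives kept, the interval runs from roughly 90% to just under 100%. That is a reminder that about 50 positives cannot by themselves demonstrate a 99% floor. A larger sample narrows the interval.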

99% recall is not zero-loss. It is one-in-a-hundred loss. On a search of 10,000 records with 500 relevant, 99% recall means you accept missing roughly 5 relevant records. Whether that loss is acceptable depends on what those 5 records are likely to be. Spot-check sampling of AI-excluded records (5–10% of excludes, randomly drawn) is the standard mitigation.
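Drawing the spot-check sample is straightforward. A minimal sketch, assuming AI-excluded records are identified by IDs; `spot_check_sample` and the fixed seed are illustrative choices so the draw can be reproduced and reported.

```python
import math
import random


def spot_check_sample(excluded_ids, fraction: float = 0.10, seed: int = 2025) -> list:
    """Draw a reproducible random sample of AI-excluded records for manual review."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    ids = list(excluded_ids)
    sample_size = min(len(ids), math.ceil(fraction * len(ids)))
    # A dedicated Random instance keeps the draw independent of global state.
    return random.Random(seed).sample(ids, sample_size)


# Example: 8,762 excluded records, 10% spot check.
excluded = [f"rec-{i}" for i in range(8_762)]
to_review = spot_check_sample(excluded, fraction=0.10)
print(len(to_review))
```

Any relevant record found in the sample is a direct observation of a false negative and should be reported alongside the validation figures.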

How tools differ on these numbers

Most published evaluations of mainstream screening tools sit in similar ranges, with substantial topic-by-topic variance. Cohen et al. (2006), Wallace et al. (2010), Marshall et al. (2018), Howard et al. (2020), and Khalil et al. (2022) collectively support a band of ~85–99% sensitivity at precision in the 20–60% range for general screening tasks, depending on tool, topic, and threshold.

The relevant comparison is not between tools at advertised performance — it is between the same tool's performance in your topic, on your validation sample. The Marshall et al. (2018) corollary is that "validation transferability cannot be assumed across topics." Two tools with identical published numbers can perform very differently on a CVD intervention review, a rare-disease prognostic review, and a non-English-language qualitative synthesis.

This is the operational implication of the validation requirement that the 2025 Cochrane position made binding. The tools that handle this well are the ones that ship a calibration interface — meaning the user can run a pilot on their labeled sample and read off the operating characteristic before deploying. Mapped's screening pipeline implements this as a required first step; calibration must complete and be approved before AI screens any unseen record.

The shape of a defensible screening methods paragraph

A reportable screening section that satisfies PRISMA 2020, the 2025 Cochrane position, and RAISE looks roughly like this.

Screening. Records were screened with [tool name and version] operating at a threshold calibrated to ≥99% recall on a held-out validation sample of 412 records labeled by two reviewers (Cohen's κ = 0.86). At the operating threshold, sensitivity was 99.2% (95% CI 96.8–99.9%), specificity was 71.3%, and precision was 41%. AI-included records and a 10% random sample of AI-excluded records were full-reviewed by one reviewer; full-text screening was dual-reviewer per Cochrane Handbook 6.5. Override rate during deployment was 4.7% across 1,238 AI inclusions. Validation was repeated when the model was updated on [date].

That is one paragraph. It contains every number a peer reviewer needs to judge defensibility.

Putting it to work this week

Three things to do before your next screening kickoff:

  1. Set the recall floor in the protocol explicitly. 99% for primary; 95% for scoping; 99.5% for high-stakes guidelines. Do not let it be implicit.
  2. Build a 400-record labeled validation sample from your actual search. Calibrate the AI threshold against it. Read off precision. Decide on that basis.
  3. Pre-register the override-rate red line. If the override rate during deployment exceeds the agreed limit (commonly 15–20%), the team abandons AI-assisted screening for the task and reverts to manual.
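The three commitments above can be written down as a small, checkable protocol object. This is a sketch with hypothetical names (`ScreeningProtocol`, `passes_validation`, `should_revert_to_manual`). It assumes override rate means reviewer-overturned AI inclusions divided by all AI inclusions reviewed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScreeningProtocol:
    recall_floor: float       # e.g. 0.99 primary, 0.95 scoping, 0.995 guidelines
    precision_floor: float    # operational floor for value, e.g. 0.30
    override_red_line: float  # revert to manual above this, e.g. 0.15


def passes_validation(protocol: ScreeningProtocol, recall: float, precision: float) -> bool:
    """Gate applied once, before the AI screens any unseen record."""
    return recall >= protocol.recall_floor and precision >= protocol.precision_floor


def should_revert_to_manual(protocol: ScreeningProtocol, overrides: int, ai_inclusions: int) -> bool:
    """Deployment check: has the override rate crossed the pre-registered line?"""
    if ai_inclusions <= 0:
        raise ValueError("need at least one reviewed AI inclusion")
    return overrides / ai_inclusions > protocol.override_red_line


protocol = ScreeningProtocol(recall_floor=0.99, precision_floor=0.30, override_red_line=0.15)
print(passes_validation(protocol, recall=0.992, precision=0.41))
print(should_revert_to_manual(protocol, overrides=58, ai_inclusions=1_238))
```

Freezing the dataclass mirrors the point of pre-registration: the thresholds are fixed before deployment and are not adjusted once results come in.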

The metrics are not complicated. The discipline is in setting them in advance and reporting them honestly. That is what makes AI-assisted screening defensible — not the tool, the numbers.

Further reading

  • Cohen AM, Hersh WR, Peterson K, Yen P-Y. Reducing workload in systematic review preparation using automated citation classification. Journal of the American Medical Informatics Association, 2006.
  • Wallace BC, et al. Active learning for biomedical citation screening. KDD, 2010.
  • Marshall IJ, et al. Toward systematic review automation: a practical guide to using machine learning tools in research synthesis. Systematic Reviews, 2019.
  • Howard BE, et al. SWIFT-Active Screener: accelerated document screening through active learning and integrated recall estimation. Environment International, 2020.
  • Khalil H, et al. Tools to support the automation of systematic reviews: a scoping review. Journal of Clinical Epidemiology, 2022.
  • Edwards P, et al. Identification of randomized controlled trials in systematic reviews: accuracy and reliability of screening records. Statistics in Medicine, 2002.

For the upstream policy framework, see Responsible AI in Systematic Reviews. For the per-task decision logic, see the three-axis framework. For the inter-rater reliability metrics that anchor the 95% human baseline, see Cohen's Kappa, Gwet's AC1, and What to Report.
