We don't claim a fixed accuracy number.
We calibrate every review.
AI screening sensitivity is not a single number. It varies by topic, study type, language, and search composition. The 2025 Cochrane position is explicit: validation transferability cannot be assumed across topics. So instead of publishing one marketing figure, mapped runs a per-project calibration gate before AI scales — and logs every parameter to the audit trail.
Updated May 2026 · Aligned with Cochrane 2025 + RAISE
What the product does instead
Before mapped's screening AI is allowed to scale on your records, you run a calibration pilot. The pilot follows the methodology set out in our Responsible AI guide: you pick the operating point, mapped logs the implied sensitivity, and the audit trail captures both. The marketing claim and the operational reality are the same text.
- Step 1. Upload labelled sample: 200–500 records you've already screened
- Step 2. Mapped runs classifier: across all three models, all thresholds
- Step 3. Operating curve shown: recall / precision at each threshold
- Step 4. You pick the threshold: one that meets your protocol's recall floor
- Step 5. Logged in audit trail: threshold + sensitivity + override rate
Calibration runs once per project before AI screening starts at scale. The threshold you accept is the only operating point used afterwards — we don't silently switch defaults.
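The calibration steps above can be sketched in a few lines. This is a minimal illustration, not mapped's implementation: the score format, the `OperatingPoint` type, and the function names are assumptions made for the example, and the data is synthetic. In the product you choose the threshold yourself; `pick_threshold` only shows one reasonable rule, the highest threshold that still meets the recall floor.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperatingPoint:
    threshold: float
    recall: float      # sensitivity: share of true includes that are kept
    precision: float   # share of kept records that are true includes


def operating_curve(scores, labels, thresholds):
    """Compute recall and precision at each candidate threshold.

    scores: classifier inclusion scores in [0, 1] (hypothetical format)
    labels: human screening decisions, True = include
    """
    total_includes = sum(labels)
    if total_includes == 0:
        raise ValueError("calibration sample contains no included records")
    curve = []
    for t in thresholds:
        kept = [s >= t for s in scores]
        true_pos = sum(1 for k, y in zip(kept, labels) if k and y)
        n_kept = sum(kept)
        recall = true_pos / total_includes
        precision = true_pos / n_kept if n_kept else 0.0
        curve.append(OperatingPoint(t, recall, precision))
    return curve


def pick_threshold(curve, recall_floor):
    """Return the highest threshold whose recall meets the floor, or None."""
    eligible = [p for p in curve if p.recall >= recall_floor]
    return max(eligible, key=lambda p: p.threshold) if eligible else None


# Synthetic six-record sample for illustration only.
scores = [0.95, 0.80, 0.65, 0.40, 0.30, 0.10]
labels = [True, True, False, True, False, False]

curve = operating_curve(scores, labels, thresholds=[0.2, 0.5, 0.7])
chosen = pick_threshold(curve, recall_floor=0.95)
```

On this toy sample only the 0.2 threshold keeps all three included records, so it is the only point that satisfies a 0.95 recall floor. Raising the threshold improves precision but drops the record scored 0.40, which is the trade-off the operating curve is meant to expose.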
What we publish — and what we don't
Our differentiator is meant to survive a skeptical reading. Read both lists; if either feels wrong, tell us. This page changes before the product does.
Claims we don't make
- A single fixed sensitivity / specificity number that holds across every review
- “Validated for systematic reviews” as a blanket label
- “Outperforms human reviewers” — no AI-only screening tool meets that bar today
- Threshold-free accuracy figures (every screening claim depends on the operating point)
- Cross-domain transferability — validation on cardiology data does not validate oncology screening
Claims we do make
- A per-project calibration gate before AI runs at scale on your records
- The threshold you choose, the sensitivity it implies, and the override rate during review — all logged in the audit trail
- Auto-filled RAISE 17-item checklist on every manuscript export
- Three-model comparison (Claude / GPT / Gemini) for tasks where consensus reduces disagreement risk
- Pre-registered public benchmark on CSMeD + SYNERGY held-out partitions, due Q3 2026 (see roadmap)
Our public-validation roadmap
Concrete deliverables with dates; OSF DOIs to follow. This is the page that tells you whether we shipped what we said we would; if a date here moves, this page moves first.
- Q3 2026: Public benchmark on CSMeD + SYNERGY held-out partitions
Pre-registered protocol on OSF before the held-out set is opened. Sensitivity, specificity, F1, and Cohen's kappa at multiple thresholds, broken out by study type (intervention, prognostic, DTA, NMA, scoping) and by the three model providers we run.
Reported per task: title/abstract screening, full-text screening, RoB-domain tagging, and structured data extraction. Confidence intervals via bootstrap. Versioned PDF + reproducibility artefacts published from this page.
- Quarterly: Re-run on new model versions
Each new Claude / GPT / Gemini release gets re-benchmarked against the same frozen partition. Deltas (and any regressions) published with the next quarterly report so customers can see the trajectory, not just a snapshot.
Frozen partition + identical prompts + identical thresholds. Any change to harness or prompts is a separate report so the model-version delta stays clean.
- Continuous: Aggregate calibration data from real reviews
With explicit customer consent, we publish anonymised per-topic distributions of calibration sensitivity, override rates, and disagreement rates. This is the closest signal a prospective customer can get to “does this work for my domain?” short of running their own pilot.
No record-level data. No project metadata. Distributions only, aggregated across at least 30 projects per topic before publishing — small-cell suppression.
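Two statistical commitments in this roadmap, bootstrap confidence intervals and small-cell suppression, can be sketched as follows. This is an illustration under stated assumptions, not the benchmark harness: it uses a percentile bootstrap (the roadmap does not say which bootstrap variant will be used), the function names and the quartile summary are hypothetical, and all data is synthetic. Only the floor of 30 projects per topic comes from the text above.

```python
import random
from collections import defaultdict
from statistics import quantiles

MIN_PROJECTS_PER_TOPIC = 30  # suppression floor stated in the roadmap


def sensitivity(predicted, actual):
    """Share of truly relevant records that the screener kept."""
    positives = sum(actual)
    hits = sum(1 for p, a in zip(predicted, actual) if p and a)
    return hits / positives


def bootstrap_ci(predicted, actual, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for sensitivity, resampling records."""
    rng = random.Random(seed)
    n = len(actual)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        resampled_actual = [actual[i] for i in idx]
        if not any(resampled_actual):
            continue  # no relevant records drawn; sensitivity is undefined
        resampled_pred = [predicted[i] for i in idx]
        stats.append(sensitivity(resampled_pred, resampled_actual))
    stats.sort()
    lower = stats[int(alpha / 2 * len(stats))]
    upper = stats[min(len(stats) - 1, int((1 - alpha / 2) * len(stats)))]
    return lower, upper


def publishable_summaries(project_rows):
    """Summarise sensitivity per topic, suppressing topics with too few projects.

    project_rows: iterable of (topic, calibration_sensitivity), one per project.
    """
    by_topic = defaultdict(list)
    for topic, value in project_rows:
        by_topic[topic].append(value)
    summaries = {}
    for topic, values in by_topic.items():
        if len(values) < MIN_PROJECTS_PER_TOPIC:
            continue  # small cell: publish nothing for this topic
        q1, median, q3 = quantiles(values, n=4)
        summaries[topic] = {
            "n_projects": len(values), "q1": q1, "median": median, "q3": q3,
        }
    return summaries


# Synthetic data for illustration only.
actual = [True] * 40 + [False] * 160
predicted = [True] * 36 + [False] * 4 + [True] * 20 + [False] * 140
ci_low, ci_high = bootstrap_ci(predicted, actual)

rows = [("cardiology", 0.90 + 0.002 * i) for i in range(30)]
rows += [("oncology", 0.95)] * 29
published = publishable_summaries(rows)
```

In this example the oncology topic has 29 projects, one short of the floor, so it is dropped entirely rather than published with a caveat. Reporting quartiles instead of minimum and maximum is a design choice of the sketch: it avoids exposing any single project's exact value.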
How customers reproduce it
The product's audit trail is the validation surface. Every manuscript export from mapped includes the RAISE 17-item checklist auto-filled with that project's calibration result, threshold, override rate, and dispute resolution. So even before the public benchmark exists, every review run on mapped can be RAISE-reported today.
What gets logged automatically
- Model name and version (Claude / GPT / Gemini) per task
- Prompt template hash + parameters
- Threshold accepted at the calibration gate
- Sensitivity / specificity implied at that threshold
- Per-task human override rate during review
- Inter-rater agreement (Cohen's kappa) where applicable
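To make the logged fields concrete, here is a minimal sketch of how one audit-trail entry could be assembled. The record layout, field names, model identifiers, and numeric values are all hypothetical placeholders, and SHA-256 is assumed as the hash because the text does not name one. Only the categories of logged information come from the list above.

```python
import hashlib
import json


def prompt_template_hash(template: str) -> str:
    """SHA-256 hex digest of the exact prompt template text."""
    return hashlib.sha256(template.encode("utf-8")).hexdigest()


def override_rate(ai_decisions, final_decisions):
    """Share of AI decisions that a human reviewer changed."""
    if not ai_decisions:
        return 0.0
    changed = sum(1 for a, f in zip(ai_decisions, final_decisions) if a != f)
    return changed / len(ai_decisions)


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    n = len(rater_a)
    observed = sum(1 for a, b in zip(rater_a, rater_b) if a == b) / n
    categories = set(rater_a) | set(rater_b)
    expected = sum(
        (rater_a.count(c) / n) * (rater_b.count(c) / n) for c in categories
    )
    if expected == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (observed - expected) / (1.0 - expected)


# Synthetic decisions for illustration only.
ai_decisions = ["include", "exclude", "exclude", "include"]
final_decisions = ["include", "include", "exclude", "include"]
reviewer_a = ["include", "exclude", "exclude", "include"]
reviewer_b = ["include", "include", "exclude", "include"]

record = {
    "task": "title_abstract_screening",
    "model": {"name": "example-model", "version": "example-version"},
    "prompt_template_sha256": prompt_template_hash(
        "Decide include/exclude for: {abstract}"
    ),
    "prompt_parameters": {"temperature": 0.0},
    "threshold": 0.2,               # placeholder value
    "implied_sensitivity": 0.97,    # placeholder value
    "implied_specificity": 0.62,    # placeholder value
    "override_rate": override_rate(ai_decisions, final_decisions),
    "cohens_kappa": cohens_kappa(reviewer_a, reviewer_b),
}
print(json.dumps(record, indent=2, sort_keys=True))
```

Hashing the template rather than storing it verbatim lets two reports be checked for prompt identity without publishing the prompt itself; any edit to the template, even whitespace, produces a different digest.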
What you receive on export
- RAISE 17-item checklist (PDF + machine-readable JSON)
- PRISMA 2020 “study selection” + “data collection” sections pre-filled
- Per-task validation log appended to the manuscript
- Anonymised reproducibility bundle (calibration set + thresholds)
Want the full audit-trail walkthrough? It lives inside the product. Calibration is on the screening step; the RAISE checklist appears on the manuscript export step.
References
- RAISE: Responsible AI in Evidence Synthesis recommendations. Bond University et al., 17-item reporting checklist for AI-assisted reviews.
- Cochrane: Statement on the use of AI in evidence synthesis (2025). Conditions under which AI is permitted in Cochrane reviews.
- Marshall & Wallace (2019): Toward systematic review automation. Foundational survey of AI-assisted screening performance.
- Khalil et al. (2022): Tools to support the automation of SRs. Evaluation of automation tools across the SR workflow.
- Responsible AI in Systematic Reviews: methodology guide. Our working interpretation of Cochrane 2025 + RAISE.
- AI screening: recall, precision, and the threshold trade-off. Why a single sensitivity number is methodologically misleading.
- AI decision matrix for systematic reviews. Where AI helps, where it shouldn't, and what to validate per task.