AI Validation

We don't claim a fixed accuracy number.
We calibrate every review.

AI screening sensitivity is not a single number. It varies by topic, study type, language, and search composition. The 2025 Cochrane position is explicit: validation transferability cannot be assumed across topics. So instead of publishing one marketing figure, mapped runs a per-project calibration gate before AI scales — and logs every parameter to the audit trail.

Updated May 2026 · Aligned with Cochrane 2025 + RAISE

The calibration gate

What the product does instead

Before mapped's screening AI is allowed to scale on your records, you run a calibration pilot. The pilot follows the methodology set out in our Responsible AI guide: you pick the operating point, mapped logs the implied sensitivity, and the audit trail captures both. The marketing claim and the operational reality are the same text.

1. Upload labelled sample: 200–500 records you've already screened.
2. Mapped runs classifier: across all three models, all thresholds.
3. Operating curve shown: recall / precision at each threshold.
4. You pick the threshold: meets your protocol's recall floor.
5. Logged in audit trail: threshold + sensitivity + override rate.

Calibration runs once per project before AI screening starts at scale. The threshold you accept is the only operating point used afterwards — we don't silently switch defaults.
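The operating curve in steps 3 and 4 can be sketched in a few lines. This is an illustrative sketch only, not mapped's implementation: the function names, the `OperatingPoint` type, and the assumption that the classifier returns an inclusion score between 0 and 1 per record are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperatingPoint:
    threshold: float
    recall: float      # sensitivity: share of true includes that are kept
    precision: float   # share of kept records that are true includes


def operating_curve(scores, labels, thresholds):
    """Recall and precision at each candidate threshold.

    scores: classifier inclusion scores in [0, 1] (assumed output format)
    labels: the human decisions from the labelled sample, True = include
    """
    total_includes = sum(labels)
    points = []
    for t in thresholds:
        kept = [s >= t for s in scores]
        tp = sum(k and y for k, y in zip(kept, labels))
        n_kept = sum(kept)
        recall = tp / total_includes if total_includes else 0.0
        # Convention used here: precision is 1.0 when nothing is kept.
        precision = tp / n_kept if n_kept else 1.0
        points.append(OperatingPoint(t, recall, precision))
    return points


def pick_threshold(points, recall_floor):
    """Highest threshold whose recall still meets the protocol's floor."""
    eligible = [p for p in points if p.recall >= recall_floor]
    if not eligible:
        return None  # no threshold satisfies the floor on this sample
    return max(eligible, key=lambda p: p.threshold)


scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
labels = [True, True, False, True, False, False]
curve = operating_curve(scores, labels, thresholds=[0.2, 0.5, 0.7])
choice = pick_threshold(curve, recall_floor=0.95)
```

In this toy sample only the 0.2 threshold keeps every true include, so it is the only point that meets a 0.95 recall floor. In the product the reviewer makes that choice; `pick_threshold` only illustrates the rule a protocol's recall floor implies.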

Honest claims

What we publish — and what we don't

The differentiator survives skeptical reading. Read both columns; if either feels wrong, tell us — this page changes before the product does.

Claims we don't make

A single fixed sensitivity / specificity number that holds across every review
“Validated for systematic reviews” as a blanket label
“Outperforms human reviewers” — no AI-only screening tool meets that bar today
Threshold-free accuracy figures (every screening claim depends on the operating point)
Cross-domain transferability — validation on cardiology data does not validate oncology screening

Claims we do make

A per-project calibration gate before AI runs at scale on your records
The threshold you choose, the sensitivity it implies, and the override rate during review — all logged in the audit trail
Auto-filled RAISE 17-item checklist on every manuscript export
Three-model comparison (Claude / GPT / Gemini) for tasks where consensus reduces disagreement risk
Pre-registered public benchmark on CSMeD + SYNERGY held-out partitions, due Q3 2026 (see roadmap)

Roadmap

Our public-validation roadmap

Concrete deliverables with dates; OSF DOIs will follow. This is the page that tells you whether we shipped what we said we would; if a date here moves, this page moves first.

Q3 2026
Public benchmark — CSMeD + SYNERGY held-out partitions
Pre-registered protocol on OSF before the held-out set is opened. Sensitivity, specificity, F1, and Cohen's kappa at multiple thresholds, broken out by study type (intervention, prognostic, DTA, NMA, scoping) and by the three model providers we run.
Reported per task: title/abstract screening, full-text screening, RoB-domain tagging, and structured data extraction. Confidence intervals via bootstrap. Versioned PDF + reproducibility artefacts published from this page.
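As a rough illustration of "confidence intervals via bootstrap", the sketch below computes a percentile bootstrap interval for sensitivity by resampling records with replacement. The percentile method, the resample count, and all function names are assumptions made for this example; the pre-registered protocol may specify something different.

```python
import random


def sensitivity(predicted, actual):
    """True positives divided by all actual includes (NaN if there are none)."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    positives = sum(actual)
    return tp / positives if positives else float("nan")


def bootstrap_ci(predicted, actual, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for sensitivity, resampling records."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(actual)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        s = sensitivity([predicted[i] for i in idx], [actual[i] for i in idx])
        if s == s:  # NaN != NaN, so this skips resamples with no includes
            stats.append(s)
    if not stats:
        raise ValueError("no resample contained an actual include")
    stats.sort()
    lower = stats[int((alpha / 2) * len(stats))]
    upper = stats[int((1 - alpha / 2) * len(stats)) - 1]
    return lower, upper


actual = [True] * 10 + [False] * 10
predicted = [True] * 8 + [False] * 12   # 8 of 10 includes found
point = sensitivity(predicted, actual)
low, high = bootstrap_ci(predicted, actual)
```

With only ten true includes the interval is wide, which is why a held-out partition needs enough positives per study type before a per-type figure means much.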
Quarterly
Re-run on new model versions
Each new Claude / GPT / Gemini release gets re-benchmarked against the same frozen partition. Deltas (and any regressions) published with the next quarterly report so customers can see the trajectory, not just a snapshot.
Frozen partition + identical prompts + identical thresholds. Any change to harness or prompts is a separate report so the model-version delta stays clean.
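One way to keep the model-version delta clean is to fingerprint everything that must stay fixed and refuse to compare runs whose fingerprints differ. The sketch below is a hypothetical harness, not mapped's: the function names and the run dictionary layout are invented for illustration.

```python
import hashlib
import json


def harness_fingerprint(partition_ids, prompt_template, thresholds):
    """SHA-256 over everything that must stay fixed between model versions."""
    payload = json.dumps(
        {
            "partition": sorted(partition_ids),
            "prompt": prompt_template,
            "thresholds": sorted(thresholds),
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def version_delta(old_run, new_run):
    """Metric deltas between two runs, refused if the harness changed."""
    if old_run["fingerprint"] != new_run["fingerprint"]:
        raise ValueError("harness changed; report it separately")
    return {
        name: new_run["metrics"][name] - old_run["metrics"][name]
        for name in old_run["metrics"]
    }


fp = harness_fingerprint(["rec-2", "rec-1"], "screening prompt v1", [0.5, 0.3])
old_run = {"fingerprint": fp, "metrics": {"sensitivity": 0.95}}
new_run = {"fingerprint": fp, "metrics": {"sensitivity": 0.93}}
delta = version_delta(old_run, new_run)  # negative value = regression
```

The metric values here are made up for the example. Sorting the record IDs and thresholds before hashing means ordering differences do not register as a harness change, while any edit to the prompt text does.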
Continuous
Aggregate calibration data from real reviews
With explicit customer consent, we publish anonymised per-topic distributions of calibration sensitivity, override rates, and disagreement rates. This is the closest signal a prospective customer can get to “does this work for my domain?” short of running their own pilot.
No record-level data. No project metadata. Distributions only, published once at least 30 projects per topic have been aggregated (small-cell suppression).
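The small-cell rule can be sketched as a filter applied before any summary is computed. Only the 30-project minimum comes from the text above; the function name, the input layout, and the choice of median and quartiles as the published summary are assumptions for this example.

```python
from statistics import median, quantiles

MIN_PROJECTS = 30  # from the text: at least 30 projects per topic


def publishable_distributions(sensitivity_by_topic, min_projects=MIN_PROJECTS):
    """Summarise per-topic calibration sensitivity, suppressing small cells.

    sensitivity_by_topic: {topic: [one calibration sensitivity per project]}
    """
    published = {}
    for topic, values in sensitivity_by_topic.items():
        if len(values) < min_projects:
            continue  # suppressed: too few projects to publish
        q1, _, q3 = quantiles(values, n=4)  # three quartile cut points
        published[topic] = {
            "n_projects": len(values),
            "median": median(values),
            "q1": q1,
            "q3": q3,
        }
    return published


data = {
    "cardiology": [0.9] * 30,   # meets the minimum, so it is summarised
    "oncology": [0.85] * 29,    # one project short, so it is suppressed
}
summary = publishable_distributions(data)
```

Only aggregate statistics leave the function; the per-project values themselves are never part of the output.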

Reproducibility

How customers reproduce it

The product's audit trail is the validation surface. Every manuscript export from mapped includes the RAISE 17-item checklist auto-filled with that project's calibration result, threshold, override rate, and dispute resolution. So even before the public benchmark exists, every review run on mapped can be RAISE-reported today.

What gets logged automatically

• Model name and version (Claude / GPT / Gemini) per task
• Prompt template hash + parameters
• Threshold accepted at the calibration gate
• Sensitivity / specificity implied at that threshold
• Per-task human override rate during review
• Inter-rater agreement (Cohen's kappa) where applicable
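To make the logged fields concrete, here is a minimal sketch of what one audit-trail entry could look like, with the override rate and Cohen's kappa computed from decisions. The field names, the `AuditEntry` type, and the JSON layout are hypothetical and do not describe mapped's actual schema; the sensitivity and specificity values are placeholders.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Optional


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters making include/exclude decisions."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    p_a = sum(rater_a) / n  # share of includes from rater A
    p_b = sum(rater_b) / n  # share of includes from rater B
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)  # agreement by chance
    if expected == 1.0:
        return 1.0  # both raters gave the same single label throughout
    return (observed - expected) / (1 - expected)


def override_rate(ai_decisions, final_decisions):
    """Share of AI decisions that a human reviewer changed."""
    changed = sum(a != f for a, f in zip(ai_decisions, final_decisions))
    return changed / len(ai_decisions)


@dataclass(frozen=True)
class AuditEntry:
    task: str
    model_name: str
    model_version: str
    prompt_template_hash: str
    threshold: float
    sensitivity: float
    specificity: float
    override_rate: float
    kappa: Optional[float] = None  # only where two raters are compared


ai = [True, True, False, False]
human = [True, False, False, False]
entry = AuditEntry(
    task="title_abstract_screening",
    model_name="example-model",
    model_version="example-version",
    prompt_template_hash=hashlib.sha256(b"prompt template text").hexdigest(),
    threshold=0.2,
    sensitivity=0.96,   # placeholder value
    specificity=0.71,   # placeholder value
    override_rate=override_rate(ai, human),
    kappa=cohens_kappa(ai, human),
)
record = json.dumps(asdict(entry), sort_keys=True)
```

Storing a hash of the prompt template rather than the text keeps the entry compact while still making any prompt change detectable.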

What you receive on export

• RAISE 17-item checklist (PDF + machine-readable JSON)
• PRISMA 2020 “study selection” + “data collection” sections pre-filled
• Per-task validation log appended to the manuscript
• Anonymised reproducibility bundle (calibration set + thresholds)

Want the full audit-trail walkthrough? It lives inside the product. Calibration is on the screening step; the RAISE checklist appears on the manuscript export step.

See it in the workflow
