
Cohen's Kappa, Gwet's AC1, and What to Report for Screening Reliability

A working stats explainer on inter-rater reliability for systematic review screening: when Cohen's kappa fails (the kappa paradox), why Gwet's AC1 fixes it, and what to report in the methods section to satisfy a 2026 reviewer.

Mapped Methodology Team · Methodology Team
Tags: inter-rater-reliability, kappa, gwet, screening, statistics

Inter-rater reliability is one of those review steps that is straightforward to perform and easy to report wrongly. Most systematic reviews include a Cohen's kappa value somewhere in the methods. Roughly half of those values are misleading — not because the calculation is wrong, but because Cohen's kappa is the wrong statistic for the data being reported.

This is a working explainer on what the screening reliability literature actually says, what to compute, and what to report so a 2026 methodology peer reviewer accepts the analysis.

The setup: why we measure reliability at all

When two reviewers screen the same records, the question is not "do they agree?" — they will agree most of the time, because most records are obvious excludes. The question is "do they agree more than chance would predict?"

Inter-rater reliability statistics try to subtract chance agreement and report the residual: agreement that reflects shared methodology and judgment, not random luck.

For screening specifically, reliability matters for three reasons.

  1. Calibration. Before full screening starts, two reviewers screen a small sample (typically 100–200 records) to surface disagreements about the inclusion criteria. Low reliability here means the criteria need refinement — not that the reviewers are weak.
  2. Audit. During screening, periodic reliability checks catch drift. If reliability degrades, the reviewers may have started to interpret criteria differently.
  3. Reporting. PRISMA 2020 and journal reviewers expect a reliability statistic in the methods section to evidence that screening was performed by trained reviewers applying consistent criteria.

The statistics chosen for these jobs matter.

Cohen's kappa, defined

Cohen's kappa (κ) is the conventional inter-rater reliability statistic for two raters applying a categorical (often binary) judgment.

The formula is:

κ = (Po − Pe) ÷ (1 − Pe)

Where:

  • Po is the observed proportion of agreement
  • Pe is the proportion of agreement expected by chance, given each rater's marginal frequencies

For a 2×2 binary screening table:

| | Reviewer B: include | Reviewer B: exclude | Total |
|---|---|---|---|
| Reviewer A: include | TP | FN_A | n₁ |
| Reviewer A: exclude | FP_A | TN | n₂ |
| Total | m₁ | m₂ | N |

Where Po = (TP + TN) / N, and Pe = (n₁/N × m₁/N) + (n₂/N × m₂/N), the agreement two raters with these marginal totals would reach by chance.
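The definition above translates directly into a few lines of code. This is a minimal sketch, not a replacement for a tested statistics package; the function name `cohen_kappa_2x2` and the two balanced tables are illustrative assumptions, and the cell names follow the 2×2 table above.

```python
def cohen_kappa_2x2(tp, fn_a, fp_a, tn):
    """Cohen's kappa from the four cells of the 2x2 screening table.

    tp   -- both reviewers include
    fn_a -- Reviewer A includes, Reviewer B excludes
    fp_a -- Reviewer A excludes, Reviewer B includes
    tn   -- both reviewers exclude
    """
    n = tp + fn_a + fp_a + tn
    if n == 0:
        raise ValueError("empty table")
    n1, n2 = tp + fn_a, fp_a + tn  # Reviewer A row totals
    m1, m2 = tp + fp_a, fn_a + tn  # Reviewer B column totals
    po = (tp + tn) / n
    pe = (n1 / n) * (m1 / n) + (n2 / n) * (m2 / n)
    if pe == 1:
        raise ValueError("kappa is undefined when both raters use one category")
    return (po - pe) / (1 - pe)


# Hypothetical balanced tables, chosen only to show the two reference points.
perfect = cohen_kappa_2x2(50, 0, 0, 50)   # full agreement
chance = cohen_kappa_2x2(25, 25, 25, 25)  # agreement no better than chance
print(perfect, chance)
```

With balanced marginals, full agreement gives κ = 1 and coin-flip agreement gives κ = 0, which is the behaviour the formula is designed to produce.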

Landis and Koch (1977) proposed interpretive thresholds: below 0 poor, 0.00–0.20 slight, 0.21–0.40 fair, 0.41–0.60 moderate, 0.61–0.80 substantial, 0.81–1.00 almost perfect. These thresholds are widely cited but, per Sim and Wright (2005) and McHugh (2012), should be treated as orientation, not gates.

The kappa paradox

The kappa paradox is the case where two reviewers agree on 95%+ of decisions and yet kappa is low — sometimes near zero, sometimes negative.

It happens because Pe (expected chance agreement) becomes very large when one class is dominant. In a screening set with 5% prevalence, two raters who each included 5% of records at random would still agree about 90% of the time (0.05² + 0.95² ≈ 0.905), simply because both exclude almost everything. That leaves little room for kappa's "subtract chance" mechanism to find signal.
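To see how quickly chance agreement crowds out the signal, the sketch below holds raw agreement fixed at 95% and varies prevalence. It assumes, for simplicity, that both raters include the same proportion of records; the function names are illustrative.

```python
def chance_agreement(prevalence):
    """Pe when both raters include the same proportion of records (an assumption)."""
    return prevalence ** 2 + (1 - prevalence) ** 2


def kappa_from(po, pe):
    """Cohen's kappa given observed and chance agreement."""
    return (po - pe) / (1 - pe)


# Raw agreement is held at 0.95; only prevalence changes.
for p in (0.50, 0.20, 0.10, 0.05, 0.03):
    pe = chance_agreement(p)
    print(f"prevalence={p:.2f}  Pe={pe:.3f}  kappa={kappa_from(0.95, pe):.3f}")
```

Under these assumptions kappa falls from 0.90 at balanced prevalence to below 0.5 at 5% prevalence, even though the reviewers' raw agreement never moved.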

Feinstein and Cicchetti (1990) presented the paradox formally with worked examples. Byrt et al. (1993) extended the analysis. The conclusion across this literature is consistent: Cohen's kappa is unreliable as a standalone statistic when class prevalence is highly imbalanced.

A worked example, taken from a typical screening calibration:

| | Reviewer B: include | Reviewer B: exclude | Total |
|---|---|---|---|
| Reviewer A: include | 8 | 2 | 10 |
| Reviewer A: exclude | 3 | 187 | 190 |
| Total | 11 | 189 | 200 |

  • Po = (8 + 187) / 200 = 0.975 (97.5% raw agreement)
  • Pe = (10/200 × 11/200) + (190/200 × 189/200) = 0.00275 + 0.89775 = 0.9005
  • κ = (0.975 − 0.9005) / (1 − 0.9005) = 0.749

That is substantial agreement by Landis and Koch, but short of the "almost perfect" band. The two reviewers agreed on 195 out of 200 records, yet kappa landed at about 0.75 because chance agreement was so high.

Now imagine the same raw agreement (97.5%) but with all 5 disagreements being records that Reviewer A included and Reviewer B excluded:

| | Reviewer B: include | Reviewer B: exclude | Total |
|---|---|---|---|
| Reviewer A: include | 5 | 5 | 10 |
| Reviewer A: exclude | 0 | 190 | 190 |
| Total | 5 | 195 | 200 |

  • Po = (5 + 190) / 200 = 0.975
  • Pe = (10/200 × 5/200) + (190/200 × 195/200) = 0.00125 + 0.92625 = 0.9275
  • κ = (0.975 − 0.9275) / (1 − 0.9275) = 0.655

Same percent agreement, lower kappa, because the marginals are even more imbalanced. The statistic has not measured anything about reviewer quality — only about prevalence.

Gwet's AC1 — designed for this case

Gwet (2002, 2008) introduced AC1 (Agreement Coefficient 1) as a chance-corrected agreement statistic that does not collapse under prevalence imbalance. The intuition is that Gwet's AC1 estimates Pe based on the probability that a rater is making a "random" judgment versus a "considered" judgment, and adjusts accordingly.

For binary classification:

AC1 = (Po − Pe(γ)) ÷ (1 − Pe(γ))

Where Pe(γ) = 2π(1 − π), and π is the average proportion classified as positive across raters.

The same two examples above:

  • Example 1 (5 disagreements split 2/3): π = 0.0525, Pe(γ) ≈ 0.0995, AC1 = (0.975 − 0.0995) / (1 − 0.0995) ≈ 0.972
  • Example 2 (all 5 disagreements in the A-include, B-exclude cell): π = 0.0375, Pe(γ) ≈ 0.0722, AC1 = (0.975 − 0.0722) / (1 − 0.0722) ≈ 0.973

Same Po, nearly identical AC1. The statistic now reflects what the reviewers actually did, which was agree on 97.5% of records, without being distorted by the imbalanced marginals.
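Both worked examples can be recomputed side by side. The sketch below is a plain implementation of the two formulas given in this post for the binary case; the function name `agreement_stats` and its argument names are illustrative, and results may differ in the third decimal from hand calculations that round Pe first.

```python
def agreement_stats(both_inc, a_only, b_only, both_exc):
    """Raw agreement, Cohen's kappa, and Gwet's AC1 for a binary 2x2 table.

    both_inc -- both reviewers include
    a_only   -- only Reviewer A includes
    b_only   -- only Reviewer B includes
    both_exc -- both reviewers exclude
    """
    n = both_inc + a_only + b_only + both_exc
    po = (both_inc + both_exc) / n
    pa = (both_inc + a_only) / n  # Reviewer A inclusion rate
    pb = (both_inc + b_only) / n  # Reviewer B inclusion rate
    pe_kappa = pa * pb + (1 - pa) * (1 - pb)
    pi = (pa + pb) / 2            # average inclusion rate across raters
    pe_ac1 = 2 * pi * (1 - pi)
    return {
        "po": po,
        "prevalence": pi,
        "kappa": (po - pe_kappa) / (1 - pe_kappa),
        "ac1": (po - pe_ac1) / (1 - pe_ac1),
    }


example_1 = agreement_stats(8, 2, 3, 187)
example_2 = agreement_stats(5, 5, 0, 190)
for name, stats in (("Example 1", example_1), ("Example 2", example_2)):
    print(name, {key: round(value, 3) for key, value in stats.items()})
```

Kappa moves by roughly 0.09 between the two tables while AC1 moves by about 0.001, which is the stability property the section describes.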

Wongpakaran et al. (2013) compared AC1 against kappa across multiple datasets and concluded that AC1 is the more stable statistic when prevalence is not balanced. The Cochrane Methods Group and several recent methodological reviews have begun recommending AC1 as a default for screening reliability — though kappa remains the convention readers expect to see.

What to actually report

A defensible methods-section paragraph for screening reliability looks roughly like this:

Inter-rater reliability. Two reviewers (A and B) independently screened a calibration sample of 200 records randomly drawn from the search. Raw percent agreement was 97.5%. Cohen's κ was 0.75 (95% CI 0.59–0.91). Gwet's AC1 was 0.97 (95% CI 0.94–1.00). Marginal inclusion prevalence was 0.0525. Disagreements were resolved by discussion before full screening; criteria were not amended.

Five elements, none redundant.

  • Raw percent agreement — the unprocessed observation. Always defensible.
  • Cohen's kappa with CI — the convention. Reviewers expect it.
  • Gwet's AC1 with CI — the prevalence-stable alternative. Increasingly expected.
  • Marginal prevalence — the diagnostic that explains any divergence between κ and AC1.
  • Resolution mechanism — how disagreements were handled, and whether criteria changed.

This is not over-reporting. It is exactly the information a methodology reviewer needs to interpret the numbers honestly.
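Confidence intervals are the element most often left out, so here is one way to obtain them. This is a sketch of a percentile bootstrap, which is an assumption on our part: the intervals in the example paragraph may have been derived differently (for instance from analytic standard errors), so the output need not match them. The record-level data are rebuilt from the calibration table used earlier (8 joint includes, 2 A-only, 3 B-only, 187 joint excludes).

```python
import math
import random


def kappa_and_ac1(pairs):
    """pairs: list of (a, b) booleans, True meaning 'include'."""
    n = len(pairs)
    po = sum(a == b for a, b in pairs) / n
    pa = sum(a for a, _ in pairs) / n
    pb = sum(b for _, b in pairs) / n
    pe_kappa = pa * pb + (1 - pa) * (1 - pb)
    pi = (pa + pb) / 2
    pe_ac1 = 2 * pi * (1 - pi)
    kappa = (po - pe_kappa) / (1 - pe_kappa) if pe_kappa < 1 else math.nan
    return kappa, (po - pe_ac1) / (1 - pe_ac1)


def bootstrap_ci(pairs, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap intervals for kappa and AC1, resampling records."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    kappas, ac1s = [], []
    for _ in range(n_boot):
        sample = rng.choices(pairs, k=len(pairs))
        kappa, ac1 = kappa_and_ac1(sample)
        if not math.isnan(kappa):  # kappa is undefined if a resample has no includes
            kappas.append(kappa)
        ac1s.append(ac1)

    def percentile_interval(values):
        values = sorted(values)
        lower = values[int((alpha / 2) * len(values))]
        upper = values[int((1 - alpha / 2) * len(values)) - 1]
        return lower, upper

    return percentile_interval(kappas), percentile_interval(ac1s)


pairs = ([(True, True)] * 8 + [(True, False)] * 2
         + [(False, True)] * 3 + [(False, False)] * 187)
kappa_ci, ac1_ci = bootstrap_ci(pairs)
print("kappa 95% CI:", kappa_ci)
print("AC1 95% CI:", ac1_ci)
```

Whatever method is used, name it in the methods section so readers can reproduce the interval.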

Calibration sample size

The standard advice is 100–200 records for the calibration IRR check. Smaller samples produce confidence intervals wide enough that "moderate" and "substantial" agreement are statistically indistinguishable.

For low-prevalence screening sets, the relevant sample size depends not on N but on the count of positives. With 5% prevalence and N = 100, you get ~5 expected positives — too few to inform a reliable κ on the include cell. Two practical responses:

  1. Oversample positives. If you have a labeled set from a prior pilot or external benchmark, draw the calibration sample to ensure 30+ positives. Adjust the kappa/AC1 calculation accordingly (or report unadjusted with a note that the sample was stratified).
  2. Use larger samples. 400–500 records is increasingly common in 2026 reviews where AI-assisted screening calibration overlaps with IRR calibration. The AI calibration sample can serve double duty.
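The arithmetic behind the positives count is easy to check before drawing a sample. The sketch below assumes records are sampled at random, so the number of positives is binomial; the function names and the 30-positive target (taken from the oversampling suggestion above) are illustrative.

```python
import math


def records_needed(target_positives, prevalence):
    """Smallest N whose expected number of positives reaches the target."""
    return math.ceil(target_positives / prevalence)


def prob_at_least(k, n, prevalence):
    """P(X >= k) for X ~ Binomial(n, prevalence), computed exactly."""
    below = sum(
        math.comb(n, i) * prevalence ** i * (1 - prevalence) ** (n - i)
        for i in range(k)
    )
    return 1 - below


prevalence = 0.05
print("expected positives at N=100:", 100 * prevalence)
print("N for 30 expected positives:", records_needed(30, prevalence))
print("P(at least 30 positives | N=500):", round(prob_at_least(30, 500, prevalence), 3))
```

A random sample sized for 30 expected positives only reaches that count about half the time, which is one reason to stratify when a labeled pilot set is available.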

How AI changes the picture

When AI is one of the "raters" in a workflow, the same statistics apply — but the framing shifts.

Two patterns are emerging in 2026 protocols.

Pattern A: human–human IRR for calibration; AI–human override rate during deployment. Two reviewers calibrate on 200 records; report κ and AC1. The AI then screens at scale; report the override rate (proportion of AI judgments a human reversed). This is the cleanest pattern and the one the screening recall/precision explainer presupposes.

Pattern B: AI–human IRR alongside human–human. Two reviewers and the AI all judge the same calibration sample. Report three pairwise κ/AC1 values: A vs B, A vs AI, B vs AI. This pattern is more demanding but produces a richer reliability picture, and it is the standard for single-vs-dual-reviewer comparisons when AI is meant to substitute for one human.

Either pattern is defensible if specified in the protocol. The 2025 Cochrane AI position (see Responsible AI in Systematic Reviews) is silent on the choice but explicit that some reliability metric must be reported for AI-assisted tasks.
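Both patterns reduce to small computations on label lists. The sketch below covers the three pairwise comparisons of Pattern B and the override rate of Pattern A; every label in it is hypothetical, and the function names are ours rather than any library's.

```python
from itertools import combinations


def kappa_ac1(x, y):
    """Cohen's kappa and Gwet's AC1 for two equal-length lists of 0/1 decisions."""
    if len(x) != len(y) or not x:
        raise ValueError("need two non-empty lists of equal length")
    n = len(x)
    po = sum(a == b for a, b in zip(x, y)) / n
    px, py = sum(x) / n, sum(y) / n
    pe_kappa = px * py + (1 - px) * (1 - py)
    pi = (px + py) / 2
    pe_ac1 = 2 * pi * (1 - pi)
    kappa = (po - pe_kappa) / (1 - pe_kappa) if pe_kappa < 1 else float("nan")
    return kappa, (po - pe_ac1) / (1 - pe_ac1)


def override_rate(ai_decisions, final_decisions):
    """Share of AI judgments that the human-reviewed final decision reversed."""
    reversed_count = sum(a != f for a, f in zip(ai_decisions, final_decisions))
    return reversed_count / len(ai_decisions)


# Hypothetical calibration labels (1 = include, 0 = exclude) on 20 records.
labels = {
    "A": [1, 1, 1] + [0] * 17,
    "B": [1, 1, 0] + [0] * 16 + [1],
    "AI": [1, 1, 1, 1] + [0] * 16,
}

# Pattern B: one kappa/AC1 pair per rater pairing.
pairwise = {
    f"{r1} vs {r2}": kappa_ac1(labels[r1], labels[r2])
    for r1, r2 in combinations(labels, 2)
}
for pair, (kappa, ac1) in pairwise.items():
    print(f"{pair}: kappa={kappa:.2f}, AC1={ac1:.2f}")

# Pattern A: compare AI output with the final human-reviewed decisions.
print("override rate:", override_rate([1, 0, 1, 0], [1, 0, 0, 0]))
```

Twenty records is far too few for a real calibration; the size here is only to keep the example readable.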

Common failure modes

Three patterns we see in submitted methods sections.

"Cohen's κ = 0.42, considered moderate agreement (Landis and Koch)." The 0.42 number means little without prevalence and raw agreement. If prevalence is 0.05, the same agreement rate could correspond to AC1 = 0.95 — a very different conclusion.

"Excellent agreement (κ > 0.80)." Without confidence intervals or sample size, "excellent" is rhetorical. Report κ ± CI; let readers compute their own adjective.

"IRR was assessed by spot-checking 5% of records." Spot-checking is not IRR. IRR is a structured comparison on the same records by the same raters. Spot-checking is a quality-assurance pattern; report it as such, not as IRR.

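The first failure mode can be checked with a few lines of arithmetic. The sketch below back-solves raw agreement from a reported κ and a prevalence, then computes the AC1 implied by the same data. It assumes both reviewers include at the same rate, which a real table will only approximate, and the function name is illustrative.

```python
def implied_agreement(kappa, prevalence):
    """Raw agreement and AC1 implied by a kappa value at a given prevalence.

    Assumes both raters include the same proportion of records.
    """
    pe_kappa = prevalence ** 2 + (1 - prevalence) ** 2
    po = kappa * (1 - pe_kappa) + pe_kappa  # invert kappa = (po - pe) / (1 - pe)
    pe_ac1 = 2 * prevalence * (1 - prevalence)
    ac1 = (po - pe_ac1) / (1 - pe_ac1)
    return po, ac1


po, ac1 = implied_agreement(kappa=0.42, prevalence=0.05)
print(f"raw agreement={po:.3f}, AC1={ac1:.3f}")
```

Under that assumption a "moderate" κ of 0.42 sits on top of roughly 94% raw agreement and an AC1 in the mid-0.90s, which is why the three numbers need to be reported together.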
Putting it to work this week

Three concrete steps before your next screening kickoff:

  1. Draw a calibration sample of at least 200 records, stratified to include 30+ positives if possible.
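  2. Compute and report all five elements: percent agreement, κ, AC1, prevalence, resolution mechanism. Use the R packages irr (kappa) and irrCAC (AC1). In Python, statsmodels.stats.inter_rater covers Cohen's kappa; as far as we know it does not include AC1, which can be computed directly from the formula above.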
  3. Pre-register your reliability acceptance threshold in the protocol — both the κ and the AC1 floor — so the team is not negotiating it after seeing the numbers.
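One lightweight way to hold the team to step 3 is to write the floors down as data before calibration and check against them mechanically. The sketch below is only an illustration: the floor values 0.60 and 0.80 are hypothetical placeholders, not recommendations, and the class and function names are ours.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReliabilityFloor:
    """Acceptance thresholds fixed in the protocol before screening starts."""
    kappa: float
    ac1: float


def meets_floor(kappa, ac1, floor):
    """True only if both statistics reach their pre-registered floor."""
    return kappa >= floor.kappa and ac1 >= floor.ac1


# Hypothetical values; set your own in the protocol, before seeing any data.
PROTOCOL_FLOOR = ReliabilityFloor(kappa=0.60, ac1=0.80)

print(meets_floor(0.75, 0.97, PROTOCOL_FLOOR))
print(meets_floor(0.42, 0.94, PROTOCOL_FLOOR))
```

Freezing the dataclass is a small guard against the thresholds being edited in code after the calibration numbers are known.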

The statistics are simple. The discipline is in choosing the right one and reporting it honestly.

Further reading

  • Cohen J. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 1960.
  • Feinstein AR, Cicchetti DV. High agreement but low kappa: I. The problems of two paradoxes. Journal of Clinical Epidemiology, 1990.
  • Byrt T, Bishop J, Carlin JB. Bias, prevalence and kappa. Journal of Clinical Epidemiology, 1993.
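  • Gwet KL. Kappa statistic is not satisfactory for assessing the extent of agreement between raters. Statistical Methods for Inter-Rater Reliability Assessment, 2002.
  • Gwet KL. Computing inter-rater reliability and its variance in the presence of high agreement. British Journal of Mathematical and Statistical Psychology, 2008.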
  • Wongpakaran N, Wongpakaran T, Wedding D, Gwet KL. A comparison of Cohen's Kappa and Gwet's AC1 when calculating inter-rater reliability coefficients: a study conducted with personality disorder samples. BMC Medical Research Methodology, 2013.
  • McHugh ML. Interrater reliability: the kappa statistic. Biochemia Medica, 2012.
  • Sim J, Wright CC. The kappa statistic in reliability studies: use, interpretation, and sample size requirements. Physical Therapy, 2005.

For the screening-metric layer reliability supports, see Why 99% Recall Is the Floor. For the single-vs-dual-reviewer methodological choice this statistic informs, see Single-Reviewer vs Dual-Reviewer Screening. For the policy framework that increasingly requires AI-related reliability metrics, see Responsible AI in Systematic Reviews.
