What is the kappa paradox?

The kappa paradox is the well-documented case where two reviewers agree on 95%+ of decisions yet Cohen's kappa is low (sometimes near zero). It happens when one class is very rare, which is typical for systematic review screening, where 95–98% of records are excluded. When both reviewers exclude almost everything, the agreement expected by chance is already close to 1, so the small headroom left above chance is dominated by the handful of disagreements on the rare class, and the statistic collapses. Feinstein and Cicchetti formalized the paradox in 1990; Gwet's AC1 (2002) was designed specifically to fix it.

Should I report Cohen's kappa or Gwet's AC1?

Both, with prevalence and percent agreement. Cohen's kappa is the convention reviewers expect; Gwet's AC1 is the statistic that does not collapse under prevalence imbalance. Reporting both — alongside raw percent agreement and the prevalence/marginal proportions — lets readers diagnose any disagreement and judge reliability honestly. Reporting kappa alone in a low-prevalence screening set is misleading.

What kappa is acceptable for screening?

Landis and Koch (1977) proposed >0.80 as 'almost perfect,' 0.61–0.80 as 'substantial,' 0.41–0.60 as 'moderate.' Methodologists since Sim and Wright (2005) have argued these thresholds are arbitrary and prevalence-dependent. Use them as orientation, not gates. The defensible practice in 2026 is to report κ, AC1, raw agreement, and prevalence — and let the methodology section explain the chosen interpretation.

How big should the IRR sample be?

100–200 records is the conventional range for a calibration IRR check before full screening. Smaller samples produce confidence intervals so wide they cannot distinguish 'substantial' from 'poor' agreement. For low-prevalence sets where positives are sparse, oversample positives to ensure the IRR statistic is informative on the relevant class — not just on the easy 'exclude all' decisions.

Does AI-assisted screening change inter-rater reliability requirements?

Yes. Under the 2025 Cochrane AI position and RAISE, the AI–human override rate is a reliability metric that lives alongside κ. Some teams report κ between AI and each reviewer separately; others report human–human κ for the calibration sample plus AI override rate during deployment. Both are defensible; the choice should be specified in the protocol.

Cohen's Kappa, Gwet's AC1, and What to Report for Screening Reliability

Inter-rater reliability is one of those review steps that is straightforward to perform and easy to report wrongly. Most systematic reviews include a Cohen's kappa value somewhere in the methods. Roughly half of those values are misleading — not because the calculation is wrong, but because Cohen's kappa is the wrong statistic for the data being reported.

This is a working explainer on what the screening reliability literature actually says, what to compute, and what to report so a 2026 methodology peer reviewer accepts the analysis.

The setup: why we measure reliability at all

When two reviewers screen the same records, the question is not "do they agree?" — they will agree most of the time, because most records are obvious excludes. The question is "do they agree more than chance would predict?"

Inter-rater reliability statistics try to subtract chance agreement and report the residual: agreement that reflects shared methodology and judgment, not random luck.

For screening specifically, reliability matters for three reasons.

Calibration. Before full screening starts, two reviewers screen a small sample (typically 100–200 records) to surface disagreements about the inclusion criteria. Low reliability here means the criteria need refinement — not that the reviewers are weak.
Audit. During screening, periodic reliability checks catch drift. If reliability degrades, the reviewers may have started to interpret criteria differently.
Reporting. PRISMA 2020 and journal reviewers expect a reliability statistic in the methods section to evidence that screening was performed by trained reviewers applying consistent criteria.

The statistics chosen for these jobs matter.

Cohen's kappa, defined

Cohen's kappa (κ) is the conventional inter-rater reliability statistic for two raters applying a categorical (often binary) judgment.

The formula is:

κ = (Po − Pe) ÷ (1 − Pe)

Where:

Po is the observed proportion of agreement
Pe is the proportion of agreement expected by chance, given each rater's marginal frequencies

For a 2×2 binary screening table:

	Reviewer B: include	Reviewer B: exclude	Total
Reviewer A: include	TP	FN_A	n₁
Reviewer A: exclude	FP_A	TN	n₂
Total	m₁	m₂	N

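Where Po = (TP + TN) / N, and Pe = (n₁/N × m₁/N) + (n₂/N × m₂/N): the probability that both raters include plus the probability that both exclude, if each rater's decisions were independent of the other's.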
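The definitions above translate directly into a few lines of code. The following is a minimal sketch, not a reference implementation; the function name and the cell names a, b, c, d (standing for TP, FN_A, FP_A, TN in the table above) are chosen here for illustration.

```python
def cohen_kappa_2x2(a, b, c, d):
    """Cohen's kappa for a 2x2 screening table.

    a = both include, b = A include / B exclude,
    c = A exclude / B include, d = both exclude.
    """
    n = a + b + c + d
    po = (a + d) / n                          # observed agreement
    a_inc, b_inc = (a + b) / n, (a + c) / n   # each rater's include rate
    # Chance agreement: both include by chance + both exclude by chance.
    pe = a_inc * b_inc + (1 - a_inc) * (1 - b_inc)
    if pe == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (po - pe) / (1 - pe)


# Sanity checks: perfect agreement gives 1.0, agreement exactly at chance gives 0.0.
print(cohen_kappa_2x2(20, 0, 0, 80))
print(cohen_kappa_2x2(25, 25, 25, 25))
```

The explicit check for Pe = 1 covers the edge case where both reviewers put every record in the same class, which leaves kappa undefined.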

Landis and Koch (1977) proposed interpretive thresholds: below 0 poor, 0.00–0.20 slight, 0.21–0.40 fair, 0.41–0.60 moderate, 0.61–0.80 substantial, 0.81–1.00 almost perfect. These thresholds are widely cited but, per Sim and Wright (2005) and McHugh (2012), should be treated as orientation, not gates.

The kappa paradox

The kappa paradox is the case where two reviewers agree on 95%+ of decisions and yet kappa is low — sometimes near zero, sometimes negative.

It happens because Pe (expected chance agreement) becomes very large when one class is dominant. In a screening set with 5% prevalence, two raters who each included 5% of records at random would still agree about 90% of the time (0.05² + 0.95² ≈ 0.905), almost entirely on excludes. That leaves less than 0.10 of headroom above chance, so every disagreement costs a large share of it.

Feinstein and Cicchetti (1990) presented the paradox formally with worked examples. Byrt et al. (1993) extended the analysis. The conclusion across this literature is consistent: Cohen's kappa is unreliable as a standalone statistic when class prevalence is highly imbalanced.

A worked example, taken from a typical screening calibration:

	Reviewer B: include	Reviewer B: exclude	Total
Reviewer A: include	8	2	10
Reviewer A: exclude	3	187	190
Total	11	189	200

Po = (8 + 187) / 200 = 0.975 (97.5% raw agreement)
Pe = (10/200 × 11/200) + (190/200 × 189/200) = 0.00275 + 0.89775 = 0.9005
κ = (0.975 − 0.9005) / (1 − 0.9005) = 0.0745 / 0.0995 ≈ 0.749

Substantial agreement by Landis and Koch, but short of "almost perfect". The two reviewers agreed on 195 out of 200 records, yet kappa landed at about 0.75 because chance agreement was so high.

Now imagine the same raw agreement (97.5%) but with all 5 disagreements on records that Reviewer A included and Reviewer B excluded:

	Reviewer B: include	Reviewer B: exclude	Total
Reviewer A: include	5	5	10
Reviewer A: exclude	0	190	190
Total	5	195	200

Po = 0.975
Pe = (10/200 × 5/200) + (190/200 × 195/200) = 0.00125 + 0.92625 = 0.9275
κ = (0.975 − 0.9275) / (1 − 0.9275) = 0.0475 / 0.0725 ≈ 0.655

Same percent agreement, lower kappa, because Reviewer B's include rate dropped to 2.5% and expected chance agreement rose from 0.9005 to 0.9275. The drop reflects the shift in marginals, not any change in how often the reviewers agreed.
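To see how much of the effect comes from prevalence alone, it helps to put the two tables above next to a balanced one with the same raw agreement. The sketch below does that; the "balanced" table is a hypothetical added here for contrast and does not come from any real screening set.

```python
def kappa(a, b, c, d):
    """Return (Po, Pe, kappa) for a 2x2 table.

    a = both include, b = A include / B exclude,
    c = A exclude / B include, d = both exclude.
    """
    n = a + b + c + d
    po = (a + d) / n
    pa, pb = (a + b) / n, (a + c) / n        # include rate per reviewer
    pe = pa * pb + (1 - pa) * (1 - pb)       # chance agreement
    return po, pe, (po - pe) / (1 - pe)


tables = {
    "balanced (hypothetical)": (97, 2, 3, 98),
    "example 1": (8, 2, 3, 187),
    "example 2": (5, 5, 0, 190),
}

# All three tables have 195 agreements out of 200; only the marginals differ.
for name, cells in tables.items():
    po, pe, k = kappa(*cells)
    print(f"{name:24s} Po={po:.3f}  Pe={pe:.4f}  kappa={k:.3f}")
```

With identical raw agreement, the balanced table yields a kappa of about 0.95, while the two low-prevalence tables yield about 0.75 and 0.66.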

Gwet's AC1 — designed for this case

Gwet (2002, 2008) introduced AC1 (Agreement Coefficient 1) as a chance-corrected agreement statistic that does not collapse under prevalence imbalance. The intuition is that Gwet's AC1 estimates Pe based on the probability that a rater is making a "random" judgment versus a "considered" judgment, and adjusts accordingly.

For binary classification:

AC1 = (Po − Pe(γ)) ÷ (1 − Pe(γ))

Where Pe(γ) = 2π(1 − π), and π is the average proportion classified as positive across raters.

The same two examples above:

Example 1 (5 disagreements split 2/3): π = 0.0525, Pe(γ) ≈ 0.0995, AC1 = (0.975 − 0.0995) / (1 − 0.0995) ≈ 0.972
Example 2 (all 5 disagreements are A include, B exclude): π = 0.0375, Pe(γ) ≈ 0.0722, AC1 = (0.975 − 0.0722) / (1 − 0.0722) ≈ 0.973

Same Po, nearly identical AC1. The statistic now reflects what reviewers actually did, which is agree on 97.5% of records, without being distorted by the imbalanced marginals.
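The binary AC1 formula is short enough to compute by hand or in a few lines of code. The following sketch applies it to the two example tables; the function name is an illustrative choice, and the code covers only the two-rater, two-category case described here.

```python
def gwet_ac1_2x2(a, b, c, d):
    """Gwet's AC1 for two raters and two categories.

    a = both include, b = A include / B exclude,
    c = A exclude / B include, d = both exclude.
    """
    n = a + b + c + d
    po = (a + d) / n
    # pi: include rate averaged over the two raters.
    pi = ((a + b) / n + (a + c) / n) / 2
    pe_gamma = 2 * pi * (1 - pi)             # never exceeds 0.5
    return (po - pe_gamma) / (1 - pe_gamma)


for label, cells in [("example 1", (8, 2, 3, 187)),
                     ("example 2", (5, 5, 0, 190))]:
    print(label, round(gwet_ac1_2x2(*cells), 3))
```

Because 2π(1 − π) is bounded at 0.5 and shrinks as prevalence moves away from 50%, the chance correction gets smaller in skewed sets instead of larger, which is why AC1 stays close to raw agreement here.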

Wongpakaran et al. (2013) compared AC1 against kappa across multiple datasets and concluded that AC1 is the more stable statistic when prevalence is not balanced. The Cochrane Methods Group and several recent methodological reviews have begun recommending AC1 as a default for screening reliability — though kappa remains the convention readers expect to see.

What to actually report

A defensible methods-section paragraph for screening reliability looks roughly like this:

Inter-rater reliability. Two reviewers (A and B) independently screened a calibration sample of 200 records randomly drawn from the search. Raw percent agreement was 97.5%. Cohen's κ was 0.75 (95% CI 0.59–0.91). Gwet's AC1 was 0.97 (95% CI 0.94–1.00). Marginal inclusion prevalence was 0.0525. Disagreements were resolved by discussion before full screening; criteria were not amended.

Five elements, none redundant.

Raw percent agreement — the unprocessed observation. Always defensible.
Cohen's kappa with CI — the convention. Reviewers expect it.
Gwet's AC1 with CI — the prevalence-stable alternative. Increasingly expected.
Marginal prevalence — the diagnostic that explains any divergence between κ and AC1.
Resolution mechanism — how disagreements were handled, and whether criteria changed.

This is not over-reporting. It is exactly the information a methodology reviewer needs to interpret the numbers honestly.
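One way to produce the numeric elements together is sketched below, using only the Python standard library. The confidence interval is a percentile bootstrap over records, which is an assumption made for this sketch and not necessarily the method behind the intervals quoted in the example paragraph; analytic standard errors, such as those offered by dedicated packages, are an equally reasonable choice. All function names are illustrative.

```python
import random


def reliability(pairs):
    """pairs: list of (label_A, label_B), with 1 = include and 0 = exclude."""
    n = len(pairs)
    po = sum(x == y for x, y in pairs) / n
    pa = sum(x for x, _ in pairs) / n
    pb = sum(y for _, y in pairs) / n
    pe = pa * pb + (1 - pa) * (1 - pb)       # Cohen's chance agreement
    pi = (pa + pb) / 2                       # mean include rate
    pg = 2 * pi * (1 - pi)                   # Gwet's chance agreement
    kappa = (po - pe) / (1 - pe) if pe < 1 else float("nan")
    return {"agreement": po, "kappa": kappa,
            "ac1": (po - pg) / (1 - pg), "prevalence": pi}


def bootstrap_ci(pairs, key, reps=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for one statistic, resampling records."""
    rng = random.Random(seed)
    n = len(pairs)
    draws = []
    for _ in range(reps):
        sample = [pairs[rng.randrange(n)] for _ in range(n)]
        value = reliability(sample)[key]
        if value == value:                   # drop NaN from degenerate resamples
            draws.append(value)
    draws.sort()
    lower = draws[int((alpha / 2) * len(draws))]
    upper = draws[int((1 - alpha / 2) * len(draws)) - 1]
    return lower, upper


# The first worked example, expanded to one row per record.
pairs = [(1, 1)] * 8 + [(1, 0)] * 2 + [(0, 1)] * 3 + [(0, 0)] * 187
summary = reliability(pairs)
for key in ("kappa", "ac1"):
    lower, upper = bootstrap_ci(pairs, key)
    print(f"{key}: {summary[key]:.2f} (95% CI {lower:.2f} to {upper:.2f})")
print(f"agreement: {summary['agreement']:.3f}, prevalence: {summary['prevalence']:.4f}")
```

Whichever interval method is used, naming it in the methods section lets readers reproduce the numbers.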

Calibration sample size

The standard advice is 100–200 records for the calibration IRR check. Smaller samples produce confidence intervals wide enough that "moderate" and "substantial" agreement are statistically indistinguishable.

For low-prevalence screening sets, the relevant sample size depends not on N but on the count of positives. With 5% prevalence and N = 100, you get ~5 expected positives — too few to inform a reliable κ on the include cell. Two practical responses:

Oversample positives. If you have a labeled set from a prior pilot or external benchmark, draw the calibration sample to ensure 30+ positives. Adjust the kappa/AC1 calculation accordingly (or report unadjusted with a note that the sample was stratified).
Use larger samples. 400–500 records is increasingly common in 2026 reviews where AI-assisted screening calibration overlaps with IRR calibration. The AI calibration sample can serve double duty.
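The positive-count arithmetic can be checked quickly before committing to a sample size. The sketch below assumes a simple random sample and a binomial model for the number of positives; the 30-positive target comes from the text above, while the candidate sample sizes are arbitrary values picked for illustration.

```python
from math import comb


def prob_at_least(k, n, prevalence):
    """P(X >= k) for X ~ Binomial(n, prevalence)."""
    below = sum(
        comb(n, i) * prevalence**i * (1 - prevalence) ** (n - i)
        for i in range(k)
    )
    return 1 - below


prevalence = 0.05
for n in (100, 200, 500, 800):
    expected = n * prevalence
    chance = prob_at_least(30, n, prevalence)
    print(f"N={n}: expected positives={expected:.0f}, P(at least 30)={chance:.3f}")
```

Under these assumptions, a random sample at 5% prevalence has to grow well past 500 records before 30 positives become likely, which is the practical argument for stratifying when a labeled pilot set is available.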

How AI changes the picture

When AI is one of the "raters" in a workflow, the same statistics apply — but the framing shifts.

Two patterns are emerging in 2026 protocols.

Pattern A: human–human IRR for calibration; AI–human override rate during deployment. Two reviewers calibrate on 200 records; report κ and AC1. The AI then screens at scale; report the override rate (proportion of AI judgments a human reversed). This is the cleanest pattern and the one the screening recall/precision explainer presupposes.

Pattern B: AI–human IRR alongside human–human. Two reviewers and the AI all judge the same calibration sample. Report three pairwise κ/AC1 values: A vs B, A vs AI, B vs AI. This pattern is more demanding but produces a richer reliability picture, and it is the standard for single-vs-dual-reviewer comparisons when AI is meant to substitute for one human.

Either pattern is defensible if specified in the protocol. The 2025 Cochrane AI position (see Responsible AI in Systematic Reviews) is silent on the choice but explicit that some reliability metric must be reported for AI-assisted tasks.
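Both patterns reduce to the same small set of computations. The sketch below uses a made-up set of 20 calibration labels to show the three pairwise comparisons from Pattern B and an override rate in the spirit of Pattern A; treating Reviewer A's labels as the final human decision is an assumption made only to keep the example short.

```python
from itertools import combinations


def kappa_ac1(x, y):
    """Return (Cohen's kappa, Gwet's AC1) for two binary label lists."""
    n = len(x)
    po = sum(a == b for a, b in zip(x, y)) / n
    px, py = sum(x) / n, sum(y) / n
    pe = px * py + (1 - px) * (1 - py)
    pi = (px + py) / 2
    pg = 2 * pi * (1 - pi)
    # Note: kappa is undefined (division by zero) if pe == 1.
    return (po - pe) / (1 - pe), (po - pg) / (1 - pg)


def override_rate(ai, final):
    """Share of AI decisions that the human decision reversed."""
    return sum(a != f for a, f in zip(ai, final)) / len(ai)


# Hypothetical labels for 20 records (1 = include, 0 = exclude).
labels = {
    "A":  [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    "B":  [1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
    "AI": [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
}

for r1, r2 in combinations(labels, 2):
    k, ac1 = kappa_ac1(labels[r1], labels[r2])
    print(f"{r1} vs {r2}: kappa={k:.2f}, AC1={ac1:.2f}")

print(f"AI override rate: {override_rate(labels['AI'], labels['A']):.2f}")
```

In a real deployment the override rate would be computed on the records the AI screened at scale, not on the calibration sample.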

Common failure modes

Three patterns we see in submitted methods sections.

"Cohen's κ = 0.42, considered moderate agreement (Landis and Koch)." The 0.42 number means little without prevalence and raw agreement. If prevalence is 0.05, the same agreement rate could correspond to AC1 = 0.95 — a very different conclusion.

"Excellent agreement (κ > 0.80)." Without confidence intervals or sample size, "excellent" is rhetorical. Report κ ± CI; let readers compute their own adjective.

"IRR was assessed by spot-checking 5% of records." Spot-checking is not IRR. IRR is a structured comparison on the same records by the same raters. Spot-checking is a quality-assurance pattern; report it as such, not as IRR.

Putting it to work this week

Three concrete steps before your next screening kickoff:

Draw a calibration sample of at least 200 records, stratified to include 30+ positives if possible.
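Compute and report all five elements: percent agreement, κ, AC1, prevalence, resolution mechanism. Use the R packages irr (kappa) and irrCAC (AC1). In Python, statsmodels.stats.inter_rater provides cohens_kappa; to our knowledge it has no AC1 function, so compute AC1 directly from the binary formula above or use a dedicated package.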
Pre-register your reliability acceptance threshold in the protocol — both the κ and the AC1 floor — so the team is not negotiating it after seeing the numbers.
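The first step above can be sketched as a stratified draw from a pilot-labeled pool. Everything in this example is hypothetical: the function name, the ID scheme, and the assumption that a set of known positives already exists from a pilot or external benchmark.

```python
import random


def draw_calibration_sample(known_positive_ids, other_ids,
                            total=200, min_positives=30, seed=2026):
    """Draw `total` record IDs, including up to `min_positives` known positives."""
    rng = random.Random(seed)                 # fixed seed keeps the draw auditable
    n_pos = min(min_positives, len(known_positive_ids))
    positives = rng.sample(sorted(known_positive_ids), n_pos)
    # Raises ValueError if other_ids has fewer than total - n_pos records.
    rest = rng.sample(sorted(other_ids), total - n_pos)
    sample = positives + rest
    rng.shuffle(sample)                       # do not reveal the stratum by order
    return sample, n_pos


# Hypothetical pool: 40 pilot-labeled positives and 2,000 other records.
pilot_positives = set(range(40))
others = set(range(1000, 3000))
sample, n_pos = draw_calibration_sample(pilot_positives, others)
print(len(sample), "records drawn,", n_pos, "known positives")
```

Because the sample is enriched, the prevalence observed in it no longer estimates prevalence in the full search, so the methods section should state that the draw was stratified.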

The statistics are simple. The discipline is in choosing the right one and reporting it honestly.