The GRADE Framework: Rating Evidence from Very Low to High

The GRADE framework (Grading of Recommendations Assessment, Development and Evaluation) is the most widely adopted system for rating the certainty of evidence in systematic reviews. Cochrane requires it, the WHO uses it for clinical guidelines, and a growing number of journals expect GRADE assessments as part of manuscript submission.

Despite its importance, GRADE is often misunderstood or applied inconsistently. This guide breaks down each component so you can apply it correctly in your own reviews.

What GRADE Does

GRADE answers one question: how confident are we that the true effect lies close to the estimate from the systematic review?

It rates the certainty of evidence for each outcome on a four-level scale:

Level	Meaning
High	Very confident that the true effect lies close to the estimate
Moderate	Moderately confident; the true effect is likely close to the estimate but could be substantially different
Low	Limited confidence; the true effect may be substantially different from the estimate
Very Low	Very little confidence; the true effect is likely to be substantially different from the estimate

GRADE is applied per outcome, not per study. A single systematic review might rate the evidence for mortality as "Moderate" and for quality of life as "Very Low."

Starting Point

The starting certainty depends on the study design:

Randomized controlled trials start at High
Observational studies start at Low

From there, the rating can be downgraded (for limitations) or upgraded (for strengths).
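The arithmetic of starting points, downgrades, and upgrades can be sketched in a few lines of Python. This is a minimal illustration, not an official GRADE tool: the function name `grade_certainty` and its parameters are hypothetical, and a real assessment rests on judgment rather than mechanical addition.

```python
# Certainty levels ordered from lowest to highest.
LEVELS = ["Very Low", "Low", "Moderate", "High"]

# Starting certainty by study design, as described above.
START = {"rct": "High", "observational": "Low"}


def grade_certainty(design, downgrades=0, upgrades=0):
    """Return the certainty level after applying downgrades and upgrades.

    design     -- "rct" or "observational" (hypothetical labels)
    downgrades -- total levels subtracted across all downgrade domains
    upgrades   -- total levels added across all upgrade factors
    """
    if downgrades < 0 or upgrades < 0:
        raise ValueError("downgrades and upgrades must be non-negative")
    start = LEVELS.index(START[design])
    # The scale is bounded: it cannot drop below Very Low or rise above High.
    final = max(0, min(len(LEVELS) - 1, start - downgrades + upgrades))
    return LEVELS[final]


print(grade_certainty("rct", downgrades=1))          # Moderate
print(grade_certainty("observational", upgrades=1))  # Moderate
print(grade_certainty("rct", downgrades=4))          # Very Low
```

Clamping the result reflects the four-level scale: an RCT body of evidence with many serious concerns still bottoms out at "Very Low."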

Five Downgrade Domains

Each domain can reduce certainty by one or two levels.

1. Risk of Bias

If the included studies have methodological limitations (based on RoB 2 or ROBINS-I assessments), the certainty is downgraded. Common issues include:

Lack of allocation concealment
Unblinded outcome assessment for subjective outcomes
High attrition rates
Selective outcome reporting

Downgrade by one level for "serious" limitations, two levels for "very serious."

2. Inconsistency

Inconsistency refers to unexplained variability in results across studies. Indicators include:

Substantial heterogeneity (as a rule of thumb, I² > 50% or a statistically significant Cochran's Q test)
Point estimates that vary widely
Confidence intervals that do not overlap
Subgroup analyses that do not explain the heterogeneity

If all studies point in the same direction but with different magnitudes, this is less concerning than studies pointing in opposite directions.

3. Indirectness

Indirectness occurs when the evidence does not directly match the review question. There are two types:

Population/intervention/outcome indirectness: the studies used a surrogate outcome, a different population, or a different version of the intervention
Comparison indirectness: the comparison is indirect (e.g., A vs B inferred from A vs C and B vs C, relevant in NMA)
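One common way to infer A vs B from A vs C and B vs C is an adjusted indirect comparison on the log scale, often attributed to Bucher. The guide does not name a specific method, so treat this as an illustrative choice; the function name and the input values are hypothetical.

```python
import math


def bucher_indirect(log_ac, se_ac, log_bc, se_bc, z=1.96):
    """Indirect A-vs-B estimate through common comparator C (log scale).

    Returns (log effect, standard error, (lower, upper) confidence limits).
    """
    log_ab = log_ac - log_bc
    # Variances add, so the indirect estimate is less precise than either input.
    se_ab = math.sqrt(se_ac ** 2 + se_bc ** 2)
    return log_ab, se_ab, (log_ab - z * se_ab, log_ab + z * se_ab)


# Hypothetical inputs: RR 0.70 for A vs C and RR 0.85 for B vs C.
log_ab, se_ab, (low, high) = bucher_indirect(math.log(0.70), 0.10,
                                             math.log(0.85), 0.12)
print(f"RR A vs B = {math.exp(log_ab):.2f} "
      f"({math.exp(low):.2f} to {math.exp(high):.2f})")
```

Because the standard error of the indirect estimate is always larger than that of either direct comparison, indirect evidence tends to raise imprecision concerns as well as indirectness concerns.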

4. Imprecision

Imprecision relates to the width of the confidence interval around the pooled effect estimate. Evidence is imprecise when:

The confidence interval crosses the threshold for clinical significance (the "null effect" line and/or the minimum important difference)
The total sample size is small (the "optimal information size" criterion)
There are few events (for dichotomous outcomes)

The GRADE guidelines suggest downgrading if the optimal information size is not met, even if the confidence interval appears narrow.
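The optimal information size is usually approximated by the sample size of a single adequately powered trial. The sketch below uses the standard two-proportion sample size formula for a dichotomous outcome; the function name, the default alpha and power, and the example control risk and relative risk reduction are assumptions chosen for illustration.

```python
import math
from statistics import NormalDist


def optimal_information_size(control_risk, rrr, alpha=0.05, power=0.80):
    """Total participants (both arms, equal allocation) for a conventional trial.

    control_risk -- assumed event risk in the control group (0 to 1)
    rrr          -- relative risk reduction considered important (0 to 1)
    """
    if not 0 < control_risk < 1 or not 0 < rrr < 1:
        raise ValueError("control_risk and rrr must lie strictly between 0 and 1")
    p_c = control_risk
    p_t = control_risk * (1 - rrr)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    n_per_arm = ((z_alpha + z_beta) ** 2
                 * (p_c * (1 - p_c) + p_t * (1 - p_t))
                 / (p_c - p_t) ** 2)
    return 2 * math.ceil(n_per_arm)


# Hypothetical scenario: 20% control risk, 25% relative risk reduction.
ois = optimal_information_size(0.20, 0.25)
total_in_meta_analysis = 950  # hypothetical pooled sample size
print(f"OIS = {ois}; criterion met: {total_in_meta_analysis >= ois}")
```

In this hypothetical scenario the pooled sample falls well short of the optimal information size, which would support downgrading for imprecision even if the confidence interval looked narrow.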

5. Publication Bias

Publication bias occurs when studies with statistically significant or favorable results are more likely to be published. Indicators include:

Funnel plot asymmetry
Statistical tests for small-study effects (Egger's test)
Discrepancies between registered protocols and published results
Selective outcome reporting identified during risk of bias assessment

Funnel plots and small-study tests have low power when few studies are available, so publication bias is difficult to assess with fewer than 10 studies in the meta-analysis.
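Egger's test regresses each study's standardized effect (effect divided by its standard error) on its precision (one divided by the standard error); an intercept far from zero suggests funnel plot asymmetry. The sketch below fits that regression with the standard library only. The function name and the ten data points are hypothetical, and the p-value is omitted because it requires a t distribution with n - 2 degrees of freedom, which the standard library does not provide.

```python
import math


def egger_intercept(effects, std_errors):
    """Return (intercept, its standard error, t statistic) for Egger's regression."""
    n = len(effects)
    if n != len(std_errors) or n < 3:
        raise ValueError("need at least three studies with matching standard errors")
    x = [1.0 / se for se in std_errors]                 # precision
    y = [e / se for e, se in zip(effects, std_errors)]  # standardized effect
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("standard errors must not all be equal")
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    intercept = mean_y - slope * mean_x
    residual_ss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    sigma2 = residual_ss / (n - 2)
    se_intercept = math.sqrt(sigma2 * (1.0 / n + mean_x ** 2 / sxx))
    t_stat = intercept / se_intercept if se_intercept > 0 else float("nan")
    return intercept, se_intercept, t_stat


# Hypothetical data: smaller studies (larger SE) report larger effects.
effects = [0.90, 0.75, 0.70, 0.55, 0.50, 0.45, 0.40, 0.35, 0.30, 0.28]
std_errors = [0.50, 0.45, 0.40, 0.35, 0.30, 0.25, 0.20, 0.15, 0.12, 0.10]
intercept, se, t = egger_intercept(effects, std_errors)
print(f"intercept = {intercept:.2f}, SE = {se:.2f}, t = {t:.2f}")
```

Asymmetry has causes other than publication bias, such as genuine heterogeneity between small and large studies, so the test result is one input to the judgment rather than a verdict.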

Three Upgrade Factors (Observational Studies)

While observational evidence starts at "Low," it can be upgraded when:

1. Large Magnitude of Effect

A large effect size (e.g., relative risk > 2 or < 0.5) that is consistent across studies suggests that the observed association is unlikely to be entirely due to confounding. Very large effects (RR > 5 or < 0.2) can warrant upgrading by two levels.
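The thresholds above can be written as a small helper. This is only a sketch of the rule of thumb: the function name is hypothetical, and the helper checks magnitude alone, so consistency across studies and the absence of other serious concerns still have to be judged separately.

```python
def magnitude_upgrade(relative_risk):
    """Suggest upgrade levels (0, 1, or 2) from the size of a relative risk."""
    if relative_risk <= 0:
        raise ValueError("relative risk must be positive")
    # Fold protective effects (RR < 1) onto the same scale as harmful ones,
    # so that RR < 0.5 is treated like RR > 2 and RR < 0.2 like RR > 5.
    folded = max(relative_risk, 1.0 / relative_risk)
    if folded > 5:
        return 2
    if folded > 2:
        return 1
    return 0


print(magnitude_upgrade(3.0))   # 1
print(magnitude_upgrade(0.1))   # 2
print(magnitude_upgrade(1.5))   # 0
```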

2. Dose-Response Gradient

A clear dose-response relationship, in which higher exposure is associated with a progressively larger (or smaller) effect, strengthens causal inference and can justify upgrading.

3. All Plausible Confounders Would Reduce the Effect

If all plausible confounders would bias the result toward the null (or in the opposite direction), yet the study still shows an effect, this increases confidence that the true effect exists.

Summary of Findings Tables

The Summary of Findings (SoF) table is the standard output of a GRADE assessment. It presents:

Each critical and important outcome
The number of studies and participants contributing to each outcome
The pooled effect estimate with confidence interval
The assumed and corresponding risk (for dichotomous outcomes)
The GRADE certainty rating
Brief comments explaining the rating

SoF tables are required by Cochrane and recommended by most guideline panels. They communicate the key findings of the review in a single, standardized format.
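For dichotomous outcomes, the corresponding risk is typically obtained by multiplying the assumed (baseline) risk by the relative effect and expressing the result per 1,000 people. The following sketch shows that calculation for a relative risk and its confidence limits; the function name and the example values are hypothetical.

```python
def absolute_effects_per_1000(assumed_risk, rr, rr_low, rr_high):
    """Convert a relative risk and its CI into risks per 1,000 people."""
    if not 0 <= assumed_risk <= 1:
        raise ValueError("assumed_risk must lie between 0 and 1")

    def per_1000(risk):
        # Clamp to a valid probability before scaling.
        return round(min(max(risk, 0.0), 1.0) * 1000)

    assumed = per_1000(assumed_risk)
    corresponding = per_1000(assumed_risk * rr)
    return {
        "assumed": assumed,
        "corresponding": corresponding,
        "corresponding_ci": (per_1000(assumed_risk * rr_low),
                             per_1000(assumed_risk * rr_high)),
        "difference": corresponding - assumed,
    }


# Hypothetical row: 20% baseline risk, RR 0.75 (95% CI 0.60 to 0.94).
row = absolute_effects_per_1000(0.20, 0.75, 0.60, 0.94)
print(row)
# {'assumed': 200, 'corresponding': 150, 'corresponding_ci': (120, 188), 'difference': -50}
```

Read as an SoF row, this says 200 per 1,000 would have the event without the intervention and 150 per 1,000 (120 to 188) with it, or 50 fewer per 1,000.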

CINeMA: GRADE for Network Meta-Analysis

Standard GRADE was designed for pairwise comparisons. When conducting a network meta-analysis, the CINeMA (Confidence in Network Meta-Analysis) framework extends GRADE with six domains:

Within-study bias (similar to risk of bias)
Reporting bias (similar to publication bias)
Indirectness (adapted for network comparisons)
Imprecision (using NMA-specific thresholds)
Heterogeneity (variability across direct comparisons)
Incoherence (disagreement between direct and indirect evidence — unique to NMA)

CINeMA produces confidence ratings for each pairwise comparison in the network, which can be summarized in a league table annotated with confidence levels.
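To illustrate what incoherence means for a single comparison, the sketch below compares a direct and an indirect estimate with a simple z-test on their difference. This is a simplified illustration rather than CINeMA's own procedure, and the function name and input values are hypothetical.

```python
import math
from statistics import NormalDist


def incoherence_z_test(direct, se_direct, indirect, se_indirect):
    """Return (difference, z, two-sided p) for direct vs indirect estimates.

    Estimates are expected on an additive scale such as log odds ratios,
    and the two sources are assumed to be independent.
    """
    diff = direct - indirect
    se_diff = math.sqrt(se_direct ** 2 + se_indirect ** 2)
    z = diff / se_diff
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return diff, z, p


# Hypothetical log odds ratios for one comparison in a network.
diff, z, p = incoherence_z_test(-0.40, 0.10, 0.05, 0.15)
print(f"difference = {diff:.2f}, z = {z:.2f}, p = {p:.3f}")
```

A small p-value signals that direct and indirect evidence disagree more than chance would explain, which is the situation the incoherence domain is meant to capture.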

Common GRADE Mistakes

Rating per study instead of per outcome: GRADE is applied to the body of evidence for each outcome, not to individual studies
Double-counting risk of bias: if high-risk studies have already been excluded from the meta-analysis, do not also downgrade the remaining evidence for their limitations, and if you downgrade at the GRADE level, do not also exclude them (choose one approach)
Ignoring imprecision when the result is statistically significant: a statistically significant result can still be imprecise if the confidence interval is wide enough to include effects too small to matter, or if the optimal information size is not met
Not distinguishing certainty from the direction of effect: "Very Low" certainty does not mean the treatment does not work; it means we are very uncertain about the estimate
Failing to justify each downgrade: every downgrade decision should be explained in a footnote or accompanying text

GRADE Assessment in mapped

mapped integrates GRADE assessment as a dedicated workflow step for systematic reviews:

For pairwise and prognostic reviews, the standard GRADE framework is available with all five downgrade domains and three upgrade factors
For NMA reviews, CINeMA replaces standard GRADE, with all six NMA-specific domains
For scoping reviews, GRADE is not included (certainty-of-evidence assessment is not a standard part of scoping review methodology)
For DTA reviews, the GRADE-DTA adaptation is available

Risk of bias results from the preceding step feed directly into the GRADE assessment. Summary of Findings tables are generated automatically and can be exported for manuscript inclusion.