The GRADE Framework: Rating Evidence from Very Low to High
The GRADE framework (Grading of Recommendations Assessment, Development and Evaluation) is the most widely adopted system for rating the certainty of evidence in systematic reviews. Cochrane requires it, the WHO uses it for clinical guidelines, and an increasing number of journals expect GRADE assessments as part of manuscript submission.
Despite its importance, GRADE is often misunderstood or applied inconsistently. This guide breaks down each component so you can apply it correctly in your own reviews.
What GRADE Does
GRADE answers one question: how confident are we that the true effect lies close to the estimate from the systematic review?
It rates the certainty of evidence for each outcome on a four-level scale:
| Level | Meaning |
|---|---|
| High | Very confident that the true effect lies close to the estimate |
| Moderate | Moderately confident; the true effect is likely close but could be substantially different |
| Low | Limited confidence; the true effect may be substantially different |
| Very Low | Very little confidence; the true effect is likely substantially different |
GRADE is applied per outcome, not per study. A single systematic review might rate the evidence for mortality as "Moderate" and for quality of life as "Very Low."
Starting Point
The starting certainty depends on the study design:
- Randomized controlled trials start at High
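- Observational studies start at Low (GRADE guidance allows non-randomized studies assessed with ROBINS-I to start at High, because confounding and selection bias are then handled in the risk of bias domain)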
From there, the rating can be downgraded (for limitations) or upgraded (for strengths).
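The bookkeeping behind a rating can be sketched in a few lines of Python. This is only an illustration: the function name `grade_certainty` and its arguments are hypothetical, and real GRADE ratings rest on structured judgment rather than arithmetic. The sketch assumes the two starting points listed above and clamps the result to the four-level scale.

```python
# Four-level GRADE scale, ordered from lowest to highest certainty.
LEVELS = ["Very Low", "Low", "Moderate", "High"]

# Starting index on the scale for each study design (hypothetical keys).
START = {"rct": 3, "observational": 1}


def grade_certainty(design: str, downgrades: int = 0, upgrades: int = 0) -> str:
    """Return the certainty label after applying level changes.

    design     -- "rct" or "observational"
    downgrades -- total levels removed across the five downgrade domains
    upgrades   -- total levels added across the three upgrade factors
    """
    if design not in START:
        raise ValueError(f"unknown design: {design!r}")
    if downgrades < 0 or upgrades < 0:
        raise ValueError("level changes must be non-negative")
    index = START[design] - downgrades + upgrades
    # The rating cannot go below Very Low or above High.
    index = max(0, min(index, len(LEVELS) - 1))
    return LEVELS[index]


if __name__ == "__main__":
    # RCT evidence with one serious limitation.
    print(grade_certainty("rct", downgrades=1))
    # Observational evidence with a large effect.
    print(grade_certainty("observational", upgrades=1))
```

Keeping the total number of levels as an explicit input makes it easy to record, in a footnote, which domain contributed each change.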
Five Downgrade Domains
Each domain can reduce certainty by one or two levels.
1. Risk of Bias
If the included studies have methodological limitations (based on RoB 2 or ROBINS-I assessments), the certainty is downgraded. Common issues include:
- Lack of allocation concealment
- Unblinded outcome assessment for subjective outcomes
- High attrition rates
- Selective outcome reporting
Downgrade by one level for "serious" limitations, two levels for "very serious."
2. Inconsistency
Inconsistency refers to unexplained variability in results across studies. Indicators include:
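- Substantial statistical heterogeneity (for example, I² above roughly 50% or a significant Cochran's Q test; both are rough guides, and the Q-test has low power when studies are few)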
- Point estimates that vary widely
- Confidence intervals that do not overlap
- Subgroup analyses that do not explain the heterogeneity
If all studies point in the same direction but with different magnitudes, this is less concerning than studies pointing in opposite directions.
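For readers who want to see where the heterogeneity statistics come from, here is a minimal sketch of Cochran's Q and I² under a fixed-effect, inverse-variance model. It assumes each study supplies an effect estimate on an additive scale (for example, a log risk ratio) together with its variance; the function name `cochran_q_and_i2` is illustrative.

```python
def cochran_q_and_i2(effects, variances):
    """Return (Q, I2 in percent) for study effects and their variances."""
    if len(effects) != len(variances) or len(effects) < 2:
        raise ValueError("need at least two studies with matching inputs")
    # Inverse-variance weights: more precise studies count for more.
    weights = [1.0 / v for v in variances]
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    # Q is the weighted sum of squared deviations from the pooled estimate.
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    # I2 is the share of Q in excess of its expected value (df), floored at 0.
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return q, i2


if __name__ == "__main__":
    # Hypothetical log risk ratios from three studies with equal variances.
    q, i2 = cochran_q_and_i2([0.1, 0.5, 0.9], [0.04, 0.04, 0.04])
    print(f"Q = {q:.2f}, I2 = {i2:.0f}%")
```

The numbers alone do not decide the rating: the direction of the point estimates and the overlap of the confidence intervals still need to be inspected.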
3. Indirectness
Indirectness occurs when the evidence does not directly match the review question. There are two types:
- Population/intervention/outcome indirectness: the studies used a surrogate outcome, a different population, or a different version of the intervention
- Comparison indirectness: the comparison is indirect (e.g., A vs B inferred from A vs C and B vs C, relevant in NMA)
4. Imprecision
Imprecision relates to the width of the confidence interval around the pooled effect estimate. Evidence is imprecise when:
- The confidence interval crosses a decision threshold (the line of no effect and/or the minimally important difference), so its two ends would lead to different conclusions
- The total sample size falls short of the optimal information size, that is, the sample a single adequately powered trial would need
- There are few events (for dichotomous outcomes)
The GRADE guidelines suggest downgrading if the optimal information size is not met, even if the confidence interval appears narrow.
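One common way to approximate the optimal information size for a dichotomous outcome is the conventional two-group sample size calculation. The sketch below assumes a two-sided alpha of 0.05, 80% power, and a control-group risk and relative risk reduction chosen by the reviewer. The function names are illustrative and not part of any GRADE tool.

```python
import math
from statistics import NormalDist


def ois_per_arm(control_risk, relative_risk_reduction, alpha=0.05, power=0.80):
    """Participants needed per arm to detect the assumed risk reduction."""
    if not 0.0 < control_risk < 1.0:
        raise ValueError("control_risk must be between 0 and 1")
    if not 0.0 < relative_risk_reduction < 1.0:
        raise ValueError("relative_risk_reduction must be between 0 and 1")
    p1 = control_risk
    p2 = control_risk * (1.0 - relative_risk_reduction)
    # Standard normal quantiles for the two-sided test and the target power.
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_beta = NormalDist().inv_cdf(power)
    variance_sum = p1 * (1.0 - p1) + p2 * (1.0 - p2)
    n = (z_alpha + z_beta) ** 2 * variance_sum / (p1 - p2) ** 2
    return math.ceil(n)


def meets_ois(total_participants, control_risk, relative_risk_reduction):
    """True if the pooled sample reaches the two-arm optimal information size."""
    required = 2 * ois_per_arm(control_risk, relative_risk_reduction)
    return total_participants >= required


if __name__ == "__main__":
    print(ois_per_arm(0.20, 0.25))
    print(meets_ois(1200, 0.20, 0.25))
```

With a control risk of 20% and a 25% relative risk reduction, this gives 903 participants per arm, so a meta-analysis with fewer than 1,806 participants in total would fall short under these assumptions.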
5. Publication Bias
Publication bias occurs when studies with statistically significant or favorable results are more likely to be published. Indicators include:
- Funnel plot asymmetry
- Statistical tests for small-study effects (Egger's test)
- Discrepancies between registered protocols and published results
- Selective outcome reporting identified during risk of bias assessment
Funnel plots and small-study tests have little power with fewer than about 10 studies, so publication bias is difficult to assess in small meta-analyses.
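To make the small-study test concrete, here is a minimal sketch of Egger's regression in its classic form: the standardized effect (effect divided by its standard error) is regressed on precision (one over the standard error), and an intercept far from zero suggests funnel plot asymmetry. The function name `egger_intercept` is illustrative. The sketch returns the intercept and its t-statistic only; a p-value would need a t-distribution with n − 2 degrees of freedom, which the Python standard library does not provide.

```python
import math


def egger_intercept(effects, std_errors):
    """Return (intercept, t_statistic) from Egger's regression."""
    n = len(effects)
    if n != len(std_errors) or n < 3:
        raise ValueError("need at least three studies with matching inputs")
    x = [1.0 / se for se in std_errors]                   # precision
    y = [e / se for e, se in zip(effects, std_errors)]    # standardized effect
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("standard errors must not all be identical")
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    # Residual variance and the standard error of the intercept (OLS formulas).
    residual_ss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    s2 = residual_ss / (n - 2)
    se_intercept = math.sqrt(s2 * (1.0 / n + x_mean ** 2 / sxx))
    t_stat = intercept / se_intercept if se_intercept > 0 else math.nan
    return intercept, t_stat


if __name__ == "__main__":
    # Hypothetical effects and standard errors from four studies.
    a, t = egger_intercept([3.0, 2.0, 2.0, 1.75], [1.0, 0.5, 1.0 / 3.0, 0.25])
    print(f"intercept = {a:.2f}, t = {t:.2f}")
```

Four studies are used here only to keep the arithmetic easy to follow; a real assessment would need far more studies for the test to be informative.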
Three Upgrade Factors (Observational Studies)
While observational evidence starts at "Low," it can be upgraded when:
1. Large Magnitude of Effect
A large effect size (e.g., relative risk > 2 or < 0.5) that is consistent across studies suggests that the observed association is unlikely to be entirely due to confounding. Very large effects (RR > 5 or < 0.2) can warrant upgrading by two levels.
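The thresholds above translate directly into a small lookup. This is a sketch with a hypothetical function name, `upgrade_for_large_effect`. It encodes only the relative risk cut-offs and leaves the judgment about consistency across studies and residual confounding to the reviewer.

```python
def upgrade_for_large_effect(relative_risk: float) -> int:
    """Return the number of levels the RR thresholds would support (0, 1 or 2)."""
    if relative_risk <= 0:
        raise ValueError("relative risk must be positive")
    # Very large effect: RR > 5 or RR < 0.2.
    if relative_risk > 5 or relative_risk < 0.2:
        return 2
    # Large effect: RR > 2 or RR < 0.5.
    if relative_risk > 2 or relative_risk < 0.5:
        return 1
    return 0


if __name__ == "__main__":
    for rr in (1.3, 2.5, 0.1):
        print(rr, upgrade_for_large_effect(rr))
```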
2. Dose-Response Gradient
A clear dose-response relationship — where increasing exposure is associated with increasing (or decreasing) effect — strengthens causal inference and can justify upgrading.
3. All Plausible Confounders Would Reduce the Effect
If all plausible confounders would bias the result toward the null (or in the opposite direction), yet the study still shows an effect, this increases confidence that the true effect exists.
Summary of Findings Tables
The Summary of Findings (SoF) table is the standard output of a GRADE assessment. It presents:
- Each critical and important outcome
- The number of studies and participants contributing to each outcome
- The pooled effect estimate with confidence interval
- The assumed and corresponding risk (for dichotomous outcomes)
- The GRADE certainty rating
- Brief comments explaining the rating
SoF tables are required by Cochrane and recommended by most guideline panels. They communicate the key findings of the review in a single, standardized format.
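The assumed and corresponding risk columns follow from simple arithmetic: the corresponding risk is the assumed (comparator) risk multiplied by the pooled relative risk, usually expressed per 1,000 people. The sketch below assumes a risk ratio with its confidence limits; the function name and dictionary keys are illustrative.

```python
def sof_absolute_effects(assumed_risk, rr, rr_ci_low, rr_ci_high):
    """Return per-1,000 absolute effects for one dichotomous outcome."""
    if not 0.0 <= assumed_risk <= 1.0:
        raise ValueError("assumed_risk must be between 0 and 1")

    def per_1000(risk):
        # A risk cannot exceed 100%, so cap before scaling.
        return round(min(risk, 1.0) * 1000)

    assumed = per_1000(assumed_risk)
    corresponding = per_1000(assumed_risk * rr)
    return {
        "assumed_per_1000": assumed,
        "corresponding_per_1000": corresponding,
        "corresponding_ci_per_1000": (
            per_1000(assumed_risk * rr_ci_low),
            per_1000(assumed_risk * rr_ci_high),
        ),
        # Negative values mean fewer events with the intervention.
        "difference_per_1000": corresponding - assumed,
    }


if __name__ == "__main__":
    # Hypothetical outcome: comparator risk 20%, RR 0.75 (95% CI 0.60 to 0.94).
    print(sof_absolute_effects(0.20, 0.75, 0.60, 0.94))
```

For this hypothetical outcome the table row would read 200 per 1,000 in the comparator group and 150 per 1,000 (120 to 188) with the intervention, or 50 fewer events per 1,000.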
CINeMA: GRADE for Network Meta-Analysis
Standard GRADE was designed for pairwise comparisons. When conducting a network meta-analysis, the CINeMA (Confidence in Network Meta-Analysis) framework extends GRADE with six domains:
- Within-study bias (similar to risk of bias)
- Reporting bias (similar to publication bias)
- Indirectness (adapted for network comparisons)
- Imprecision (using NMA-specific thresholds)
- Heterogeneity (variability across direct comparisons)
- Incoherence (disagreement between direct and indirect evidence — unique to NMA)
CINeMA produces confidence ratings for each pairwise comparison in the network, which can be summarized in a league table annotated with confidence levels.
Common GRADE Mistakes
- Rating per study instead of per outcome — GRADE is applied to the body of evidence for each outcome, not to individual studies
- Double-counting risk of bias — if you downgrade for risk of bias at the GRADE level, do not also exclude high-risk studies from the meta-analysis (choose one approach)
- Ignoring imprecision when the result is statistically significant — a statistically significant result can still be imprecise if the confidence interval is wide
- Not distinguishing certainty from the direction of effect — "Very Low" certainty does not mean the treatment does not work; it means we are very uncertain about the estimate
- Failing to justify each downgrade — every downgrade decision should be explained in a footnote or accompanying text
GRADE Assessment in mapped
mapped integrates GRADE assessment as a dedicated workflow step for systematic reviews:
- For pairwise and prognostic reviews, the standard GRADE framework is available with all five downgrade domains and three upgrade factors
- For NMA reviews, CINeMA replaces standard GRADE, with all six NMA-specific domains
- For scoping reviews, GRADE is not included (certainty-of-evidence assessment is generally not part of scoping review methodology)
- For DTA reviews, the GRADE-DTA adaptation is available
Risk of bias results from the preceding step feed directly into the GRADE assessment. Summary of Findings tables are generated automatically and can be exported for manuscript inclusion.
Further Reading
- Guyatt GH, et al. GRADE: an emerging consensus on rating quality of evidence and strength of recommendations. BMJ, 2008.
- Schünemann HJ, et al. GRADE handbook. Cochrane Training, updated 2023.
- Nikolakopoulou A, et al. CINeMA: An approach for assessing confidence in the results of a network meta-analysis. PLoS Medicine, 2020.
- BMJ Core GRADE Series, 2025.