Grading Evidence
This is a simple way to rate the quality of clinical studies on a 1-10 scale. When the meter falls below five, we’re usually dealing with observational data that is open to interpretation.

Randomized Controlled Trials
Randomized trials start at a higher level (10) than observational studies (1-5), but they can have flaws that move their ranking down.
- Start with 10
- For each flaw below, subtract the number in parentheses
The flaws in this first group are serious, so we’ll always subtract something when they’re present. Within each flaw’s range, subtract toward the high end if you’re pretty sure the flaw could change the conclusion, and toward the low end if you’re not as sure.
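For readers who like to see the bookkeeping spelled out, here is a minimal sketch of the procedure in Python (the deduction values are hypothetical examples, not a fixed implementation):

```python
def grade_rct(deductions):
    """Start at 10, subtract each flaw's deduction, and floor the
    final score at 1 (per the note at the end of this checklist)."""
    return max(10 - sum(deductions), 1)

# Hypothetical trial: no placebo arm (-3) and 25% dropouts (-2)
print(grade_rct([3, 2]))  # -> 5
```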
No placebo (1-4)?
When a new treatment works as well as a standard one, it looks pretty good. But wait. In psychiatry, the placebo response is often potent enough to overpower standard treatments. Think of all the negative SSRI trials. Unless a placebo arm was included to confirm that the trial could detect a difference, we are left with doubt. Case in point: SAMe consistently equals antidepressants in head-to-head trials, but has largely failed against placebo.
Big problem (3-4): Most trials in psychiatry.
Small problem (1-2): If the outcome is objective (eg, lab testing), the disorder has a low placebo response (eg, tardive dyskinesia, schizophrenia, OCD, bipolar mania), or the experimental treatment surpassed the active control.
No primary outcome (1-4)?
A quality trial states the primary outcome in advance. If that outcome does not reach significance (p < 0.05), the trial is negative.
Suspect trials don’t name a primary outcome, and instead list a scattershot of outcomes. The more outcomes they test, the more likely a false-positive result, rendering the p-value useless. This kind of “p-hacking” or “data fishing” makes the results about as reliable as an observational study.
You can adjust for multiple outcomes with a stricter cut-off for the p value. The most conservative method (the Bonferroni correction) divides the p cut-off (0.05) by the number of outcomes (eg, trials with 2 outcomes need a p value < 0.025). Other methods take into account the fact that many outcomes are correlated (eg, if a patient scores high on one depression scale they’re likely to score high on another).
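In code, the Bonferroni correction is a one-liner; a quick sketch:

```python
def bonferroni_cutoff(n_outcomes, alpha=0.05):
    """Bonferroni correction: each outcome must beat alpha / n_outcomes."""
    return alpha / n_outcomes

print(bonferroni_cutoff(2))  # -> 0.025, as in the example above
```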
Really big problem (make it uncontrolled): There are enough negative outcomes (> 50%) to make the trial inconclusive. Rate it as an uncontrolled case series (2/10) and subtract no more.
Big problem (3-4): There are enough negative outcomes (30-50%) to make the conclusion suspect.
Small problem (1-2): The majority of the outcomes are positive, even if the primary outcome missed the mark; or the p values are small enough to survive the Bonferroni correction.
Small size (1-3)?
Small trials are more likely to turn up false results. That includes false positives and false negatives alike, but only the false positives are likely to get published.
The ideal size depends on how potent the treatment is (effect size) and how much certainty (power) you need. The table gives the total sample size needed for a two-arm trial:
| Effect Size | 70% Power | 80% Power | 90% Power |
|---|---|---|---|
| Small (0.2) | 618 | 786 | 1,050 |
| Medium (0.5) | 100 | 126 | 168 |
| Large (0.8) | 40 | 50 | 66 |
If you don’t know the effect size, use 0.5, which is where the average psychiatric treatment falls. For a 0.5 effect, the table shows that a sample of at least 100 gives the trial 70% power to detect the benefit (Serdar CC et al, 2020).
This holds up if we’re testing two arms, like lithium vs. placebo, but we’ll need a larger trial to test more treatments (eg, a sample of 200 instead of 100 if testing 4 treatments instead of 2). Studies that use uneven allocation also need larger samples (eg, putting fewer patients in the placebo group).
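If you want to verify these numbers, they follow from a standard power calculation. A sketch using Python’s statsmodels package, assuming a two-sided two-sample t-test at alpha = 0.05:

```python
from statsmodels.stats.power import TTestIndPower

# Per-group sample size for a two-arm trial (two-sided alpha = 0.05)
n_per_group = TTestIndPower().solve_power(effect_size=0.5, power=0.70, alpha=0.05)
print(round(2 * n_per_group))  # total N of about 100, matching the table
```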
Big problem (3): Sample size is less than 40 (or less than half the ideal size for 70% power).
Small problem (1-2): Sample size is more than 40 (or more than half the ideal size for 70% power).
Not blinded (1-2)?
Studies can blind various parties to the treatment: patients, their clinicians, the raters, and even an external board that monitors the study. Blinding is particularly important with subjective outcomes, like the rating scales used in most psychiatric studies.
Blinding is not possible with some treatments (psychotherapy, TMS, diet). In that case, we need a fair comparison, like a sham (fake) lightbox, or a supportive psychotherapy that balances out the human interaction common to all therapies. Unfair and unreliable comparisons include wait-list controls and treatment as usual.
Another problem is functional unblinding. Ideally, no one can tell who got the placebo and who got the med. This ideal is rarely met. In depression, the accurate guess rate is 60% for SSRIs, 70-80% for esketamine, and >90% for psychedelics.
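When a trial reports a blinding check, you can test whether the guess rate beats chance. A sketch with hypothetical numbers, using scipy:

```python
from scipy.stats import binomtest

# Hypothetical blinding check: 60 of 100 patients guessed their
# assignment correctly; test whether that beats the 50% chance rate
result = binomtest(60, 100, p=0.5, alternative="greater")
print(round(result.pvalue, 3))  # p ~ 0.03 -> guesses beat chance
```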
Big problem (2): For example, if no one was blind to the treatment, or if a wait-list control was compared to psychotherapy.
Small problem (1): For example, a single-blind medication trial.
Cross-over trial (2-3)?
Cross-over trials allow investigators to enroll fewer subjects without compromising power, but they are less reliable than “parallel-group” trials:
- Carry-over effects from the previous treatment can contaminate the next phase
- Fluctuations in the course of illness can invalidate the results, so cross-over designs are only useful for chronic, stable diseases
Subtract 2-3 depending on how likely these problems were to influence the results.
Dropouts (1-4)?
This rating depends on 1) the percentage of dropouts (< 20% is ideal), 2) how evenly they were distributed between the placebo and med groups, and 3) how they were handled in the analysis. Your final score (1-4) is a judgment call based on how likely they were to affect the conclusion.
People drop out of trials for a reason. They are likely the more severe cases, or the ones who get worse (or no better) in the trial. Leaving them out of the final analysis biases the results. If the trial did not use an “intent-to-treat” analysis, downgrade it.
- Ideal: Full intent-to-treat analysis
- Good enough: Modified intent-to-treat analysis
- Problematic: “completer,” “complete case,” or “available case” analyses (evaluating only participants who complete the study), or “per protocol” analyses (only participants who completed all key study procedures)

There are various ways to impute data from dropouts, including “last observation carried forward” (carries the last data point forward) and the more conservative “multiple imputation” (averages multiple estimates for the missing value to better reflect uncertainty).
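A toy calculation shows why completer analyses flatter the treatment (the numbers are made up); under intent-to-treat, dropouts count as non-responders:

```python
# Hypothetical trial: 100 randomized, 30 drop out, 40 of the 70 completers respond
randomized, dropouts, responders = 100, 30, 40

completer_rate = responders / (randomized - dropouts)  # 57% -- flatters the drug
itt_rate = responders / randomized                     # 40% -- dropouts count as failures
print(f"completer: {completer_rate:.0%}, intent-to-treat: {itt_rate:.0%}")
```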
Even when dropouts are accounted for, sometimes their sheer number makes the trial too uncertain. An arbitrary cut-off for the ideal is < 20% dropouts. Long-term trials and those involving unstable patients, like borderline personality or substance use disorders, will often cross that threshold.
- Ideal: Less than 20% of subjects drop out after randomization
- Small problem: 20-30%
- Big problem: Over 30% (or over 20% and grossly uneven between two groups)
These next flaws are not always deal breakers. Only subtract a point (1) if the flaw is serious enough to potentially impact the bottom line.
Enriched sample (1)?
Who tried to enroll in the trial? Were any treatments stopped or changed before they entered it? These questions might reveal biases that favor the treatment under investigation. Psychedelic trials tend to attract people who’ve taken the drug before (and had good experiences). If an antidepressant trial required subjects to stop other antidepressants before entering, they may come in with withdrawal symptoms that favor the new antidepressant.
Randomization not concealed (1)?
We want to avoid the risk that investigators will shunt certain types of patients into the active drug group, or that they can guess who got randomized to which group. Ideally, the randomization is removed from the investigator’s office, such as by phone, web, or through a pharmacist or centralized service (allocation concealment). Coded containers and SNOSE (sequentially numbered, opaque, sealed envelopes) also work.
If the randomization looks suspect and the baseline characteristics of the subjects are skewed, downgrade the trial.
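For illustration, here is roughly what a centralized service does behind the scenes: a minimal sketch of blocked randomization (block size and labels are arbitrary):

```python
import random

def concealed_sequence(n_blocks):
    """Blocked allocation list (2 drug, 2 placebo per block of 4), meant to
    be generated and held by a central service, not the investigator."""
    sequence = []
    for _ in range(n_blocks):
        block = ["drug", "drug", "placebo", "placebo"]
        random.shuffle(block)
        sequence.extend(block)
    return sequence

print(concealed_sequence(2))  # eg ['placebo', 'drug', 'drug', 'placebo', ...]
```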
Other
- How plausible are the results, judging from earlier studies? If the study concludes that stimulants treat psychosis, we’re going to need a lot more proof to move the needle.
- How relevant is the outcome? In a study of alcohol use disorder, did they measure cravings in a laboratory setting or actual use in the real world?
- How representative is the population? Did they draw from a sample of treatment-seeking patients? What are the exclusion criteria? Sometimes when we lack data relevant to our patient we have to borrow from similar populations, like using a child ADHD trial to guide treatment for an adult patient.
(If the final score falls below 1, use 1)
Observational Studies
These include any study where the investigator observes what happens in the natural environment. There may be a control group, but there is no randomization.
Observational studies are less reliable because cause and effect are hard to discern without randomization (correlation is not causation). Suppose a study finds that people who take antidepressants are more likely to die of heart disease. Are the deaths caused by antidepressants, depression, or some other factor?
Start with a rating from 1-5 depending on the study type, ranked here from lowest (1) to highest (5) quality:
1. Case reports
2. Case series
3. Case-control studies (these start with an outcome, then look back for a cause, such as “how many people with heart disease took antidepressants?”) and cross-sectional studies
4. Cohort studies (these start with an exposure, then look forward in time for an outcome, such as “how many people who take antidepressants go on to develop heart disease?”) and quasi-experimental studies
5. Dramatic reports, such as 1) a case series with a large effect or a clear dose-response; 2) “all or nothing” studies, such as where everyone on the med developed a rash and no one who didn’t take it got a rash
- Add 1 if the study is high quality (prospective, large sample, likely to control most confounders, or Mendelian randomization) or supported by high-quality systematic reviews or meta-analyses of observational studies
- Subtract 1 if low quality (retrospective, small sample, poorly controlled)
All observational studies end up with a rating between 1 and 6.
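The same bookkeeping in sketch form, with the design categories abbreviated:

```python
DESIGN_BASE = {  # base ratings from the ranked list above
    "case report": 1, "case series": 2,
    "case-control / cross-sectional": 3,
    "cohort / quasi-experimental": 4, "dramatic report": 5,
}

def grade_observational(design, quality_adjustment=0):
    """Base rating by design, +1 or -1 for quality, bounded to 1-6."""
    return min(max(DESIGN_BASE[design] + quality_adjustment, 1), 6)

print(grade_observational("cohort / quasi-experimental", +1))  # -> 5
```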
Other Grading Systems
I’m grateful for the works below, which influenced this system.
- GRADE system for rating the certainty of evidence (used by the Cochrane Group)
- Scottish Intercollegiate Guidelines Network (SIGN)
- Oxford Centre for Evidence-Based Medicine (CEBM)
- US Preventive Services Task Force (USPSTF)
- BIS FOES (a simplified system for psychiatry)
- World Federation of Societies of Biological Psychiatry (WFSBP)