Lies, Damn Lies, and Evidence

Scientific evidence is treated as holy gospel by some people, and entirely discounted by others. When it comes to evidence for mental health, living at either end of this spectrum is a mistake. This article explains everything you need to know to understand mental health evidence, including its limitations.

This started out as a chapter in the book, but was eventually removed because of both its length and its complexity. It was replaced by a brief teaser in the “Paging Dr. Google” chapter, but we wanted to keep the original available for those who want the details.

Before proceeding, we’ll let you in on a secret about evidence. People tell you a treatment they want you to use has evidence. Doctors and other professionals may use an evidence-based approach. Science gives us objective confidence through evidence.

Evidence is important. However, hearing that there is evidence for a treatment often does not mean what you think it means.

Let’s say you’ve been diagnosed with major depressive disorder (MDD). You hear a statement like “There is strong evidence to support the use of HappyNow™ for treatment of MDD” or “HappyNow™ was shown to be significantly better than placebo (a sugar pill) at treating MDD.”

Most people would hear these phrases and think to themselves: “I have MDD. So if I take HappyNow™ it will help me! I’ll be better!” Unfortunately, most people would be wrong. This is not at all what they mean. We’ll explain, but here’s the bottom line:

 If someone tells you that ‘X’ will treat your mental illness and that there is evidence or proof that it works, there is a high probability that it will be of little help to you.

Most people need to try several different treatments before they find one or more that fully treats their illness. People are often surprised that science doesn’t provide the level of certainty they expect. Why?

  • Evidence isn’t an absolute. There are different levels of evidence, some stronger than others. This is true in all areas of medicine.
  • Mental health research is tricky because mental illness is so complicated. Illnesses are made up of many possible symptoms, each of which may have a multitude of different causes.*
  • The stronger the evidence, the more uniform the test subjects. Stronger evidence results from limiting variability. To reduce variability, experiments may exclude people who have other physical or mental illnesses, people of certain ages, social backgrounds, etc. But mental health symptoms are affected by many factors, often intertwined. Limiting variability may produce stronger evidence, but it may apply to very few people.

* The understanding of what causes certain symptoms is still unclear compared with more established areas of medicine, e.g. cardiology. People used to think depression was all about serotonin, but now think many other factors contribute, e.g. inflammation. A faulty understanding of causation can make data that is collected less relevant.

Thorough testing that takes into account the variability found in the real world is practically impossible. Tools to examine and measure results are also limited. Evidence in mental health, therefore, has more caveats. Evidence cannot reliably predict what will or won’t work for someone. It’s not like evidence in chemistry class where all you’re checking is if adding A to B turns B purple.

Evidence in mental health is no guarantee. If you’re comfortable with that conclusion, please feel free to skip to the next chapter. If you want a deeper understanding of why evidence in mental health is less clear-cut, keep reading. This will also help if you need to argue with people about treatments. Compelled to justify why you’re not drinking 10 cups of tea each day made from a stinky plant that a persistent relative pulled from their garden? Read on. Or get them to read it. While they’re drinking their tea.

Levels of Evidence

In medicine, people look for proof that a treatment will work. What they mean by proof is scientific evidence supporting the conclusion. The best treatments are evidence-based.

However, evidence-based is not a simple yes or no concept. The quality of evidence affects how much you should trust its conclusions. Determining the quality of evidence can be extremely challenging.

Evidence comes in different forms. You’ve likely heard of some, like randomized control trials (RCT’s). Medical evidence usually falls into one of the categories shown below.

Level  Type of evidence
1a     Systematic review of randomized control trials
1b     Individual randomized control trial
1c     All or none
2a     Systematic review of cohort studies
2b     Individual cohort study
2c     “Outcomes” research; ecological studies
3a     Systematic review of case-control studies
3b     Individual case-control study
5      Expert opinion

Some levels of evidence.

Adapted from Oxford Centre for Evidence-based Medicine—Levels of Evidence (March 2009). Their website provides an excellent set of resources to learn more about how evidence can be used and misused in medicine.

Evidence has the potential to be stronger the higher up it is on the list. Be skeptical when someone tells you that a treatment will work because it’s “evidence-based.” This is not an unqualified endorsement.

We’ll describe a few of these levels below. Let’s say you are a researcher who is interested in whether a new medication helps treat a specific illness.

Near the lowest (weakest) level of evidence, you might follow one or more people who took the medication and describe what happens. This is called a case report or case series.

A case-control study tries to identify the cause of an illness. It’s not about treatment, so it wouldn’t help you with the new medication. It takes one group of people who have the illness and another group who don’t. Researchers look at the medical histories of both groups. They try to identify any differences that might explain why one group has the illness.

In cohort studies, you gather a pool of research subjects. Those who are already taking the medication go in one group, and those not already taking the medication go in another group. You follow both groups over time and identify the differences between them.

Randomized control trials are the most well-known evidence-based clinical studies. You gather a group of patients who all have the illness. Half of the people in the group, picked randomly, are given the medication, while the other half aren’t given the medication. You follow everyone over time and compare the outcomes of both groups. In a double-blind RCT, neither the patients nor the researchers that measure the results know which patients were given the medication.
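The random split at the heart of an RCT can be sketched in a few lines of Python. This is only a minimal illustration, since real trials use more careful allocation schemes. The function name `randomize` and the participant labels are made up for the example.

```python
import random

def randomize(participants, seed=None):
    """Randomly split participants into a medication group and a comparison group."""
    rng = random.Random(seed)      # seeded only so the example is repeatable
    shuffled = list(participants)  # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Hypothetical participant IDs
patients = [f"patient-{i}" for i in range(1, 11)]
treatment, control = randomize(patients, seed=42)
print("Medication:", treatment)
print("No medication:", control)
```

Because chance alone decides who lands in which group, any differences between the two groups at the start should also be down to chance.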

Systematic reviews (SR’s) or meta-analyses start with a literature search identifying previous studies. These may include RCT’s, case-control studies, or others. The similarities and differences between individual studies are analyzed and conclusions are drawn. These “studies of studies” can vary in quality depending on how the individual studies are selected, their quality, and their methodologies.

If you have ten studies using the same methodology, the same patient restrictions, etc., the meta-analysis can’t draw broader conclusions than the individual studies. But a meta-analysis could include many studies that varied greatly in overall study design. For example, it could include different medication doses, lengths of time, ages, genders, people with multiple physical and mental health conditions, and so on. A review of many large and varied studies that all show the same result is strong evidence.
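To give a feel for how a meta-analysis combines studies, here is a minimal Python sketch of one common pooling approach, inverse-variance (fixed-effect) weighting. We are not claiming that any particular review works this way. The three studies and their numbers are invented for illustration, and `pool_fixed_effect` is a hypothetical helper name.

```python
import math

# Hypothetical studies: (effect estimate, standard error)
studies = [(2.0, 1.0), (3.5, 0.5), (1.0, 2.0)]

def pool_fixed_effect(studies):
    """Inverse-variance weighted average: more precise studies count for more."""
    weights = [1 / se**2 for _, se in studies]
    pooled = sum(w * effect for w, (effect, _) in zip(weights, studies)) / sum(weights)
    pooled_se = math.sqrt(1 / sum(weights))
    return pooled, pooled_se

pooled, pooled_se = pool_fixed_effect(studies)
print(f"Pooled effect: {pooled:.2f} (standard error {pooled_se:.2f})")
```

The pooled standard error comes out smaller than that of any single study, which is why a good review can be more convincing than its parts. Pooling cannot repair flawed studies, though, or make narrow studies apply to more people.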

Quality of Evidence

Before trusting a study, you must review its methodology and quality. The size of the study, patient selection criteria, constraints, outcome measures, duration, dropouts, and many other factors of the study design have a significant impact on how results are interpreted. Have other researchers been able to take the same study design and replicate the results?

You may have a great study providing a low level of evidence. Or you may have a very flawed study which aims to provide a very high level of evidence. Neither is likely to be of much help to you. How do you know if a study is comprehensive and methodologically sound or deeply flawed? Read articles that reference the published study and see if they support it or denounce it and why.

As you can see from even this very superficial overview, the reality of medical evidence is not straightforward. We’re not suggesting you need to be an expert in experimental design or statistics. Instead, recognize that evidence can mean many things and can evolve over time. You should look for strong evidence, ideally many large, long-term, independent, and well-designed RCT’s. Be cautious when relying on small observational studies or anecdotal reports, especially from only a single source.

Is Significant Always Significant?

When people claim that treatments are evidence-based, they aren’t lying to you. They’re not trying to deceive or mislead you. They’re just speaking a different language. The word significant is a perfect example. Significant has a technical meaning in statistics, which is something along the lines of “unlikely to be a random fluke.” As in, we’re pretty sure the medication is better than nothing. How much better? That we don’t know. It could make you feel 0.1% better (i.e. who really cares) or 95% better (wow!). We’d refer to the latter as clinically significant, which is a completely different concept from statistical significance.

Here’s a silly example. Gather 5000 people. Give half of them, chosen at random, a placebo pill and the other half the medication you’re studying. Give them all a test of some kind. Let’s say that every single person who took the placebo scored 61/1000, while everyone who took your medication scored 62/1000. That is (statistically) significant because the likelihood the pill improved their score (vs. it being a random fluke) is very high.

Despite that, the result is likely not clinically significant. If someone offered you a 0.1% improvement in your mental illness, it’s not going to make a big difference in your life. Also, in real life, it would involve more than a single pill. To complicate matters even further, there are even different levels of statistical significance, e.g. “we’re confident it will be better 19 times out of 20” (ok, but not great) vs. “… better 999 times out of 1000” (very good).
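Here’s the silly example as a rough Python simulation. If every score were truly identical there would be nothing to test, so we assume (our invention, not part of the example) that scores vary a little from person to person, with a standard deviation of 2 points. The helper `two_sample_z` is hypothetical and uses a simple normal approximation. The 0.05 and 0.001 thresholds stand in for “19 times out of 20” and “999 times out of 1000.”

```python
import math
import random
import statistics

def two_sample_z(a, b):
    """Return (difference in means, two-sided p-value) via a normal approximation."""
    diff = statistics.mean(b) - statistics.mean(a)
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    z = diff / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return diff, p

rng = random.Random(0)  # fixed seed so the run is repeatable
placebo = [rng.gauss(61, 2) for _ in range(2500)]  # scores out of 1000
treated = [rng.gauss(62, 2) for _ in range(2500)]

diff, p = two_sample_z(placebo, treated)
print(f"Average difference: {diff:.2f} points out of 1000")
print(f"p-value: {p:.3g}")
print("Significant at 19 in 20 (p < 0.05)?", p < 0.05)
print("Significant at 999 in 1000 (p < 0.001)?", p < 0.001)
```

The result clears even the strict threshold, yet the difference is still only about one point out of 1000. The p-value tells you how sure you can be that a difference exists. It says nothing about whether that difference is big enough to matter.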

What Did the Studies Measure?

Mental health studies follow one or more groups of people for a period of time. They then evaluate some aspect of their mental health, e.g. are people more or less depressed? It is these outcomes or results that provide the conclusions for the study. The question you have to ask yourself is if the outcomes matter to you. In other words, are the symptoms that improved the same symptoms you have? In many cases, the answer is either a clear “no” or the study doesn’t provide enough information for you to know.

Again, let’s assume you have MDD. If someone told you they had a cure for all mental illness (and an RCT to prove it!), you’d be thinking snake oil. If they said instead it was shown to be effective for mood problems, you might not completely ignore them. If they said (and you believe) there was statistically and clinically strong evidence their treatment helped with MDD? Now they’ve probably got your attention. Does it apply to you? Maybe.

Two people can have two completely different sets of symptoms, presenting in two completely different ways, and be diagnosed with the same mental illness. How different can they be?

  • Being diagnosed with a mental illness usually means meeting a certain number of criteria (for MDD, it’s at least five from a list of nine). One person could meet criteria 1-5 and another criteria 5-9. They have only one in common.
  • Each criterion has countless alternatives, e.g. “depressed or irritable mood.” Compare this to some physical illnesses where the diagnosis is based on a blood test. Positive or negative? There’s far more variability in mental illness.
  • Severity can be completely different.
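A quick Python sketch makes the first point concrete. It treats the nine criteria as interchangeable and counts every way of meeting at least five of them. That is a simplification on our part, since real diagnostic rules add further conditions.

```python
from math import comb

person_a = set(range(1, 6))   # meets criteria 1 to 5
person_b = set(range(5, 10))  # meets criteria 5 to 9

shared = person_a & person_b
print("Criteria in common:", sorted(shared))

# Number of distinct sets of criteria that reach "at least five of nine"
combinations = sum(comb(9, k) for k in range(5, 10))
print("Ways to meet at least five of nine criteria:", combinations)
```

Even before accounting for severity or the alternatives within each criterion, that is hundreds of different ways to arrive at the same diagnosis.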

Even a specific mental illness like MDD can be astronomically diverse. Studies that fully recognize this are inordinately expensive and complicated to run. Most studies can’t do that, so their reported outcomes are very general. At best, they can offer possibility and future promise.

Most studies use rating scales to measure symptoms before and after. A common one for depression (HAM-D) has 17 questions about different symptoms. Someone interviews you to determine a rating for each symptom. They give you a score between zero (you don’t have that symptom at all) and either two or four (you’re greatly suffering from that symptom). They add up the ratings for all the questions and get a total score for how severe your depression is, as shown below.

Total score  Interpretation
8-13         Mild Depression
14-18        Moderate Depression
19-22        Severe Depression
≥23          Very Severe Depression

Interpreting scores on the Hamilton Rating Scale for Depression (HAM-D).

Imagine a study showed that a medication reduced MDD from severe to normal. This sounds spectacular! But if they measure it according to the total HAM-D score, that can happen by improving as few as three symptoms (out of 17).

By looking only at the total score, you have no idea which symptoms it improved or by how much. Errors in sampling may have led to more study participants with one set of symptoms. Even with a very large sample size, a medication that works really well on only a few symptoms would show a statistically significant effect. All this is great unless you have some of the symptoms that didn’t improve in the study. The medication has a much smaller chance of helping you than the overall result suggests. But you have no way of knowing because the study didn’t report individual symptoms, just the total HAM-D score.
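To see how a total score can hide the details, here is a small Python sketch. The item names and ratings are invented. We assume the three improved items are among those rated from zero to four, and we label anything below the lowest band in the table as normal.

```python
def band(total):
    """Map a total HAM-D score to the bands in the table above."""
    if total >= 23:
        return "very severe"
    if total >= 19:
        return "severe"
    if total >= 14:
        return "moderate"
    if total >= 8:
        return "mild"
    return "normal"  # below the table's lowest band

# Hypothetical ratings for the 17 items before treatment
before = {f"item {i}": 0 for i in range(1, 18)}
for i in (1, 2, 3):
    before[f"item {i}"] = 4  # three symptoms at their worst
for i in range(4, 11):
    before[f"item {i}"] = 1  # seven milder symptoms

# After treatment, only the first three items improve
after = dict(before)
for i in (1, 2, 3):
    after[f"item {i}"] = 0

unchanged = [name for name in before if before[name] == after[name]]
print("Before:", sum(before.values()), band(sum(before.values())))
print("After:", sum(after.values()), band(sum(after.values())))
print(len(unchanged), "of 17 items did not change")
```

If your own symptoms are among the items that never moved, the headline result tells you very little.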

Using a single number as a proxy for multiple individual measures is common. You ideally would like to see the effect on individual measures. That will help you determine if it’s (more) relevant to you. But if a study wants to look at many results, it needs to be much larger and more expensive. Achieving the same statistical significance as one with a single result increases the size and expense even further. There are other reasons why researchers like to report on a single measure. Saying “this medication is an effective treatment for depression” is a much better sound bite than “this medication effectively reduces feelings of guilt in patients with depression.” While the first has a better ring to it, it’s just not as useful when you’re making treatment decisions.
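Why do more outcomes demand a bigger study? Here is a rough Python sketch. If each of the 17 HAM-D items were tested separately at the usual 19-in-20 threshold, chance alone would be likely to produce at least one apparent “hit.” We assume the tests are independent, which real symptoms are not. We use the Bonferroni correction only as one simple, well-known way of tightening the threshold, not as what any given study does.

```python
alpha = 0.05   # the "19 times out of 20" threshold
outcomes = 17  # one test per HAM-D item (our assumption)

# Chance of at least one false positive if the treatment does nothing at all
family_wise_error = 1 - (1 - alpha) ** outcomes

# Bonferroni correction: divide the threshold by the number of tests
corrected_alpha = alpha / outcomes

print(f"Chance of at least one fluke: {family_wise_error:.0%}")
print(f"Per-outcome threshold after correction: {corrected_alpha:.4f}")
```

A stricter per-outcome threshold takes more participants to reach, and that is where the extra size and expense come from.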

Who Did the Studies Include?

Encapsulating many symptoms behind a single score can hide a lot about what the study is actually measuring, as you just saw. But it’s not only the outcomes that you need to be concerned about. It’s also the inputs, otherwise known as the people who participated in the study. How varied was that “random sample” of subjects? Does it provide a true representation of the overall population? Most importantly, does it represent you?

It’s now generally recognized that for decades, medical studies had a very strong gender bias. Most studies used men as test subjects. There are statistical benefits of tightly controlling the pool of subjects. This helps strengthen the result, even though it applies to fewer people. But it’s a stretch to broaden the conclusions without the data to back them up. For example, the claim that 70% of men who have a heart attack experience chest pain beforehand was based on evidence. Because nobody was studying women, everyone assumed they too had chest pain 70% of the time. When the studies on women were finally done, chest pain turned out to be much less common (around 58%). Other differences were even more pronounced.

A single study doesn’t need to cover every variation. But the more variations covered by multiple studies, the better. Other factors can include:

  • age (consider physical illnesses like heart disease and diabetes, now diagnosed at a younger age, and not commonly studied in that age group previously)
  • race and ethnicity (certain genetic traits are more common in some groups, including factors that affect how medications are metabolized; people of some races are far less likely to respond to some medications)
  • socio-economics (research has shown clear connections between economic status and patterns of diet, exercise, weight, sleep, and other factors that not only may affect the presentation of mental illness but also influence the effectiveness of treatments)
  • gender diversity (you would find few older studies that did not automatically exclude those who did not fit the binary male/female model, leaving many intersex and transgender patients with data that may or may not apply to them)
  • health status (are people with certain physical illnesses well represented or excluded, e.g. heart disease, COPD, colitis, migraine, and is there any difference between how the treatments affect those with these illnesses and those without)

The list goes on. Do vegans react to a medication differently than carnivores? Is exposure to certain types of pollution a factor? Hair colouring? People can make educated guesses about what factors make a difference, but without the data, they don’t know for sure. They can only be 100% sure of the effect on everyone if they test everyone.

Even a large study with great variation can be tricky. As with outcome measures, statistics can hide important variations in the subject population if not used well. If improvements were seen in the group as a whole, was it just one particular subset of test subjects that saw an improvement? Do they share something in common? Do you share that common factor? It could even be a different cause for their illness. This is something we can neither understand well nor reliably measure. A treatment that affects MDD due to one cause may not touch MDD caused by something different.
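A toy Python example shows how an average can hide a subgroup. The numbers are invented: 1000 participants, of whom 200 share some common factor and improve by 10 points each, while the other 800 do not improve at all.

```python
from statistics import mean

# Hypothetical improvements in score, by subgroup
improvements = {
    "shares the common factor": [10] * 200,
    "does not": [0] * 800,
}

everyone = [x for group in improvements.values() for x in group]
print(f"Whole study: average improvement {mean(everyone):.1f} points")
for label, values in improvements.items():
    print(f"  {label}: {mean(values):.1f} points ({len(values)} people)")
```

The study as a whole shows an improvement, but four out of five participants got nothing from the treatment. Whether it would help you depends entirely on which group you resemble.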

Mental illness is diverse, complex, and rooted in multiple causes. All of these make reliable and informative statistical sampling more challenging.

So… Give Up?

We are not saying that evidence doesn’t matter—far from it. But it’s important to recognize there are limitations to the evidence available. More importantly, don’t be manipulated by deceptive language. You want to choose the treatments that have the best chance of helping you and treating your illness. Try to have realistic expectations. You don’t need more disappointment.

Some treatments are supported by strong evidence. This includes larger studies, broader sampling, and better outcome measures. Studies that can be independently replicated and high-quality meta-analyses are also important.

Evaluating the quality and applicability of research is something you should be doing with your doctor or other health professionals. Don’t just bring some paper and say you’re convinced the treatment it describes is credible and applicable to you, so you’re going to try it. Discuss with them which studies and treatments best apply to your illness. Remember that they have extensive expertise in interpreting medical literature. Use that to your (and their) advantage.

Accept that high-quality evidence isn’t available to answer every question. Sometimes doctors have to rely on gut instinct or clinical intuition. They’ve seen that “certain types” of people do well on one medication but not another, and that’s why they recommend it to you. They’re filling gaps in the evidence with (ad hoc, informal, and subconscious) observational studies of their own.

What we hope is to lower your expectations to match reality. Finding the right treatment for you may not be a quick or easy process. You will likely have to try several different things. Some may partly work, and some won’t work at all. That’s not a failure. It’s a necessary part of treating your illness.


  • An evidence-based treatment is not a guarantee. Evidence is not absolute. This is particularly true in mental health, where one illness can have different symptoms and different causes.
  • Stronger evidence is obtained by reducing variability and the scope of investigations. That decreases the chance that the evidence will apply to you.
  • There are many different types of evidence-producing studies. Some are more convincing than others, and results are more convincing still when other researchers replicate the studies and get the same results.
  • Interpreting evidence is incredibly difficult and terms like “significant” are often used in a technical manner which can be misleading to casual readers. Doctors, in particular, have the background that can help you interpret evidence.