A significant problem: will 95% significance continue to cut it?

What comes to mind when you read that scientists have found a significant relationship between, say, coffee drinking and throat cancer? If you have taken a class in statistics, the word “significant” likely sticks out. In statistics, this word means something quite different from what it does in general parlance. When a scientist says that a certain finding is significant, all they mean is that we can be “mostly confident” that it is true. This is quite a downgrade from our everyday notion of the word! For most of us, a “significant” relationship between coffee and cancer means an important relationship; a big relationship. The message is: Stay away from coffee! In statistics, all it means is that there is probably some kind of relationship there. It could be big; it could be small; but there is probably something happening.

Tempered though it is, this statistical notion of significance is integral to modern science. Let’s say you are a scientist trying to find out whether a new pill is effective. Typically, the way to do this is to administer the pill to one group of people, administer a placebo to another group, then measure the difference in outcomes between the two groups after the treatment period. Let’s suppose that you are conducting one such trial to test the effectiveness of a drug for reducing blood pressure. You find that the group who got the pill had an average reduction of three units of blood pressure, and that the group who got a placebo had an average reduction of one unit. Is there a difference between the two groups? Yes, it’s two units, but that isn’t the important question. It’s possible that the difference is a fluke, after all. The important question is this: Is that difference statistically significant?
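To make this concrete, here is a rough sketch in Python of one common way of answering that question: a permutation test. The numbers are invented for illustration; the idea is to ask how often a difference as big as the one we observed would arise if the pill did nothing and the group labels were meaningless.

```python
import random

# Hypothetical data (invented for illustration): blood-pressure reductions
# for a treated group (mean 3.0) and a placebo group (mean ~1.0).
treated = [4.1, 2.8, 3.5, 1.9, 3.2, 2.6, 4.0, 2.4, 3.3, 2.2]
placebo = [1.2, 0.4, 1.8, 0.9, 1.5, 0.2, 1.1, 1.6, 0.7, 1.0]

observed = sum(treated) / len(treated) - sum(placebo) / len(placebo)

# Permutation test: if the pill did nothing, the group labels are arbitrary,
# so shuffle them repeatedly and see how often a difference at least as
# large as the observed one arises purely by chance.
pooled = treated + placebo
n, count, trials = len(treated), 0, 100_000
rng = random.Random(0)
for _ in range(trials):
    rng.shuffle(pooled)
    diff = sum(pooled[:n]) / n - sum(pooled[n:]) / n
    if abs(diff) >= abs(observed):
        count += 1

# The +1 correction avoids reporting an impossible p-value of exactly zero.
p_value = (count + 1) / (trials + 1)
print(f"observed difference: {observed:.2f}, p-value: {p_value:.5f}")
```

With data as cleanly separated as this, the p-value comes out tiny; with noisier, more realistic data it would be far larger, which is precisely why the question of significance arises.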

That is, is the difference between the two groups big enough for us to say that it probably wasn’t a fluke? There are statistical techniques for answering this question, and they form the backbone of modern science. What we are trying to do here is avoid false positives – saying there is an effect when in fact there is none.
For decades, the convention in science has been that the acceptable chance of a false positive is 5%. The probability computed from the data is known as the “p-value”, with ‘p’ standing for ‘probability’, and the 5% threshold it is compared against is the significance level. In other words, scientists generally hope that the results of their experiments yield p-values of less than 5%. In the above case of our hypothetical blood pressure drug, we would be able to claim that the drug works if we can show that the chances that the difference between the two groups was a fluke are less than 5%. There doesn’t seem to be much of an issue there, so where does the controversy come in?

Well, imagine that rather than carrying out a single study, we are looking at the results from ten different studies, perhaps to get a better picture of the effectiveness of a particular drug. The studies we are looking at have the following chances of false-positive results: 3%, 5%, 2%, 1%, 8%, 10%, 6%, 4%, 5%, 4%. These look like pretty good odds in favour of this drug, don’t they? Sure, there was one study that had a one-in-ten chance of falsely reporting a positive result, but one-in-ten isn’t that bad. Common sense would dictate that the above studies largely support the effectiveness of this drug. However, in much of the scientific literature, such a collection of studies would be regarded as a “mixed” evidence base. Why? Because the convention is that a result is considered not statistically significant if the chance that it arose by fluke is above 5%.
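A quick back-of-the-envelope calculation shows why common sense is right here. If the drug truly had no effect, each study’s p-value would be uniformly distributed between 0 and 1, so the probability that all ten studies would come out at or below the values listed above is simply the product of those values:

```python
# If the drug had no effect, each study's p-value would be uniformly
# distributed between 0 and 1, so the chance that all ten land at or
# below their observed values is the product of those values.
p_values = [0.03, 0.05, 0.02, 0.01, 0.08, 0.10, 0.06, 0.04, 0.05, 0.04]

prob_all_this_small = 1.0
for p in p_values:
    prob_all_this_small *= p
print(prob_all_this_small)  # ~1.2e-14: essentially impossible under "no effect"
```

That works out to roughly one in a hundred trillion, which is why calling this evidence base “mixed” is so misleading. (This intuition is what formal methods for combining p-values, such as Fisher’s method, build on.)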

This steadfast commitment to the arbitrary 5% cut-off for statistical significance has led to the formation of some appalling misconceptions in the scientific community. A 2017 study that analysed 791 articles across five journals found that around half of them mistakenly assumed that a non-significant result meant no effect. To use our blood pressure drug example, this is like saying that if the chances that we got our positive result by fluke are 6%, this means the drug is ineffective.

This is the kind of finding that makes one wonder where scientists get their reputation for being intelligent. A 94% chance that the result you found is correct is not that different from a 96% chance, yet the way science is practiced today, this two-point difference is treated as absolute. In the minds of many scientists, a 94% chance that the drug works is translated to “the drug does not work”, while a 96% chance that the drug works is heard as “the drug works”. A body of literature on this drug, where there is everything from a 1% chance to a 15% chance of a false positive, is thus often interpreted as “mixed” regarding the effectiveness of the drug.

How can we put an end to this? One way, surely, is to design better studies. We could design studies so well, and replicate them so precisely, that whatever p-value we get for one study, we are likely to get close to it in another. In other words, instead of one study on the effectiveness of a drug having a p-value of 4% and another having a p-value of 7%, if we truly did replication properly, those figures would be more like 4% and 5%, or even 4% and 4%. Yet even setting aside the practical obstacles, such efforts would be doomed to fail, because even perfect replications are unlikely to yield similar p-values, for purely statistical reasons. As Valentin Amrhein, Sander Greenland, and Blake McShane pointed out in a recent comment in Nature, for two identical studies each with an 80% chance of achieving a p-value below 5%, it would not be very surprising for one to get a p-value of 0.1% and the other a p-value of 30%. Simple random variation means that both results are well within the realm of possibility, even if the replication is perfect. This is just one more reason that arbitrary sharp cut-offs for “significance” are unhelpful: even perfect replications are not interpreted correctly in the current paradigm.
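We can check this point with a small simulation. Assume, as a simplification, that each study’s test statistic is normally distributed with standard deviation 1 and mean 2.80 (the value that gives roughly 80% power at the two-sided 5% level), and look at how the resulting p-values spread out:

```python
import math
import random

rng = random.Random(1)

def two_sided_p(z):
    # Two-sided p-value for a standard-normal test statistic:
    # 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))

# Simplifying assumption: with 80% power at the 5% level, each study's
# test statistic is approximately normal with mean 2.80 and sd 1.
MEAN_Z = 2.80
p_values = [two_sided_p(rng.gauss(MEAN_Z, 1)) for _ in range(100_000)]

below_001 = sum(p <= 0.001 for p in p_values) / len(p_values)
above_30 = sum(p >= 0.30 for p in p_values) / len(p_values)
print(f"share with p <= 0.1%: {below_001:.2f}, share with p >= 30%: {above_30:.2f}")
```

In this simulation, roughly three in ten of the identical “studies” come out below p = 0.1%, while a few per cent come out above p = 30%, even though every run is a perfect replication of the others.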

Thus, the real problem here is not one of statistics, but one of human interpretation. By having a box labelled “significant” and a box labelled “not significant”, we can know instantly what to pay attention to and what to ignore. Is the chance of a false positive above 5%? Ignore the study; it found no effect. Is it below 5%? Great, we have an effect. The truth, however, is that this tendency towards neat categorization (or “dichotomania”, as some have called it) obscures a more subtle reality: plenty of genuine effects are counted as “insignificant”. To arbitrarily (and yes, this was done arbitrarily) set a cut-off point for “significance” is to invite misunderstanding and oversimplification. Recall again that around half of the papers examined in one study assumed that insignificance meant “no effect”. Just consider how nonsensical it is for a drug to be considered free of a certain side effect simply because tests showed we can only be 94% confident that it has that side effect, instead of the required 95%.

This kind of illogic is rampant in the scientific literature, and it has important implications. Whether a study’s findings are “significant” often determines whether it is used for policy making, whether a new line of research is pursued on its basis, or even whether it is published in the first place. On this last point, it is well known that there exists a “publication bias” in academia, with studies finding significant effects being more likely to be published than those that don’t. It makes intuitive sense why this would be the case, too. Which article would you be more likely to read: “Study finds probably no link between coffee and cancer”, or “Study finds possible link between coffee and cancer”? This becomes a problem, however, because publishing papers is the lifeblood of an academic’s career. When journals are more likely to publish papers which find significant effects, this creates a malign incentive for researchers to fiddle with their statistical analyses so that they get significant results (a practice so common it’s been given a name: “p-hacking”).

One way this is often done is by not defining which variables you are looking at before you run your analyses. Take an example given by Naomi Altman and Martin Krzywinski in Nature in 2016. They consider a hypothetical study where we are trying to determine which physiological variables predict blood pressure. There are 100 participants in this study, and we are measuring 10 physiological variables. We also suppose that none of these variables is actually predictive of blood pressure. If you take one of these variables and run a statistical test to determine whether it is related to blood pressure, the chance that the test will (falsely) show it to be related is 5%. However, if we run this statistical test on all ten variables, the chance that at least one of them will come up statistically significant is around 40%. This is simply because each variable has a certain chance of being (falsely) found to be related to blood pressure, so the more variables you look at, the higher the chances that you will find one which appears to be related. Of course, if we were to run the same experiment again, we’d probably find that a different variable came up as related.
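The 40% figure follows directly from the 5% false-positive rate compounding across ten independent tests:

```python
# Chance that at least one of ten unrelated variables comes up
# "significant" at the 5% level purely by accident:
# 1 minus the chance that all ten correctly come up non-significant.
alpha, k = 0.05, 10
prob_at_least_one = 1 - (1 - alpha) ** k
print(f"{prob_at_least_one:.2f}")  # prints 0.40
```

Each extra variable tested is another roll of the dice, which is why declaring in advance which variables you will analyse matters so much.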

This is a clear example of statistical malpractice, but there is little stopping many researchers from doing it. As long as you don’t define which variables you are investigating ahead of time, you are free to pick whichever variables appear to have the strongest effect. The results you get, then, are often not genuine effects but statistical artefacts. As a researcher, you are incentivised to do this because stronger and more significant effects are more likely to be published. This wastes time and money, and it undermines the credibility of the scientific project.

This kind of practice has been cited as one reason for the replication crisis currently plaguing the behavioural sciences, and it is another reason that a number of scientists are now calling for an end to the notion of statistical significance. Instead, they argue that studies should be considered more holistically, with p-values forming just part of the picture. This more all-encompassing view of a study would also take into account the background evidence, the quality of the data, the study design, and the mechanisms underlying the effect. This broader evaluation of a study’s results would lead to a far greater understanding of the subjects under investigation. Furthermore, adopting it would give academics a better incentive to publish quality research regardless of the p-values they obtain. The main issue with “significance” is that it is binary, and reality is not. If the goal of science is to uncover reality, then this should be reflected in its methods.