Supplementary section 6.5: Type I and type II errors

Let’s think a little about the different outcomes that we might get from an experiment that examines whether the size of a seed beetle is affected by competition between individuals that grow within the same plant seed. The four possible outcomes of such an experiment are shown in the table below:

                                 What our experiment detects
What the real world is like      No effect of competition  Effect of competition
Competition does affect size     Type II error             Correct conclusion
Competition doesn’t affect size  Correct conclusion        Type I error

Let’s begin by thinking about the first row of the table. Here we are assuming that competition really does affect the size of beetles. If our experiment detects this difference as statistically significant (that is, unlikely to have occurred by chance), then we draw the correct conclusion that size is affected by competition. However, if we do not detect a statistically significant difference in the size of the beetles, our experiment will lead us to the incorrect conclusion that competition does not affect size when in fact it does. This is referred to as a type II error. We can then define the type II error rate as the probability of making a type II error, or the probability that our experiment doesn’t detect a difference when in fact there is a difference to detect. Observant readers will spot that this is the opposite of the definition of power given in the book. In fact, power can be thought of as the probability of not making a type II error, and the two are related by the simple equation:

Type II error rate = (1 − power).
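
To make this relationship concrete, here is a minimal simulation sketch in Python (using NumPy and SciPy; the group sizes, mean sizes, and standard deviation are invented purely for illustration). It repeatedly "runs" the beetle experiment in a world where competition really does reduce size, counts how often a t-test comes out significant, and reports that proportion as the estimated power, with one minus it as the estimated type II error rate.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)

    n_per_group = 20      # beetles per treatment group (hypothetical)
    mean_no_comp = 2.0    # mean size without competition (arbitrary units)
    mean_comp = 1.8       # mean size with competition: a real effect exists
    sd = 0.3              # within-group standard deviation
    alpha = 0.05          # significance level (the type I error rate)
    n_runs = 10_000       # number of simulated experiments

    significant = 0
    for _ in range(n_runs):
        no_comp = rng.normal(mean_no_comp, sd, n_per_group)
        comp = rng.normal(mean_comp, sd, n_per_group)
        if stats.ttest_ind(no_comp, comp).pvalue < alpha:
            significant += 1

    power = significant / n_runs
    print(f"Estimated power:              {power:.2f}")
    print(f"Estimated type II error rate: {1 - power:.2f}")  # equals 1 - power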

Now let’s consider the second row of the table, the world where competition has no effect. In this situation, if our experiment does not detect any difference between the groups of beetles, we will draw the correct conclusion that competition does not affect size. However, if our experiment does detect a significant difference between the groups of beetles (because by chance we have picked unusually big beetles in one group and unusually small ones in the other), we will mistakenly believe that competition does affect size when in fact it does not. This is referred to as a type I error, and again the type I error rate is the probability of making a type I error. The type I error rate of an experiment is entirely under the control of the experimenter, and is determined by the significance level chosen for the statistical test. By convention, a type I error rate of 0.05 (or 5%) is regarded as acceptable in the life sciences. This means that, when there really is no effect, there is a 1 in 20 chance that we will make a type I error.

Now 1 in 20 might sound quite high to you: why don’t we set a smaller type I error rate of 1 in 100, or 1 in 10,000? Surely this would mean we were wrong less often. The problem is that the type II error rate of an experiment is also affected in part by the type I error rate chosen by the experimenter. If we reduce the probability of making a type I error, we automatically increase the chance of making a type II error. We cannot simultaneously minimize both kinds of error, so in the end we have to come to some sort of compromise, and this compromise is generally a type I error rate of 0.05. Of course, in situations where the consequences of the two types of error differ greatly, we might choose a different value. For example, in deciding whether a substance has harmful effects on human embryos, we might decide that the consequences of failing to detect a real effect (making a type II error) are much worse than those of mistakenly finding an effect that does not really exist (making a type I error), and increase our type I error rate accordingly. On the other hand, in deciding whether an expensive drug is an effective treatment, we might decide that mistakenly concluding that the drug is effective when it is not would be very serious, and reduce our type I error rate.
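
The trade-off can be seen in a rough calculation. The sketch below (Python, using a normal approximation to a two-sample test, with an invented effect size and sample size) shows that as we shrink the significance level from 0.05 down to 1 in 10,000, the type II error rate of the same experiment climbs steeply.

    import math
    from scipy import stats

    effect = 0.2                           # true difference in mean size (hypothetical)
    sd = 0.3                               # within-group standard deviation (hypothetical)
    n_per_group = 20
    se = sd * math.sqrt(2 / n_per_group)   # standard error of the difference in means

    for alpha in (0.05, 0.01, 0.0001):
        z_crit = stats.norm.ppf(1 - alpha / 2)       # two-sided critical value
        power = stats.norm.sf(z_crit - effect / se)  # normal approximation to power
        print(f"alpha = {alpha}: type II error rate approx. {1 - power:.2f}")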

Type I and type II errors and an additional cost of low-powered experiments

In the book we focused on the importance of high statistical power as a means of ensuring that we are not wasting time, money, and resources on studies of little value. The main cost that we emphasized was the risk of failing to detect a biological effect that was there to be detected; in other words, of making a type II error. However, there is an additional, more subtle cost to conducting low-powered studies. When studies are low-powered, even the effects they do detect are more likely to be false positives, or type I errors. To understand why, consider the following example. A team of researchers are screening chemicals that might be used as potential therapeutic drugs, and they have 100 to try. Now let us assume that five of these chemicals have genuine therapeutic effects, and the other 95 have no effect at all. To put this another way, in five of their studies the null hypothesis is false, whilst in the other 95 studies the null hypothesis is true.

Let’s start by considering what would happen if our researchers ensured that their studies had high power, say 80%. In that case we would expect them to detect the therapeutic effects of four of the five chemicals that work, and to fail to detect one of them. That is, we would expect four true positives. What do we expect to happen in the 95 trials on the chemicals that have no effect? If we assume the researchers set a type I error rate of 1% (i.e. they will accept as significant any p-value of less than 0.01), then we expect them to detect false positive effects in about one of the 95 trials (or 0.95 to be precise!). Put another way, of all the positive results they obtain in this research program, about 1 in 5 are false positives.

However, imagine that, to save money, they had decided to use fewer mice in their experiments, and that as a consequence their power in each trial had dropped to 20%. The expected number of false positives remains the same (about one), but the expected number of true positives has dropped from four to about one. The attempt to save money now means that about half of the positive results they obtain are likely to be false positives.
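
The expected numbers quoted in this example follow from a few lines of arithmetic; the sketch below (Python) simply repeats the calculation for both power levels, so you can see how the fraction of false positives among all positives changes.

    n_real, n_null = 5, 95        # chemicals with and without a genuine effect
    alpha = 0.01                  # type I error rate used in the example

    for power in (0.80, 0.20):    # high-powered vs low-powered trials
        true_pos = power * n_real     # expected genuine effects detected
        false_pos = alpha * n_null    # expected false positives
        frac_false = false_pos / (true_pos + false_pos)
        print(f"power = {power:.0%}: {true_pos:.2f} true positives, "
              f"{false_pos:.2f} false positives, "
              f"{frac_false:.0%} of positives are false")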

So low-powered studies do not simply mean that you are less likely to detect positive effects in situations where there is something to be found; they also mean that when you do detect positive effects, those effects are more likely to be wrong.
