Further examples

The issue of levels of sampling comes up right across biology. Imagine you are interested in comparing the microbial biodiversity of soil samples collected from conventional and organic farms. Measuring the microbial diversity of a soil sample is likely to be an inherently imprecise process, for example involving plating out samples and culturing different microbial groups. This process is subject to the risk of contamination and to all manner of other uncertainties. For each sample it would therefore be sensible to split the sample, measure biodiversity by a range of different techniques, and then draw these results together to produce a consensus score for that sample.

If we are testing, in a laboratory experiment, how effective an invasive bumble bee species is at pollinating flowers compared with a native species, then in our experimental design we have to decide (i) how many colonies of each species to use, (ii) how many worker bees to use from each colony, and (iii) how many different species of flower to observe visits to. The decision will come down in part to which of these factors contributes most to variation in the pollination success of individual flower visits.

Statistical analyses

In our example of trees and forests we talked of taking the mean value of our measurement (number of beetle species) across the trees in each forest, and then making a comparison between forests of the two different types (coniferous or broad leaved). You might make such a comparison using a t-test, for example, with the two sample sizes being the number of forests you visited of each type. If we included each tree as a separate sample in this statistical test then we would have committed pseudoreplication, since two trees in the same coniferous forest are likely to be more similar to each other than two trees from different coniferous forests. However, there is another way to do the analysis in which you can include every tree as a data point. If for each tree we recorded a unique identifier for the forest that the tree came from (forestID) and the type of that forest (foresttype: coniferous or broad leaved), you could then carry out a general linear model using both foresttype and forestID as factors (forestID being nested within foresttype, since each forest is of only one type) and the recorded number of beetle species on each tree as the response variable. In fact this more complex analysis does not give you any more power to test for a difference between the two forest types than the simpler t-test comparing forest-level means. However, as well as testing for a difference between types, the general linear model gives you a measure of the variability between forests of the same type. It is up to you to consider whether this extra information is worth the added complication of the analysis.
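
As a rough illustration of the forest-level comparison, the following Python sketch collapses tree-level records to one mean per forest and computes a two-sample t statistic on those means. It then computes the same statistic with every tree treated as an independent sample, purely to show how pseudoreplication inflates the degrees of freedom. Everything here is an assumption made for illustration: the data are simulated with invented means and standard deviations, the values are continuous rather than true counts, the function names are hypothetical, and the t statistic is computed by hand from the standard pooled-variance formula without a p-value.

```python
import random
import statistics


def simulate_trees(n_forests_per_type=5, n_trees_per_forest=10, seed=1):
    """Return invented rows of (foresttype, forestID, beetle_species)."""
    rng = random.Random(seed)
    rows = []
    forest_id = 0
    for foresttype, type_mean in (("coniferous", 12.0), ("broadleaved", 15.0)):
        for _ in range(n_forests_per_type):
            forest_id += 1
            # Trees in one forest share a forest-level mean, so they are not independent.
            forest_mean = rng.gauss(type_mean, 3.0)
            for _ in range(n_trees_per_forest):
                rows.append((foresttype, forest_id, rng.gauss(forest_mean, 2.0)))
    return rows


def forest_means(rows):
    """Collapse tree-level rows to one mean per forest, grouped by forest type."""
    per_forest = {}
    for foresttype, forest_id, value in rows:
        per_forest.setdefault((foresttype, forest_id), []).append(value)
    by_type = {}
    for (foresttype, _), values in per_forest.items():
        by_type.setdefault(foresttype, []).append(statistics.mean(values))
    return by_type


def pooled_t(a, b):
    """Student's two-sample t statistic (pooled variance) and its degrees of freedom."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / df
    se = (pooled_var * (1 / na + 1 / nb)) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / se, df


rows = simulate_trees()

# Correct unit of analysis: one value per forest.
means = forest_means(rows)
t_forest, df_forest = pooled_t(means["coniferous"], means["broadleaved"])

# Pseudoreplicated version, shown only as a warning: every tree treated as independent.
trees = {}
for foresttype, _, value in rows:
    trees.setdefault(foresttype, []).append(value)
t_tree, df_tree = pooled_t(trees["coniferous"], trees["broadleaved"])

print(f"forest-level: t = {t_forest:.2f}, df = {df_forest}")
print(f"tree-level (pseudoreplicated): t = {t_tree:.2f}, df = {df_tree}")
```

With five forests of each type and ten trees per forest, the forest-level test has 8 degrees of freedom, whereas the tree-level version claims 98. The trees did not supply that much independent information about forest type, which is why the second analysis should not be trusted.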

Relation to cluster sampling

Subsampling has strong practical and conceptual links to cluster sampling. Imagine that we want to interview pupils in their final year of schooling in a particular country to explore their career ambitions. We could assemble a database of all the final-year pupils in the country and then draw a random sample of 1,000 pupils from that database. Quite likely all of these pupils would go to different schools. The interviewer would have to spend a great deal of time travelling, and would have to contact 1,000 different schools to arrange access to the pupil and a space in which to conduct each interview. To reduce the costs of travel and organization, we could take a different approach to sampling: we select 50 schools from a list of all the schools in the country, visit each of them, and at each one select 20 pupils at random from the final year to interview. Thus, each school is a ‘cluster’ of pupils. Two pupils from the same school would not be independent measures of the career aspirations of pupils in that country, since they have likely been exposed to the same career-orientated sessions within their school. Thus, when analysing the data, our unit of analysis should be the school, and we should aggregate the responses of individuals within each school. In this way we can see that cluster sampling is functionally equivalent to the subsampling that we discuss in the book. However, the drivers of the two approaches are likely to be different. As discussed above, the attraction of cluster sampling is practical convenience, whereas the driver behind subsampling is increased precision.
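
The two-stage selection described above (50 schools, then 20 pupils within each) can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than a recommended survey procedure. The sampling frame of 400 schools, the school and pupil labels, the yes/no responses and the per-school score are all invented, and simple random selection of schools is assumed.

```python
import random

rng = random.Random(42)

# Hypothetical sampling frame: 400 schools, each with 60 to 200 final-year pupils.
schools = {
    f"school_{i:03d}": [
        f"school_{i:03d}_pupil_{j:03d}" for j in range(rng.randint(60, 200))
    ]
    for i in range(400)
}

# Stage 1: choose 50 schools (the clusters) at random from the list of schools.
chosen_schools = rng.sample(sorted(schools), k=50)

# Stage 2: choose 20 final-year pupils at random within each chosen school.
interviews = {school: rng.sample(schools[school], k=20) for school in chosen_schools}

# Invented yes/no answers (1 or 0) standing in for the real interview responses.
responses = {
    school: [rng.randint(0, 1) for _ in pupils]
    for school, pupils in interviews.items()
}

# Aggregate within each school, so that the school is the unit of analysis.
school_scores = {school: sum(r) / len(r) for school, r in responses.items()}

n_pupils = sum(len(pupils) for pupils in interviews.values())
print(len(interviews), "schools,", n_pupils, "pupils")  # 50 schools, 1000 pupils
```

Any later comparison would then use the 50 values in `school_scores`, one per school, rather than the 1,000 individual answers.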
