4 Rigorous ecology needs rigorous statistics

Mick Keough and Gerry Quinn

How to cite this chapter

Keough M. J. and Quinn G. P. (2023) Rigorous ecology needs rigorous statistics. In: Cousens R. D. (ed.) Effective ecology: seeking success in a hard science. CRC Press, Taylor & Francis, Milton Park, United Kingdom, pp. 49-62. doi:10.1201/9781003314332-4

Abstract

Statistical analysis is central to how most ecologists assess the credibility of their ideas, so we rely on getting it right. The complexity of ecological systems makes the task difficult and provides additional opportunities to get it wrong, most often because of a mismatch between the statistical model fitted to the data and the ecological model of interest. This chapter outlines how the statistical toolbox used by ecologists has changed, and some risks and opportunities that accompany these changes, particularly for mixed effects models. There are also some old chestnuts that still lead us astray. We provide a brief guide to help ensure that statistical analyses answer ecological questions, rather than letting us be fooled by noise or inappropriate analyses.

4.1 Using models to distinguish signals from noise

Most ecologists use data to answer questions or to develop new ones. When answering questions, we take a model or hypothesis about how the world works and compare it to data, to see whether the model is consistent with the data (Chapter 1).

In developing new questions, we typically take a broad verbal model with lots of possible explanations and use the data to refine those explanations into a smaller number with the most support. These explanations are often themselves questions that need to be challenged with further data, particularly when we are interested in inferring causal relationships.

The process of matching ecological models and data uses statistical analysis to help us distinguish between ‘signals’ (the ecological processes of interest) and background ‘noise’. This step involves translating an ecological model into a formal statistical one. Analysis gives us confidence that we’re not being fooled by noise (Gelman and Loken 2014) or by our preconceptions.

Our conclusions, decisions about next research steps and advice we might offer to end-users all rely on our ability to separate signal from noise and to characterise that signal accurately and precisely.

This task, common to most biological disciplines, is particularly difficult for ecologists because our background noise is complex and often large. Working on whole organisms integrates the noise from subcellular levels up, and community and ecosystem work adds further emergent properties. Sources of noise also broaden as we move from controlled laboratory situations to the field. As we try to establish study systems at meaningful spatial and temporal scales, pragmatic decisions mean that we often finish with small sample sizes (and low power) that hinder the separation of signal from noise.

Good study design enables us to reduce noise and increase the likelihood of discovering ecological effects, while the statistical analysis appropriate to that design allows us to draw valid conclusions about those effects from the data.

4.2 Hard, but critical

Separating signal from noise is difficult, but we still need to respond to what our statistical analysis tells us. We need two things from an analysis: evidence that we are not being fooled (that is, that there is a signal from the ecological effect of the process(es) under examination) and a description of the nature of that signal. Once we know the signal, we can think about its ecological importance.

An ecological study does not exist in isolation. For many of us, it will be one step in a process towards deeper understanding of a topic, and we would like to know the steps that follow from the result. We also want peers to see how our work changes their understanding of the topic and how it might influence their next steps. For example, if the work is at the behest of a natural resource management agency, we want to know what the results mean for the management problem that prompted the research – should some management options be moved aside and others raised in priority?

At a personal level, the conclusions we draw from a statistical analysis guide our future research path. Errors are easy to make, and common. If we get fooled, we are also likely to fool others who read our work. We then need to rely on the self-correcting nature of scientific discovery, which recent reviews show not to work especially well, because repeating studies is rare (e.g. Fidler et al. 2017). At best we move briefly down an incorrect path, until our understanding is corrected. At worst, we and our peers are sent down an incorrect path for an extended period. The conclusions from an analysis also influence where we disseminate the work. In academia, exciting results are submitted to higher profile journals and more mundane results to lower ranked journals. Inconclusive results may lead to non-publication, entailing considerable ‘research waste’ (Purgar et al. 2022). Where we publish still matters, particularly for those early in their careers, as long as selection panels, funding agencies and promotion criteria keep using relatively uninformative journal-based metrics rather than consideration of the impact of individual research outputs.

Robust ecology relies on rigorous statistical analysis. For this link to be strong, we need to be very clear about the logical and statistical issues involved and aware of the pitfalls that can weaken it. Statistics is something that requires deep thought and care: it is not something that should be done by accepting defaults in software packages or thoughtless selection of drop-down menu options. Some methodological issues are relatively recent, while others are recidivists. We’ll outline a few of these issues and suggest ways of avoiding pitfalls. We will do this using examples from ecological experiments, while recognising that many of the issues are also applicable to other kinds of data collection.

The past decade has seen an increase in the diversity of approaches ecologists use for data analysis, particularly the use of mixed effects models and more sophisticated methods for model selection. Much of this diversity is summarised in the book edited by Fox et al. (2015), Ecological statistics: contemporary theory and application, and reflected in dedicated journals such as the British Ecological Society’s now open-access Methods in Ecology and Evolution.

4.3 Some things do not change

As the range of methods expands, some issues around statistical analysis still bubble away. We still argue about the ‘right’ way to draw statistical inferences, most notably in the debate between frequentist and Bayesian approaches (the ‘statistics wars’; Mayo 2018). These arguments are like low-level disturbances, recurring without obvious impact. In broad biological circles, and within them ecological ones, the discussion can be relatively superficial, but there are some important issues of statistical philosophy involved (Spanos 2019). Each approach has its strengths and weaknesses, and a logically coherent framework can be developed for each (e.g. Mayo 2018; Gelman and Shalizi 2013). It is critical that we understand exactly how the different statistical approaches are used to answer ecological questions. Robust conclusions depend far more on how well data are collected and on getting the design and statistical models right than on the inferential approach used, a point also made by McCarthy (2015). The right way to analyse data is to fit a statistical model appropriate to the data that is also linked directly to the ecological model or question.

Dubious statistical practices are highlighted periodically. The unethical sifting through data to find a response variable that is ‘statistically significant’ – fishing or P-hacking – is still a concern and can be hard to detect. As we use more complex mixed models, there is more scope for this behaviour by manipulating the classification of predictors as fixed or random (Box 4.1). Preregistration of questions, sampling designs, and statistical models can help prevent this problem (Gelman and Loken 2014), although its implementation can be difficult or flawed (Claesen et al. 2021).

Within these broad statistical discussions, there has been some positive change. The most notable is the shift away from using the term ‘statistically significant’ (Wasserstein et al. 2019), a debate that covers several issues: the meaning of ‘significant’, the use of P-values, and thresholds for separating signal from noise. These issues are not new, and have been discussed by ecologists (e.g. the papers curated by Ellison et al. 2014). For ecologists, there are four important points:

‘Statistically significant’ indicates that a signal has been detected in the data, but does not reflect the biological importance of that signal. A disappointingly high proportion of results sections of ecological papers and theses start with ‘we found a significant effect of…’, usually referring to a threshold like P < 0.05. This description does not convey the strength of the effect, which is likely the thing of most interest to readers.
Conversely, a statistically non-significant result is not automatically evidence of a weak or non-existent effect. It means only that no signal has been detected; whether that result is because the signal is weak or we have limited detection capacity requires further work.
P-values, while frequently misused, can be an important part of an analytical toolbox if interpreted correctly and used in conjunction with measures such as effect size (Wasserstein et al. 2019).
Thresholds, including those based on P-values, can be part of deciding whether we are being fooled by noise or see a signal, again if they are used carefully (Mayo and Hand 2022). Regardless of how we do it, we need to make decisions about signal and noise in deciding how we use the results of an ecological study.
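The first two points can be illustrated with a minimal sketch. It assumes a simple comparison of two group means with a known, common standard deviation and uses a normal (z) approximation; the numbers (a 5-unit difference, a standard deviation of 10) are hypothetical and not drawn from any study cited here. The same effect gives very different P-values as the sample size changes, while the interval shows what the data actually say about the size of the effect.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used for the z approximation


def standard_error(sd, n_per_group):
    """Standard error of a difference between two group means (equal n, common sd)."""
    return sd * sqrt(2.0 / n_per_group)


def two_sided_p(diff, sd, n_per_group):
    """Two-sided P-value for the observed difference, z approximation."""
    z = abs(diff) / standard_error(sd, n_per_group)
    return 2.0 * (1.0 - STD_NORMAL.cdf(z))


def interval_95(diff, sd, n_per_group):
    """Approximate 95% interval for the difference between group means."""
    half_width = STD_NORMAL.inv_cdf(0.975) * standard_error(sd, n_per_group)
    return diff - half_width, diff + half_width


# Hypothetical numbers: a 5-unit difference in mean cover, sd of 10 among units.
for n in (4, 20, 100):
    low, high = interval_95(5.0, 10.0, n)
    p = two_sided_p(5.0, 10.0, n)
    print(f"n={n:3d}  P={p:.3f}  95% interval=({low:.1f}, {high:.1f})")
```

With the smallest sample the interval is wide and spans zero, so the non-significant result reflects limited detection capacity rather than evidence of no effect. Real analyses would normally use a t-based calculation and the model appropriate to the design.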

Describing results in biological (rather than statistical) terms, which are linked to the original questions, has the added benefit of making us think more about effects that matter for the ecological question. Ideally, that thinking happens during study planning, rather than when crafting discussions after data analysis.

4.4 Changing (statistical) landscapes for ecologists

Irrespective of whether they use frequentist or Bayesian inference, modern ecologists have access to a much wider range of statistical tools than the traditional normal-based linear regression and ‘ANOVA’ models, and their generalised extensions to include binomial and Poisson distributions, that dominated previously. These include the widespread use of mixed effects models, along with species distribution models, generalised additive models that relax the requirement for models to be linear, boosted regression trees, etc. Of these, we will focus here on mixed effects models, since they are a widespread example of where new methods provide new pitfalls for the unwary. This broadening of approaches has been accompanied by a shift in statistical software, from a range of commercial packages to widespread use of R, which has benefits and challenges.

Mixed effects models have become popular among ecologists in part because they can be applied to some common experimental and sampling designs that rely on techniques like blocking, partial nesting and repeated measurements to reduce noise and make more effective use of scarce resources. These designs have been used widely in field experiments, but in the past they have been commonly analysed by shoehorning data into statistical models that assume normality, homogeneity of variance and independence of errors (e.g. ANOVA). The complication is that these designs have observations ‘clustered’ into experimental units such as subjects, plots, etc., and observations within clusters are more likely to be correlated than those in different clusters. Mixed effects models allow errors to be correlated, rather than be random and independent (Gelman et al. 2020).
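To make the idea of clustered observations concrete, the following sketch simulates two observations from each of many plots, where each plot contributes a shared effect plus independent noise. The variance values, the number of plots and the function names are illustrative assumptions. Under this simple model, the expected correlation between two observations from the same plot is the plot variance divided by the total variance.

```python
import random
from math import sqrt


def pearson(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def simulate_plots(n_plots, sd_plot, sd_resid, seed=1):
    """Two observations per plot: a shared plot effect plus independent noise."""
    rng = random.Random(seed)
    first, second = [], []
    for _ in range(n_plots):
        plot_effect = rng.gauss(0.0, sd_plot)
        first.append(plot_effect + rng.gauss(0.0, sd_resid))
        second.append(plot_effect + rng.gauss(0.0, sd_resid))
    return first, second


first, second = simulate_plots(n_plots=2000, sd_plot=1.0, sd_resid=1.0)

# Correlation between the two observations taken in the same plot.
same_plot = pearson(first, second)
# Pair each plot's first observation with the NEXT plot's second observation.
other_plot = pearson(first, second[1:] + second[:1])
# Plot variance as a proportion of total variance.
expected = 1.0 ** 2 / (1.0 ** 2 + 1.0 ** 2)

print(f"same plot r = {same_plot:.2f} (expected about {expected:.2f})")
print(f"different plots r = {other_plot:.2f} (expected about 0)")
```

A model that assumes independent errors treats both kinds of pair as if they were uncorrelated, which is the mismatch that a random plot effect is meant to address.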

When we try to generalise ecological results by extending our own studies to other places and times, perform coordinated large-scale experimental replication and conduct meta-analyses, we are asking whether ecological interactions vary. Mixed effects models are appropriate for these tasks.

The shift to mixed effects models can be liberating: depending on the software package used, they can accommodate a range of correlation patterns within the data, allowing us to model the data more precisely (Ives 2022). They rely on the classification of predictor variables as fixed or random (see Box 4.1). In a single experimental design, a random effect typically involves plots, times or subject organisms, while in coordinated studies, it is the individual site-specific data collection, and in a meta-analysis, the individual studies are seen as coming from a larger population of potential ecological studies, from each of which an ecological process is measured. Misclassification of effects can be an important source of error, usually through treating random effects as fixed, treating them inconsistently within single studies, or inappropriately omitting random effects from statistical models.

Box 4.1 Fixed versus random effects

When we build a statistical model for ecological data, one important decision is whether each predictor in the model is fixed or random. It is important because it influences the structure of the model and how effects are assessed. It also decides how we interpret the analysis results. Fixed versus random has been a subject for discussion in the statistical literature for some time, with no clear resolution. For ecologists, Bolker (2015) provided a good summary table of criteria, illustrated here for a categorical predictor.

An effect is fixed where:

The groups we use are all those of interest.
We are interested in comparing the effects of some of these specific groups.
If we did the study again, we would choose the same groups.
Some authors calculate the ratio of groups used (p) to groups available (P) and use it to determine the form of the statistical model, with p/P → 1 for fixed and p/P → 0 for random.

A random effect is where:

The groups have been chosen from some larger population of potential levels.
The individual groups are not of much interest, but their variance might be.
We have some interest in generalising from the groups used to the larger population.
If we did the study again, it is unlikely that the same groups would be selected.

The idea of random effects allowing extrapolation, but inference from fixed effects being restricted, matters when trying to generate more general or more complex ecological understanding. If we document a particular set of processes in one study, our next step might be to see if those processes act similarly in other contexts. This might take the form of asking if the processes vary in different specific contexts or wanting to estimate the variance in ecological processes across space or time.

Statistically, any interaction that includes a random predictor is also a random effect.

4.4.1 It can be complicated

A specific predictor might be considered fixed or random depending on the aims of the study and the hypotheses of interest (e.g. Gelman and Hill 2006; West et al. 2015), and Quinn and Keough (2023) provide a more extended discussion (their Chapter 10). For the example in Box 4.2, we could view sites as fixed or random, depending on how we wanted to use the results. This can be the cause of much confusion.

If we include time (e.g. months or years) as a factor, we might wish to generalise to other times than the ones we use. However, it is hard to define a statistical population of times from which we could randomly draw specific values for our study. It might be possible for organisms with very short life spans, but hard to see choosing years randomly for long-lived organisms and still being funded or awarded a degree. The other problem with time is that even if we estimate temporal variances at the end of our study, the statistical population of times can only include past times. Generalising to the future requires many more assumptions than just in our statistical analysis.

4.4.2 A chance to make mistakes

The classification of an effect as fixed or random can alter how we fit and interpret statistical models, and random effects counterintuitively alter how we assess fixed ones. Redesignating a random effect as fixed typically results in some other fixed effects being assessed with more degrees of freedom. Only when we are clear how particular effects of interest will be evaluated can we make sensible decisions about where to allocate resources in our design.
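The change in degrees of freedom can be seen in a minimal sketch for the classical balanced two-factor design, with a fixed treatment factor crossed with sites and several replicates per combination. This is textbook ANOVA bookkeeping rather than what any particular mixed model package reports, and the function name and design sizes are hypothetical. When sites are fixed, the treatment effect is tested against the residual; when sites are random, it is tested against the treatment × site interaction.

```python
def treatment_test_df(n_treatments, n_sites, n_reps, site_is_random):
    """Numerator and denominator df for the F test of the fixed treatment effect.

    Classical balanced design: treatments crossed with sites, n_reps per cell.
    """
    numerator_df = n_treatments - 1
    if site_is_random:
        # Treatment is assessed against the treatment x site interaction.
        denominator_df = (n_treatments - 1) * (n_sites - 1)
    else:
        # Treatment is assessed against the residual (within-cell) variation.
        denominator_df = n_treatments * n_sites * (n_reps - 1)
    return numerator_df, denominator_df


# Hypothetical design: 3 treatments, 4 sites, 5 replicates per combination.
print(treatment_test_df(3, 4, 5, site_is_random=False))  # (2, 48)
print(treatment_test_df(3, 4, 5, site_is_random=True))   # (2, 6)
```

In this sketch, adding replicates within sites does nothing for the test of treatments once sites are random; only adding sites helps, which is why the classification needs to be settled before resources are allocated.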

There are two temptations to resist during analysis and interpretation; both involve flipping predictors between fixed and random. The first is to complete the study using fixed effects models but then generalise the results as if we had random effects. Our sense from the literature is that interpreting fixed effects as if they were random is common, but the reverse is rare.

A second situation is where a designed random effect is analysed as fixed. There are rare circumstances when this may be appropriate, largely when unexpected events disrupted data collection (Quinn and Keough 2023). Our impression is that changes at the analysis stage happen either when there were design errors and it is realised that tests for specific effects are weak or, most worryingly, as a form of P-hacking that produces a ‘statistically significant’ effect that was absent when a model with the right combination of fixed and random effects was used. Getting it right matters, because it can affect the inferences that we draw from our results.

Such reclassification of predictors is inappropriate because the interpretation is not consistent with the model fitting. It is sometimes done deliberately, we suspect, as a way of ‘selling’ the results to a broader audience or higher impact journal, on the basis that the results are broadly applicable to other situations. The first temptation should be resisted but, where it is not, it should be dealt with by vigilant referees and readers. The second is more worrying. Preregistration of experimental designs and statistical models is the best way to prevent this behaviour.

As an example of some of the issues around using linear mixed models, consider a standard repeated measures design, such as we might use to follow recovery from disturbance (Box 4.2), with areas sampled over time through their recovery. The observations in these data are not independent, because there are ‘sets’ derived from physical plots. We might expect that observations from the same plot would be more similar than those from plots further away and that observations closer in time would be more similar than those further apart. A ‘traditional’ approach to analysing this design requires quite restrictive assumptions about the patterns of covariance in the data, the compound symmetry or sphericity assumption familiar to many (see Quinn and Keough 2023).

Box 4.2 A complex ecological design, its rationale and implications for analysis

We’ll use the example of a field experiment on rocky shores to show the practical decisions that need to be made when designing a study and some sources of error in analysis and interpretation.

The research question was how an ecological ‘engineer’, the habitat-forming alga Hormosira banksii, recovers from physical disturbances varying in intensity and spatial pattern (Wootton and Keough 2016). Intensity was three levels of biomass removal, removed in a uniform, tightly clumped or random fashion, giving nine treatment combinations as a 3 × 3 factorial design. Two additional treatments were ‘benchmarks’ (or ‘controls’), no additional disturbance and complete canopy removal. Recovery was measured at five times.

Design issues

1. Resources

So far, we have a design with 11 treatments and five times, applied to 60 × 40 cm experimental units. For a completely randomised design, we would need 55n units on the shore, each of which would need to be marked (by drilling into the rock). In this study, we chose n = 4. This would take lots of space, time and resources. An alternative design would be to measure recovery through time in the same experimental units, so each unit is recorded five times, and we only need 11n units.
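The arithmetic behind this choice is short enough to write out. The sketch below uses the numbers given above (11 treatments, five times, n = 4); the variable names are just labels for this illustration.

```python
n_treatments = 11   # 3 x 3 factorial plus two benchmark treatments
n_times = 5         # recovery measured at five times
n_reps = 4          # replicate units per treatment (n)

# Completely randomised design: a fresh set of units for every sampling time.
units_completely_randomised = n_treatments * n_times * n_reps

# Repeated measures design: the same units are recorded at every time.
units_repeated_measures = n_treatments * n_reps

print(units_completely_randomised)  # 220
print(units_repeated_measures)      # 44
```

The repeated measures design needs one fifth of the marked units, at the cost of observations within a unit no longer being independent.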

2. Noise reduction

Rocky shores are heterogeneous and spreading experimental units randomly across a wide area could result in lots of variation. To account for this noise, the experimental units were placed into four blocks of 11 units, with the blocks spread across the shore, but units within each block much closer together. This decision does not save resources but is intended to improve our ability to detect disturbance signals.

3. Generalising our results

We, and coastal managers, would like to know if we can generalise results from this kind of experiment, so this design was repeated at another shore 10 km away.

Statistical analysis issues

1. Correlated observations

There are two ways in which observations of canopy cover may be correlated with each other. We have time series of five values, and an experimental unit in one of our treatments is physically closer to units from different treatments in the same block than to other units of the same treatment.

2. Fixed and random effects

Some effects are clearly fixed – we established the treatments and chose when to sample recovery. Other effects are clearly random – the blocks were spread haphazardly across the shore, and the experimental units within each block were assigned randomly to treatments. We used two sites (shores separated by sandy stretches and a few kilometres apart) to assess generality. They could be considered fixed or random effects – they were chosen from a ‘population’ of geologically similar rocky shores, so they could be random, but with only two, it is not clear what that larger population might be, so they might better be considered fixed (Bolker 2015).

Where could we go wrong?

1. Forget about the repeated measurements and fit a model for a completely randomised design. There are problems here:

Our overall tests of the disturbance treatments have roughly five times more degrees of freedom than there are independent replicate units (i.e. we have one of the forms of pseudoreplication).
The assumption of independence of errors will be violated, and the consequences are hard to assess.

2. Omit blocks from our model

Degrees of freedom assigned to blocks are now pooled into the residual.
We should still be wary about the assumption of independence.
If the spatial heterogeneity at the scale of blocks matters, we now have more background noise and less sensitivity.

3. Analyse shores as a fixed effect but discuss it as a random effect.
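The first two mistakes can be made concrete by counting degrees of freedom for a single shore. The sketch below uses classical balanced ANOVA bookkeeping with the design numbers from this box (11 treatments, four blocks, five times); it is an illustration of the counting, not a reproduction of the published analysis.

```python
n_treatments = 11
n_blocks = 4        # one unit of each treatment per block
n_times = 5

n_units = n_treatments * n_blocks        # 44 physical units
n_observations = n_units * n_times       # 220 recorded values

# Mistake 1: treat all 220 values as independent in a treatment x time model.
residual_df_ignoring_repeats = n_observations - n_treatments * n_times

# Mistake 2: keep units as the replicates for treatments, but drop blocks.
unit_df_without_blocks = n_units - n_treatments

# Blocked design: treatments are assessed against the treatment x block term.
unit_df_with_blocks = (n_treatments - 1) * (n_blocks - 1)

print(residual_df_ignoring_repeats)  # 165
print(unit_df_without_blocks)        # 33
print(unit_df_with_blocks)           # 30
```

Ignoring the repeated measurements gives 165 denominator degrees of freedom where the independent units support about 33, the roughly five-fold inflation noted above. Dropping blocks moves their three degrees of freedom into the error term along with whatever variation the blocks were absorbing.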

Mixed effects models based on maximum likelihood estimation require us to be explicit about fixed and random effects, are more robust for complex or unbalanced designs, and allow more flexibility in data analysis, but in their simplest form they rely on response variables that are normal or can be transformed to approximate normality.

Like linear fixed effects models, they can be expanded to consider biological responses that do not follow normal distributions as generalised linear mixed models (GLMMs) and relationships that are not linear as generalised additive mixed models (GAMMs). While these approaches broaden our horizons, they are not a panacea.

Somewhat cynically, a bigger toolbox allows the unthinking user more chances to use an inappropriate tool and perhaps get the wrong answer. GLMMs and GAMMs are inherently more complex and approaches to analysis and reporting practices are less standardised than in the more familiar ‘make normal and fit an ordinary least squares model’ approach. There is also the perception that the ability to deal with correlated data can solve the problem of some forms of pseudoreplication, but this is not the case – the same logical problems exist (Arnqvist 2020).

Along with the broadening of statistical methods, there has been a striking shift to R as the software package of choice. The R user community provides a bewildering diversity of packages, and it is not uncommon to come across multiple packages or ways to fit the same statistical model. The amount of modelling guidance varies with packages, but we suspect many users simply google their design to find appropriate R code. This approach should be a major positive step – anything that encourages us as ecologists to think more about how we fit statistical models to our data is an improvement over ‘cookie-cutter’ approaches. However, just using someone else’s code could also be seen as using a cookie-cutter approach, albeit a more unusually shaped one. In the absence of the validation and well-documented examples associated with standard analyses, it is critical that we examine code carefully and check that it is specifying the model we believe it to be.

4.5 There are some ‘new’ challenges

There are some emerging aspects of statistical analysis that challenge ecologists. We see the increased attention on the nature of ecological signals – effects – rather than merely significance as a positive step. These effects can be complex in many cases because we know that most ecological processes do not act independently of each other (i.e. interactions are important, and many are modulated by external environmental factors). If we cater to this complexity, the result can be designs with multiple predictors and combinations of fixed and random predictors, as in the example of Box 4.2. With this complexity, predicting and measuring effects become difficult.

Predicting effects should be an important part of study planning. We try to make sure we collect enough data to answer questions or distinguish between competing models. The framework used most for doing this is statistical power analysis. It will be familiar to most ecologists as a technique linked to hypothesis testing approaches, particularly specification of decision-error rates, but there are also Bayesian approaches. In principle, it is a very useful concept, but in practice it is difficult to apply except to the simplest of cases. Using power analysis to determine an appropriate sample size requires several things – specifying the exact statistical model that we will fit (good practice), understanding the background noise in the data (desirable) and identifying the kind of effect it is important to detect (crucial). When we have interactions, especially involving categorical predictors with many categories, identifying this effect gets difficult. As discussed earlier (in Chapters 1 and 3), our understanding of many ecosystems is not at a stage that allows precise predictions, and complex interactions can take many forms. These different forms can vary widely in their detectability, so the amount of sampling required depends on the form of that interaction. It is telling that most guides to power analysis focus on simple effects, rather than interactions! Specifying a meaningful effect size is a task that most biologists, not just ecologists, find very difficult. However, it is necessary for meaningful study planning. It could be argued that doing a study without this step is still a planning decision, albeit to detect any effect larger than our sample size and noise allow! We should also be thinking about whether effects are ecologically meaningful as part of interpreting analyses and crafting discussions, anyway, and we argue strongly for moving that thinking to the planning stage.
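For the simplest of cases, the three ingredients (model, noise, effect) translate directly into a calculation. The following is a minimal sketch for comparing two group means, assuming a known common standard deviation and a normal approximation; the function names and the numbers are hypothetical. It deliberately covers only the easy case, not the interactions discussed above.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def power_two_groups(effect, sd, n_per_group, alpha=0.05):
    """Approximate power of a two-sided test for a difference between two means."""
    z_crit = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    # Expected value of the test statistic if the true difference equals `effect`.
    shift = effect / (sd * sqrt(2.0 / n_per_group))
    upper_tail = 1.0 - STD_NORMAL.cdf(z_crit - shift)
    lower_tail = STD_NORMAL.cdf(-z_crit - shift)
    return upper_tail + lower_tail


def n_for_power(effect, sd, target=0.8, alpha=0.05, n_max=100000):
    """Smallest n per group reaching the target power, or None if n_max is too small."""
    for n in range(2, n_max + 1):
        if power_two_groups(effect, sd, n, alpha) >= target:
            return n
    return None


# Hypothetical planning values: sd of 10 among units, two candidate effects.
print(n_for_power(effect=10.0, sd=10.0))  # effect equal to one sd
print(n_for_power(effect=5.0, sd=10.0))   # effect equal to half an sd
```

Halving the effect we decide is worth detecting roughly quadruples the replication needed, which is why specifying the effect is the crucial step. A t-based calculation would give slightly larger sample sizes than this approximation.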

There are multiple, often software-specific, methods for obtaining ML/REML estimates of random effects, and providing a measure of uncertainty around them, and they can produce different answers (Box 4.3; Bolker 2015). Estimating random interaction effects is challenging, even though they might be effects of most interest (e.g. random slopes in multi-level regression models) and there is no simple recommendation for best practice in interpreting the effects. For example, our starting point in the case study of Box 4.2 could be the interaction between disturbance pattern and intensity (a fixed interaction), and whether that interaction varies between sites (a fixed or random effect, depending on how we frame the question). Technically, it is the interaction between these factors and time that assesses the recovery pathway from disturbance. After assessing these effects, we might move to site × intensity and type × pattern interactions.

Box 4.3 Estimating random effects – some nuts and bolts

Linear models are fitted in various ways, most commonly ordinary least squares (OLS) and maximum likelihood (ML), plus its variant restricted maximum likelihood (REML).

Random effects are usually defined in terms of the amount of variation that can be attributed statistically to them. OLS has been used in the past, but it does not produce reliable estimates or confidence intervals for random effects when sample sizes vary across the design, as is often the case with ecological data. ML and REML are preferred. Within a mixed model, ML is best for estimating fixed effects, but REML is more reliable for random effects. This can make the analysis stage messy when fixed and random effects are both of interest.
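The difference between ML and REML can be seen in the simplest possible case, a model with only an overall mean. There, the ML estimate of the variance divides the sum of squares by the number of observations, while REML divides by one fewer to allow for the mean having been estimated. The sketch below uses hypothetical data and says nothing about how any particular package fits a mixed model; it shows only the direction of the difference.

```python
from statistics import mean


def variance_estimates(values):
    """Return (ML, REML) variance estimates for a model with only an overall mean."""
    n = len(values)
    centre = mean(values)
    sum_of_squares = sum((v - centre) ** 2 for v in values)
    ml = sum_of_squares / n          # maximum likelihood: divide by n
    reml = sum_of_squares / (n - 1)  # REML: allow for the one estimated mean
    return ml, reml


# Hypothetical percentage cover values from four experimental units.
cover = [42.0, 55.0, 48.0, 61.0]
ml_var, reml_var = variance_estimates(cover)
print(f"ML variance = {ml_var:.2f}, REML variance = {reml_var:.2f}")
```

The ML estimate is always the smaller of the two here, and the gap is largest when there are few observations relative to the number of fixed parameters, a common situation for random effects with only a handful of levels.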

Added complications for ecologists include:

The terminology for mixed models can be confusing, including the formal model equations.
Different software, and even different packages within R, use different methods for fitting mixed effects models and obtaining ML/REML estimates of their fixed and random effects (e.g. penalised quasi-likelihood, Laplace approximation etc.: see West et al. 2015; Bolker 2015). These methods can result in different parameter estimates and confidence intervals, especially for complex models with interactions involving random effects.
The ability to incorporate different covariance structures varies across packages, as does the interpretability of what are often confusing warning messages.

The other issue to emerge recently with implications for ecologists is the replication crisis, originating in psychology, but spreading through scientific disciplines, and possibly to ecology (Fraser et al. 2020). At its core has been the inability to replicate influential results when studies are repeated, which incorporates several aspects already touched on here, including selective data analysis to produce ‘statistically significant’ results and failure to design studies to resolve questions unambiguously. Sadly, it is also linked to apparent fraud (see https://retractionwatch.com/). For ecologists, the idea of replication is more complex and scale-dependent (Shavit and Ellison 2017). It can mean creating multiple instances of the same conditions within a study (e.g. experimental treatments), so we can be confident about identifying effects and excluding noise as an explanation. It can also mean repeating studies to compare conclusions.

Replicating a study, in the sense of repeating it exactly, is not possible in ecology, because another study, even if done in the same laboratory or field site, will be under different background environmental conditions, at a different time and with organisms that were likely different genetically from those used in the earlier study. We suspect that most ecologists would not expect to get the same result. In many ways, it may be more interesting if a strong ecological effect in one study disappears or reverses in another, because that suggests that the ecological effect is context dependent. This topic is explored in greater depth in Chapter 5 (see Catford et al. 2022).

A more informative way to view the repeating of a study, using the same methods, is that it helps us generalise ecological ideas by determining how consistent they are across space and time – either testing context dependence or estimating the variation in ecological effects, as in the example of Box 4.2. We need to be conscious of two design aspects – replicating a study done at one place and time sufficiently to characterise the ecological effects in question, and then repeating the study enough times and places to be able to measure the variance of these effects or doing the study under different fixed conditions to measure the dependence of the ecological processes on other factors. Filazzola and Cahill (2021) explore some of these issues in more detail, and there are emerging protocols for large-scale coordinated repetition of experiments to measure context dependence (Borer et al. 2014).

4.6 A field guide to getting it wrong

We have so far outlined several ways that we might go astray.

How common are errors and poor statistical practice in ecology? It is hard to estimate the rates reliably. The best-documented example is the set of issues introduced by Hurlbert (1984): pseudoreplication, confounding and inappropriate replication. One follow-up study a decade later (Heffner et al. 1996) found that little had changed, and Arnqvist (2020) cites several examples elsewhere in biology that suggest the issue is still common.

Our impression is that other problems raised earlier, such as confusing fixed and random effects and the persistence of ‘statistical significance’ rather than focusing on ecological effects, are common. The extent of these problems is hard to assess, because the worst examples we encounter tend to be at the review stage, and papers with substantial flaws may vanish. Whether or not we can quantify the flaws, each inconclusive study is a waste of resources, and when the flaws preclude publication, there is a cost to individual ecologists as well. Our goal is to identify the risks and avoid them.

A way to structure this risk avoidance is to think of the examination of an ecological idea as formulating a series of increasingly specific models. An ecological explanation or hypothesis is a simplified explanation of an ecological phenomenon. This initial model is often a qualitative description (e.g. growth of estuarine cordgrass is affected by salinity (S) and insect herbivory (H)). We’d like to know if this model is plausible, and the best way to find out is to confront it with data (Hilborn and Mangel 1997). There are various ways to do it (Chapter 3), but the most convincing is to do an experiment – create a scenario in which salinity and herbivory vary. A mismatch between expectations and data would expose our ecological model’s shortcomings.

The comparison of our ecological model to data cannot be done directly, and we need to translate the ecological model into a statistical one, which we can fit to the data. In this example, one model is G = μ + S + H + HS + ε.

This model will be familiar to many. It is a linear model, relating cordgrass growth to an overall population average growth (μ), plus independent effects of salinity and herbivory, plus a combined effect of salinity and herbivory, plus background ‘noise’ represented by the error terms (ε). For our statistical analysis, we want to know how well this model fits the experimental data and whether it fits better than models lacking HS, HS and H, HS and S, or HS, H and S.
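
As a minimal sketch of what comparing these models involves, the Python below computes the residual sum of squares for each candidate model. The growth values, treatment labels and the 2 x 2 layout with three plots per combination are invented purely for illustration and are not data from any study. The shortcut of building fitted values from marginal and cell means is only valid because the hypothetical design is balanced; unbalanced data would need a proper least-squares fit.

```python
from statistics import mean

# Hypothetical growth values (arbitrary units), invented for illustration:
# 2 herbivory levels x 2 salinity levels x 3 plots per combination.
growth = {
    ("grazed", "low"): [4.1, 3.8, 4.4],
    ("grazed", "high"): [2.9, 3.1, 2.6],
    ("ungrazed", "low"): [6.2, 5.9, 6.5],
    ("ungrazed", "high"): [3.3, 3.0, 3.6],
}

h_levels = sorted({h for h, _ in growth})
s_levels = sorted({s for _, s in growth})

# Grand mean, marginal means for each factor, and cell means.
grand = mean(y for ys in growth.values() for y in ys)
h_mean = {h: mean(y for s in s_levels for y in growth[(h, s)]) for h in h_levels}
s_mean = {s: mean(y for h in h_levels for y in growth[(h, s)]) for s in s_levels}
cell_mean = {cell: mean(ys) for cell, ys in growth.items()}

# Fitted value under each candidate model (balanced design only).
models = {
    "mu": lambda h, s: grand,
    "mu + H": lambda h, s: h_mean[h],
    "mu + S": lambda h, s: s_mean[s],
    "mu + H + S": lambda h, s: h_mean[h] + s_mean[s] - grand,
    "mu + H + S + HS": lambda h, s: cell_mean[(h, s)],
}


def residual_ss(fit):
    """Sum of squared differences between observations and fitted values."""
    return sum((y - fit(h, s)) ** 2 for (h, s), ys in growth.items() for y in ys)


rss = {name: residual_ss(fit) for name, fit in models.items()}
for name, value in rss.items():
    print(f"{name:16s} residual SS = {value:.3f}")
```

Adding a term can never increase the residual sum of squares, so the interesting question is whether the reduction is large relative to the remaining noise, which is what the F-ratios or information criteria used in practice assess.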

Our goal is to use the result of this model fitting to make confident conclusions about our ecological model, and we need several conditions to be satisfied. These conditions include the following, with each item closing on the potential error:

In our data, we should not be able to identify competing explanations that could produce the same pattern. In the cordgrass example, creating our experimental treatments should have only involved changing salinity and herbivory, and not introducing artefacts associated with how we made these changes (i.e. we have appropriate controls). If there are multiple explanations, your ecological model is actually a group of models that cannot be separated.
The statistical model must match the biological one. Your analysis might give you a clear answer, but to a different question, one that is not relevant to your ecological idea.
The statistical model must match the structure of your data. In most cases, problems happen when additional, often random, predictors should have been in the model. In the cordgrass example, we might model the growth of individual leaves without explicitly including the plant from which they arose and the experimental plot in which they grew. Your analysis probably appears more sensitive than it really is.
All statistical models have assumptions, often about the background noise. For example, the linear model above assumes that the ε values (error terms for individual observations) are independent, follow a specified distribution (normal, Poisson etc.), and have a specific pattern of variation (e.g. equal variances among groups). If these assumptions are not met, estimates of effects may be unreliable, along with conclusions about how well different models fit the data.
We should be confident about the conclusions (i.e. we have collected enough data to distinguish between different models and/or estimate the effects confidently). If we have not, rather than confident conclusions (e.g. about the effect of herbivory), we have a third outcome – we cannot tell. This third outcome might be expressed as very wide intervals around estimates of effects or low power for tests of specific effects. Both represent wasted resources.

Missteps at any of these points mean that we cannot link the statistical model back to the original ecological model or idea – we have not challenged that idea strongly, or severely, in Mayo’s (2018) approach.

There are two areas of high risk in this process. We have already touched on them, and they are related – misunderstanding replication and mismatch of statistical models to data structure (and ecological models).

Replication within studies is fundamental to statistical analysis – observing ecological responses under repeated instances of the same set of predictor variables is how we estimate the amount of noise that is unrelated to the predictors and estimate ecological effects confidently. Knowing this noise allows us to interpret patterns in the data when values of the predictors change (e.g. in an experiment). Estimates of background noise need to be representative of the noise in the overall (statistical) population and estimated confidently enough to allow good separation of signal and noise. The repeated instances should be independent occurrences of the same situation – multiple experimental or sampling units.

Multiple observations within the same experimental or sampling unit are not independent. In an experiment, they can tell us about the noise within that experimental unit (plot, cage, pot, etc.), but they tell us nothing about the variation between different units. Observations that are drawn from within individual sampling or experimental units (i.e. subsamples) are the source of one of the best-known and most persistent design and analysis flaws, pseudoreplication (Hurlbert 1984). At its most egregious, inappropriately using subsamples as true experimental or sampling units in an analysis can dramatically shrink confidence intervals around effects and greatly and erroneously increase the chance of gaining ‘statistically significant’ results from hypothesis tests. We can look at this issue in three ways using the framework we’ve outlined in Box 4.4, but the message is clear – the link between statistical model and ecological model is badly broken.
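
The inflation of ‘significant’ results can be illustrated with a small simulation. The sketch below is a hypothetical, simplified design and not the cordgrass experiment itself: two treatments, four plots per treatment, ten plants per plot, no true treatment effect, and plot-to-plot and plant-to-plant standard deviations both set to 1. All of these settings, and the function names, are assumptions chosen for illustration. It compares a t-test that treats every plant as a replicate with one that uses plot means.

```python
import random
from statistics import mean, variance

random.seed(1)

PLOTS_PER_TREATMENT = 4
PLANTS_PER_PLOT = 10
PLOT_SD = 1.0    # assumed variation among plots
PLANT_SD = 1.0   # assumed variation among plants within a plot
T_CRIT_PLANTS = 1.991  # two-sided 5% critical value of t with 78 df
T_CRIT_PLOTS = 2.447   # two-sided 5% critical value of t with 6 df


def simulate_treatment():
    """Return a list of plots, each a list of plant growth values (no treatment effect)."""
    plots = []
    for _ in range(PLOTS_PER_TREATMENT):
        plot_effect = random.gauss(0.0, PLOT_SD)
        plots.append([plot_effect + random.gauss(0.0, PLANT_SD)
                      for _ in range(PLANTS_PER_PLOT)])
    return plots


def pooled_t(a, b):
    """Two-sample t statistic using the pooled variance."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / (pooled * (1 / na + 1 / nb)) ** 0.5


def false_positive_rates(n_sims=2000):
    """Proportion of simulated null experiments declared 'significant' by each analysis."""
    plant_hits = plot_hits = 0
    for _ in range(n_sims):
        control, treated = simulate_treatment(), simulate_treatment()
        # Inappropriate: every plant treated as an independent replicate.
        plants_c = [y for plot in control for y in plot]
        plants_t = [y for plot in treated for y in plot]
        plant_hits += abs(pooled_t(plants_c, plants_t)) > T_CRIT_PLANTS
        # Appropriate: one value (the mean) per plot.
        means_c = [mean(plot) for plot in control]
        means_t = [mean(plot) for plot in treated]
        plot_hits += abs(pooled_t(means_c, means_t)) > T_CRIT_PLOTS
    return plant_hits / n_sims, plot_hits / n_sims


plant_rate, plot_rate = false_positive_rates()
print(f"plants as replicates:     {plant_rate:.2f}")
print(f"plot means as replicates: {plot_rate:.2f}")
```

Under these assumptions, the plant-level analysis should declare an effect far more often than the nominal 5%, whereas the plot-level analysis should stay close to it. The size of the gap depends on how large the among-plot variation is relative to the within-plot variation.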

Box 4.4. Ways of looking at pseudoreplication, the problem that won’t go away

We will stay with the cordgrass example. Let us suppose, in the worst case, that insect herbivory could only be manipulated in large plots, and we set up our experiment with eight large plots, each with one combination of two herbivory and four salinity levels. Within each plot, we measure the growth of 20 plants. Everything is normal, so we fit a linear model and generate a familiar analysis of variance table. We will illustrate the issues using F-ratios to test hypotheses. Let us pretend for now that the experimental unit is the plant rather than the plot.

Source of variation	Degrees of freedom	Denominator for test
Herbivory	1	Residual
Salinity	3	Residual
Herbivory x salinity	3	Residual
Residual (error)	152

This might fill us with joy, as, for example, an F-test with 3, 152 degrees of freedom is powerful. This joy is unwarranted, and may even turn to despair with appropriate analysis.

Here is why the above analysis is wrong:

1. Clear thinking about replication means that the ‘treatment’ was to exclude herbivores (and vary salinity) at a plot scale, and we have only done this once for each combination. This experiment is unreplicated. Stop, do not pass Go, and regret all those plants tagged and measured unnecessarily. If desperate, we could make crude tests about the simple (independent) effects of herbivory and salinity using mean values for each plot.

2. Our ‘replicate’ observations share two features, the same combination of herbivory and salinity, and whatever other things associated with that specific plot – the things we did to alter herbivory and salinity, the ecological history of that plot, its sediment characteristics, tidal elevation, etc. We might perform the analysis, but we cannot separate the effects of herbivory and salinity from these other things. Our explanations are confounded, and several ecological models could explain the data.

3. If we want to include all of the observations in the analysis, we need a more complex model that includes plots: G = μ + H + S + HS + P(HS) + ε. We could fit that model to the data:

Source of variation	Degrees of freedom	Denominator for test
Herbivory	1	Plots
Salinity	3	Plots
Herbivory x salinity	3	Plots
Plots within HS combinations	0	Residual
Residual (error) - variation within plots	152

Looking at this model makes the problem stark. With n = 1 plot per treatment combination, the degrees of freedom for plots, 4 x 2 x (n - 1), are 0. We have no test for the effects of herbivory and salinity, unless we are willing to assume that the small-scale variation within plots is a good proxy for larger scale variation (see Chapter 5). Our original statistical model does not match the data structure, and when we use the right model, it breaks down. Harking back to earlier discussions, plots are a random effect that must be included in the statistical model.
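
The degrees of freedom in the two tables of Box 4.4 can be checked with a few lines of arithmetic. The sketch below is in Python, and the function name and arguments are ours, introduced only for illustration. It reproduces the values in the box and shows what would change if, hypothetically, there had been three plots per treatment combination.

```python
def anova_df(h_levels, s_levels, plots_per_combo, plants_per_plot):
    """Degrees of freedom for a herbivory x salinity design with plants nested in plots."""
    combos = h_levels * s_levels
    n_plots = combos * plots_per_combo
    n_plants = n_plots * plants_per_plot
    return {
        "herbivory": h_levels - 1,
        "salinity": s_levels - 1,
        "herbivory x salinity": (h_levels - 1) * (s_levels - 1),
        # Appropriate denominator for the treatment tests.
        "plots within HS": combos * (plots_per_combo - 1),
        "residual within plots": n_plots * (plants_per_plot - 1),
        # What the residual becomes if plants are wrongly treated as the units.
        "naive residual": n_plants - combos,
    }


box_design = anova_df(h_levels=2, s_levels=4, plots_per_combo=1, plants_per_plot=20)
replicated = anova_df(h_levels=2, s_levels=4, plots_per_combo=3, plants_per_plot=20)

for label, design in (("one plot per combination", box_design),
                      ("three plots per combination", replicated)):
    print(label)
    for source, df in design.items():
        print(f"  {source:24s} {df}")
```

With one plot per combination, the naive residual and the within-plot residual are the same 152 degrees of freedom, which is why the first table looks so reassuring while the plot term that should be the denominator has none.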

The example of Box 4.4 introduces the other main area for making mistakes – mixing up fixed and random effects and what we do with random effects included to reduce noise. Including random effects in a fixed effects model changes how we estimate and test the fixed effects. For example, in a two-factor design, the effect of the fixed factor is tested against the residual if the other factor is fixed, but against the interaction, with fewer df, if the other factor is random. Getting the effects classified correctly is an essential part of fitting the statistical model and being sure it gives the answers of interest, rather than answers to different questions. Our impression is that random effects are more often misclassified as fixed than vice versa, and the net effect is usually to increase the likelihood of a ‘statistically significant’ effect when testing hypotheses or to produce inappropriately narrow confidence intervals around effects.
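
To make the cost of misclassification concrete, the sketch below tabulates the denominator for the test of a fixed factor A in a balanced, fully crossed two-factor design. The function name and the design sizes (three levels of A, four of B, five replicate units per combination) are hypothetical, chosen only to illustrate the rule described above.

```python
def f_test_for_fixed_factor(a, b, n, other_factor_random):
    """Numerator and denominator df for testing fixed factor A in a crossed A x B design.

    a, b: numbers of levels of A and B; n: replicate units per combination.
    Assumes a balanced, fully crossed design.
    """
    if other_factor_random:
        denominator = "A x B interaction"
        denominator_df = (a - 1) * (b - 1)
    else:
        denominator = "residual"
        denominator_df = a * b * (n - 1)
    return {"numerator_df": a - 1,
            "denominator": denominator,
            "denominator_df": denominator_df}


both_fixed = f_test_for_fixed_factor(a=3, b=4, n=5, other_factor_random=False)
b_random = f_test_for_fixed_factor(a=3, b=4, n=5, other_factor_random=True)
print("B fixed: ", both_fixed)
print("B random:", b_random)
```

In this hypothetical design, treating a random factor B as fixed swaps a denominator with 6 degrees of freedom for one with 48, which is the kind of apparent gain in sensitivity that should prompt suspicion.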

Ecologists generally include random effects in data collection, as a tool to reduce noise or as part of assessing the generality of ideas, and when subsampling has been used. Subsampling is used when whole treatment units are too large or expensive to measure, and several subsamples can provide a reliable estimate of a treatment unit’s state. When used for noise reduction, there has been discussion whether a random effect that accounts for very little variation should be omitted from the model and a simpler model fitted to the data. This simpler model could be seen as setting the random effect to zero or pooling it with the residual variation. Opinions differ on this approach (West et al. 2015; Janky 2000). In the example of Box 4.2, if we focus on the treatments that represent the nine combinations of disturbance intensity and pattern and the four blocks, we have three fixed effects (intensity, pattern and intensity × pattern) and three random effects (blocks, blocks × intensity and blocks × pattern), so we are ‘sacrificing’ degrees of freedom to remove medium-grained (Chapter 5) environmental noise. Would we be better off excluding some of the random effects? The answer depends in part on our aims – are we trying to assess which predictors are important or do we want to find the most parsimonious model?

We have three pieces of advice:

If the random predictors were part of the design structure (e.g. the blocks in Box 4.2), we would be reluctant to drop them. If they were measured as an additional variable and not part of the design structure, we could consider dropping them from the model. We would only do that when we are confident that those random effects explain very little variation.
If the random effects are associated with experimental or sampling units from which subsamples were taken, they should never be dropped, because that would effectively treat the subsamples as true units, creating the issues associated with inappropriate replication (Colegrave and Ruxton 2017; Arnqvist 2020).
One valuable step at the analysis stage is to look at the true replication in the data, and if there are subsamples of the replicate units, aggregate them into summary statistics for each replicate unit when preparing data files for formal analysis. This step has the dual benefits of reducing the likelihood of artificially inflating degrees of freedom and making the response variable more likely to match the assumptions of the statistical model (e.g. through the ‘central limit theorem’).
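
The aggregation step in the last piece of advice might look something like the following sketch, which collapses plant-level records to one row per plot before any model is fitted. The records, plot labels and field names are invented for illustration, and the mean is only one possible summary statistic.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical subsample records: (plot, herbivory, salinity, growth of one plant).
records = [
    ("P1", "grazed", "low", 4.0), ("P1", "grazed", "low", 4.4), ("P1", "grazed", "low", 4.2),
    ("P2", "grazed", "low", 3.6), ("P2", "grazed", "low", 3.8), ("P2", "grazed", "low", 4.0),
    ("P3", "ungrazed", "low", 6.0), ("P3", "ungrazed", "low", 6.4), ("P3", "ungrazed", "low", 6.2),
    ("P4", "ungrazed", "low", 5.5), ("P4", "ungrazed", "low", 5.9), ("P4", "ungrazed", "low", 5.7),
]

# Group the subsamples by the true replicate unit (the plot).
by_plot = defaultdict(list)
for plot, herbivory, salinity, growth in records:
    by_plot[(plot, herbivory, salinity)].append(growth)

# One row per plot: this is the data file that goes into the formal analysis.
plot_means = [
    {"plot": plot, "herbivory": herbivory, "salinity": salinity,
     "n_subsamples": len(values), "mean_growth": mean(values)}
    for (plot, herbivory, salinity), values in by_plot.items()
]

for row in plot_means:
    print(row)
```

The twelve plant measurements become four rows, so any analysis of this file cannot count plants as replicates by accident.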

4.7 So what?

We know how to avoid pitfalls in study design and data analysis.

It does not require new approaches, but the application of what are often touted as criteria for strong inference. Somewhat pithily: think clearly, think in advance and think before you analyse.

The first part of thinking clearly is not statistical, but about the logic of enquiry. The ecological question, the phenomenon and its possible explanation(s), must be described clearly, and we need to think clearly and critically about the kind of data we need to distinguish between different models or explanations.

As we move to data collection and statistical analysis, there is a sequence of decisions to be made, and they should be made before data are collected.

That sequence could be framed as a series of questions:

What (ecological) question do you hope to answer with your data?
Describe the structure of your sampling:
- What kind of data would answer the question?
- What are the experimental or sampling units?
- Should other predictors be included to reduce noise?
What kinds of ecological responses will you record?
What randomisation procedures will you use?
What statistical model will you fit to your data? If you’ll be using a linear model of some kind, write it out.
- Are there interactions between predictors?
- Which predictors are fixed effects, and which are random?
What are the assumptions of this model? Are some assumptions more important than others?
How many samples are required to confidently distinguish an ecologically important effect from noise?

These steps should all occur before you start. It can help to treat them as a formal checklist, even if we have not moved to formalised protocols such as the PRISMA scheme for meta-analysis (O’Dea et al. 2021).
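
For the final question in the list, a rough first answer can come from a standard normal-approximation formula for comparing two group means. The sketch below is a simplification under stated assumptions: two treatments, equal variances, a two-sided 5% test and 80% power, with ‘units’ meaning true experimental or sampling units, not subsamples. The function name is ours, and the standard deviation and effect sizes are placeholders that would have to come from pilot data or earlier studies.

```python
from math import ceil
from statistics import NormalDist


def units_per_group(effect, sd, alpha=0.05, power=0.80):
    """Approximate replicate units per treatment to detect a difference in means.

    effect: smallest ecologically important difference between treatment means.
    sd: standard deviation among replicate units within a treatment.
    Uses the normal approximation n = 2 * ((z_alpha/2 + z_power) * sd / effect) ** 2.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) * sd / effect) ** 2)


# Placeholder values: effects of one and one-half standard deviation.
for effect in (1.0, 0.5):
    print(f"effect = {effect} sd units -> {units_per_group(effect, sd=1.0)} units per treatment")
```

Halving the effect we want to detect roughly quadruples the number of units needed, which is why the size of an ecologically important effect needs to be decided before data collection. Designs with random effects or several factors usually need simulation or dedicated software in place of this formula.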

Doing this ‘pre-analysis’ makes us well placed for when we have data, and it should speed things up. In the field, however, things will often go wrong, and some of our initial expectations (e.g. about data distributions and variances) may prove to be inaccurate. We may need to adjust our statistical analysis to account for additional messiness, and check data for outliers, assumptions, etc.

We must resist the temptation to change the model we use to improve our chances of getting a ‘desirable’ result. Transparency about our analysis is essential and is an argument for pre-registration. Pre-registration does not preclude changing the analysis, but it pushes us to explain any changes. The recent trend towards availability of raw data and, where relevant, R code, also improves confidence in the statistical analysis.

The last check comes when interpreting the results. We need to remain vigilant here, particularly in remembering earlier decisions about fixed and random effects, and to be sure that ecological and statistical models are interpreted consistently.

Collecting data to answer ecological questions, particularly in the field, is messy, and there are pitfalls. Those pitfalls can be avoided. We can avoid some of the mess through careful design. Statistical analysis needs to be flexible, rather than recipe based, while making sure that ecological and statistical models are tightly linked. We should be our own harshest critics.

4.8 References

Arnqvist, G. 2020. Mixed models offer no freedom from degrees of freedom. Trends in Ecology and Evolution 35:329-335. doi:10.1016/j.tree.2019.12.004

Bolker, B. M. 2015. Linear and generalized linear mixed models. In Ecological statistics: Contemporary theory and application, eds G. A. Fox, S. Negrete-Yankelevich and V. J. Sosa, 309-334. Oxford: Oxford University Press.

Borer, E. T., W. S. Harpole, P. B. Adler, et al. 2014. Finding generality in ecology: a model for globally distributed experiments. Methods in Ecology and Evolution 5:65-73. doi:10.1111/2041-210x.12125

Catford, J. A., J. R. U. Wilson, P. Pysek, et al. 2022. Addressing context dependence in ecology. Trends in Ecology and Evolution 37:158-170. doi:10.1016/j.tree.2021.09.007

Claesen, A., S. Gomes, F. Tuerlinckx, and W. Vanpaemel. 2021. Comparing dream to reality: an assessment of adherence of the first generation of preregistered studies. Royal Society Open Science 8:211037. doi:10.1098/rsos.211037

Colegrave, N., and G. D. Ruxton. 2017. Statistical model specification and power: recommendations on the use of test-qualified pooling in analysis of experimental data. Proceedings of the Royal Society B-Biological Sciences 284:20161850. doi:10.1098/rspb.2016.1850

Fidler, F., Y. E. Chee, B. C. Wintle, et al. 2017. Metaresearch for evaluating reproducibility in ecology and evolution. BioScience 67:282-289. doi:10.1093/biosci/biw159

Filazzola, A., and J. F. Cahill. 2021. Replication in field ecology: Identifying challenges and proposing solutions. Methods in Ecology and Evolution 12:1780-1792. doi:10.1111/2041-210x.13657

Fox, G. A., S. Negrete-Yankelevich, and V. J. Sosa, eds. 2015. Ecological statistics: Contemporary theory and application. Oxford: Oxford University Press.

Fraser, H., A. Barnett, T. H. Parker, and F. Fidler. 2020. The role of replication studies in ecology. Ecology and Evolution 10:5197-5207. doi:10.1002/ece3.6330

Gelman, A., J. Hill, and A. Vehtari. 2020. Regression and other stories. Cambridge: Cambridge University Press. doi:10.1017/9781139161879

Gelman, A., and E. Loken. 2014. The statistical crisis in science. American Scientist 102:460-465. doi:10.1511/2014.111.460

Gelman, A., and C. R. Shalizi. 2013. Philosophy and the practice of Bayesian statistics. British Journal of Mathematical and Statistical Psychology 66:8-38. doi:10.1111/j.2044-8317.2011.02037.x

Gotelli, N. J., and A. M. Ellison. 2012. A primer of ecological statistics. 2nd edn. Sunderland: Sinauer.

Heffner, R. A., M. J. Butler, and C. K. Reilly. 1996. Pseudoreplication revisited. Ecology 77:2558-2562. doi:10.2307/2265754

Hilborn, R., and M. Mangel. 1997. The ecological detective: Confronting models with data. Princeton: Princeton University Press.

Hurlbert, S. H. 1984. Pseudoreplication and the design of ecological field experiments. Ecological Monographs 54:187-211. doi:10.2307/1942661

Ives, A. R. 2022. Random errors are neither: On the interpretation of correlated data. Methods in Ecology and Evolution 13:2092-2105. doi:10.1111/2041-210X.13971

Janky, D. G. 2000. Sometimes pooling for analysis of variance hypothesis tests: A review and study of a split-plot model. American Statistician 54:269-279. doi:10.2307/2685778

Mayo, D. G. 2018. Statistical inference as severe testing: How to get beyond the statistics wars. Cambridge: Cambridge University Press. doi:10.1017/9781107286184

Mayo, D. G., and D. Hand. 2022. Statistical significance and its critics: practicing damaging science, or damaging scientific practice? Synthese 200:220. doi:10.1007/s11229-022-03692-0

McCarthy, M. A. 2015. Approaches to statistical inference. In Ecological statistics: Contemporary theory and application, eds G. A. Fox, S. Negrete-Yankelevich and V. J. Sosa, 15-43. Oxford: Oxford University Press.

O’Dea, R. E., M. Lagisz, M. D. Jennions, et al. 2021. Preferred reporting items for systematic reviews and meta-analyses in ecology and evolutionary biology: a PRISMA extension. Biological Reviews 96:1695-1722. doi:10.1111/brv.12721

Purgar, M., T. Klanjscek, and A. Culina. 2022. Quantifying research waste in ecology. Nature Ecology and Evolution 6:1390-1397. doi:10.1038/s41559-022-01820-0

Quinn, G. P., and M. J. Keough. 2023. Experimental design and data analysis for biologists. 2nd edn. Cambridge: Cambridge University Press. doi:10.1017/9781139568173

Shavit, A., and A. M. Ellison. 2017. Toward a taxonomy of scientific replication. In Stepping in the same river twice, eds A. Shavit and A. M. Ellison, 3-22. New Haven: Yale University Press.

Spanos, A. 2019. Probability theory and statistical inference: empirical modeling with observational data. 2nd edn. Cambridge: Cambridge University Press. doi:10.1017/9781316882825

Wasserstein, R. L., A. L. Schirm, and N. A. Lazar. 2019. Moving to a world beyond ‘p < 0.05’. The American Statistician 73:1-19. doi:10.1080/00031305.2019.1583913

West, B. T., K. B. Welch, and A. T. Galecki. 2015. Linear mixed models: A practical guide using statistical software. 2nd edn. Boca Raton: CRC Press.

Wootton, H. F., and M. J. Keough. 2016. Disturbance type and intensity combine to affect resilience of an intertidal community. Marine Ecology Progress Series 560:121-133. doi:10.3354/meps11861
