Chapter 25 Comments on exercises

Chapter 1

  1. We are measuring people’s weight, but the scales are incorrectly calibrated, so the true weight is always underestimated.

    Comment: this is a systematic bias, because the defective scales lead to a bias in one direction. Note that scales that overestimated weight would also give systematic bias, whereas scales that were just erratic, and so gave very unstable readings, would be a source of random error.

  2. To get an index of social responsiveness, children’s interactions with others are measured in the classroom. The researcher makes one observation per child, with different children assessed at different times of day during different activities.

    Comment: this method is likely to create random error, because social interaction may be greater with some activities than others. Another way to think about it is that the measure will be ‘noisier’, i.e. more variable, than it would be if the researcher was able to keep the context of the observations more constant from child to child.

  3. In an online experiment where we measure children’s keypress responses on a comprehension test, the internet connection is poor and so drops out intermittently.

    Comment: this is another context that will just increase the random error in the measures, but should not cause systematic bias towards fast or slow measures, provided the drop-out is random. On the other hand, if we were measuring children’s comprehension before and after intervention, and the signal dropout was only a problem during one of the test sessions, then this could introduce systematic bias, because only one set of measures (baseline or outcome) would be affected.

  4. In an intervention study with aphasic individuals, prior to treatment, vocabulary is measured on a naming test by a therapist in the clinic, and at follow-up it is measured at home by a carer who gives clues to the correct answer.

    Comment: This method is likely to produce systematic bias, because the conditions are more likely to lead to correct performance in the follow-up session. This could give an artificially inflated estimate of the effectiveness of intervention. In general, it is advisable to keep testing conditions as similar as possible for baseline and outcome measures. Note, though, that even if we used identical procedures at baseline and outcome, we might still have systematic bias associated with the passage of time, due to spontaneous recovery and greater familiarity with the naming task. This is considered further in Chapter 3.

Chapter 2

  1. The two bars on the left show results obtained in an experimental study where assignment to surgery was done at random, whereas the bars on the right show results from an observational study of patients who were not included in a trial. What do these plots show, and how might it be explained?

    Comment: In Chapter 10, we consider how a randomized trial aims to ensure we are comparing like with like, by assigning people to interventions at random, but there will be people who do not get included in the trial because they are unsuitable, or because they do not wish to take part, and their outcomes may be systematically different from those who take part in the trial. In this example, patients who are at high risk may not be invited to participate in a trial of surgery because they are too frail to undergo the operation. These people feature in the observational study on the right hand side of the plot, but are filtered out of the experimental trial as they do not meet criteria for inclusion.

Chapter 3

  1. A simple way to measure children’s language development is in terms of utterance length. Roger Brown’s (Brown, 1973) classic work showed that in young children Mean Length of Utterance in morphemes (MLU) is a pretty good indicator of a child’s language level; this measure counts each part of speech (morpheme) as an element, as well as each word, so the utterance “he wants juice” has 4 morphemes (he + want + s + juice). Brown’s findings have stood the test of time, when much larger samples have been assessed. Is this evidence of reliability, validity and/or sensitivity?

    Comment: The plot does not give any information about reliability – we can only assess that if we have more than one MLU measurement on a set of children, so we can see how much they change from one occasion to the next.  The plot does support the validity of MLU as a measure of language development: we would expect language skills to increase with age, and this plot shows that they do on this measure.  MLU seems to be more sensitive as an index in younger children, where the slope of the increase with age is steeper. Indeed, it has been suggested it is less useful in children aged 4 years and over, though this may depend on whether one is taking spontaneous language samples, or attempting to elicit more complex language – e.g. in a narrative context. To get a clearer evaluation of the psychometric properties of MLU, we would need to plot not just the changes in mean MLU with age, but also the spread of scores around the mean at each age. If there is a wide range of MLU at a given age band, this suggests the measure may be sensitive at differentiating children with language difficulties from the remainder – though this would only be the case if the measure was also shown to be reliable.  If MLUs at a given age are all bunched together, then this shows that the measure does not differentiate between children, and it is unlikely to be a sensitive measure for an intervention study.

  2. Here we focus on reliability, i.e., the likelihood that you might see similar results if you did the same assessment on two different occasions. How might you expect the reliability of MLU to depend on:

a. Length of language sample
b. Whether the child is from a clinical or typically-developing sample
c. Whether the language sample comes from an interaction with a caregiver vs an interaction with an unfamiliar person

Comment: In general, the more data that is used for a measure, the more reliable it is likely to be. Suppose you had just 10 utterances for computing MLU. Then one very long utterance could exert a big impact on the MLU, whereas the effect would be less if the mean were based on 100 utterances. In terms of the nature of the sample, there’s no general rule, except to say that if a child from a clinical sample produces fewer utterances, then MLU will be based on fewer observations, so may be less reliable. Similarly, it’s hard to predict just how the type of interlocutor will affect a child’s utterances. If a child is shy in the presence of an unfamiliar person, it may be harder to get a good sample of utterances, and the MLU may change over time. On the other hand, someone who is skilled at interacting with children may be better at eliciting complex language from a child who would otherwise tend to be monosyllabic. The point here is to emphasize that psychometric properties of measures are not set in stone, but can depend on the context of measurement.
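The point about sample length can be checked with a few lines of arithmetic. The sketch below is only an illustration: the morpheme counts are invented for the purpose (every utterance has 3 morphemes except one unusually long utterance of 15), and the function name mlu is our own label, not part of any standard tool.

```python
from statistics import mean


def mlu(morpheme_counts):
    """Mean Length of Utterance: average number of morphemes per utterance."""
    return mean(morpheme_counts)


# Invented data (an assumption for illustration): all utterances contain
# 3 morphemes, apart from a single long utterance of 15 morphemes.
short_sample = [3] * 9 + [15]    # 10 utterances in total
long_sample = [3] * 99 + [15]    # 100 utterances in total

mlu_short = mlu(short_sample)
mlu_long = mlu(long_sample)

# The same outlying utterance shifts the MLU far more in the short sample.
print(f"MLU from 10 utterances:  {mlu_short:.2f}")
print(f"MLU from 100 utterances: {mlu_long:.2f}")
```

With these made-up numbers, the single long utterance raises the MLU by more than one morpheme in the 10-utterance sample, but by only about a tenth of a morpheme in the 100-utterance sample.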

  3. Do a literature search to find out what is known about test-retest reliability of MLU. Did your answers to question 2 agree with the published evidence?

    Comment: You will find that tracking down relevant information is quite a major task, but it can be eye-opening to do such an exercise, and it is recommended to do a search for research on outcome measures before embarking on a study that uses them.

  4. You have a three-year-old child with an MLU in morphemes of 2.0. Is there enough information in Figure 3.5 to convert this to a standard score?

    Comment: The answer is no, because you only have information on the mean MLU at each age. To compute a standard score you would also need to know the standard deviation at each age.
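To see why the standard deviation is needed, it may help to write out the conversion. The sketch below assumes the usual z-score definition (raw score minus the age-band mean, divided by the age-band standard deviation). The mean of 3.0 and standard deviation of 0.5 are invented placeholders, not values taken from Figure 3.5 or from any published norms.

```python
def standard_score(raw, age_mean, age_sd):
    """Convert a raw score to a z-score relative to an age band."""
    if age_sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (raw - age_mean) / age_sd


# Placeholder norms (assumptions for illustration only): the figure supplies
# a mean for each age, but the SD would have to come from another source.
child_mlu = 2.0
hypothetical_mean = 3.0
hypothetical_sd = 0.5

z = standard_score(child_mlu, hypothetical_mean, hypothetical_sd)
print(f"z-score under the assumed norms: {z:.1f}")
```

The same raw MLU of 2.0 would give a quite different z-score if the standard deviation were assumed to be 1.0 rather than 0.5, which is why the mean alone is not enough.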

Chapter 4

  1. In their analysis of the original data on the Hawthorne effect, Levitt & List (2011) found that output rose sharply on Mondays, regardless of whether artificial light was altered. Should we be concerned about possible effects of the day of the week or the time of day on intervention studies? For instance: Would it matter if all participants were given a baseline assessment on Monday and an outcome assessment on Friday? Or if those in the control group were tested in the afternoon, but those in the intervention group were tested in the morning?

Comment: In Chapter 5 we discuss further how designs that just compare the same individuals before and after intervention are likely to be misleading, in part because time is an unavoidable confound. Time of day can be important in some kinds of biomedical studies, and with children it may be related to attentiveness. In general, it would be advisable to avoid the kind of morning vs afternoon confound in the second example - as we shall see in Chapter 6, with a strong research design, the experimenter will be unaware of whether someone is in an intervention or control group, and so this problem should not arise. Day of the week is seldom controlled in intervention studies, and it could pose logistical problems to try to do so. Here again, a study design with a control group should take care of the problem, as any differences between Mondays and Fridays would be seen in both groups. Interpretation is much harder if we have no controls.

  2. EasyPeasy is an intervention for preschoolers which “provides game ideas to the parents of preschool children to encourage play-based learning at home, with the aim of developing children’s language development and self-regulation.” In a report by the Education Endowment Foundation, the rationale is given as follows: “The assumption within the theory of change is that the EasyPeasy intervention will change child self-regulation which will lead to accelerated development in language and communication and improved school readiness. The expectation is that this will be achieved through the nursery teachers engaging with the parents regarding EasyPeasy and the parents engaging with their children through the EasyPeasy games. As well as improved self-regulation and language and communication development from playing the games, the expectation is that there will also be an improved home learning environment due to greater parent-child interaction. The expected impact was that this will lead to an improvement in children’s readiness to learn along with improved parental engagement with the school.”

Suppose you had funds to evaluate EasyPeasy. What would you need to measure to assess outcomes and mechanism? What would you conclude if the outcomes improved but the mechanism measures showed no change?

Comment: The theory of change is quite complex as it involves both aspects of child development and of the environment. Within the child, the ultimate outcome of interest appears to be language and communication skills, whereas self-regulation is a mechanism through which these skills are improved. In addition, to fully test the model, we would need measures of the engagement of nursery teachers with parents, the home learning environment, and parental engagement with school, all of which appear to be regarded as mechanisms that will contribute to child language development.

One point to note is that although the theory of change would predict that the amount of improvement in language will vary with the amount of improvement in self-regulation and environmental measures, if this is not found, we would need to consider whether the measures were sensitive and reliable enough to detect change - see Chapter 3.

Chapter 9

  1. A study by Imhof et al. (2023) evaluated a video-coaching intervention with parents of children involved in a Head Start Program in the USA. Read the methods of the study and consider what measures were taken to encourage families to participate. How successful were they? What impact is there on the study of: (a) failure to recruit sufficient families into the study, (b) drop-out by those who had been enrolled in the intervention?

Comment: Failure to recruit sufficient families will affect the statistical power of the study, which is a topic covered by Chapter 13. In brief, it means that the study may fail to detect genuine effects of the intervention unless they are large - subtle improvements might be missed. As discussed in the current chapter, missing data from drop-outs can be handled statistically, provided the data meet certain assumptions. In this study, the researchers did a statistical test to demonstrate that the data could be regarded as “Missing Completely At Random”. They did not use imputation to handle the missing data, but used a method called Full Information Maximum Likelihood estimation, which is briefly described here: https://www.missingdata.nl/missing-data/missing-data-methods/full-information-maximum-likelihood/.

Chapter 12

  1. Figure 12.3 shows data from a small intervention trial in the form of a boxplot. Check your understanding of the annotation in the top left of the plot. How can you interpret the p-value? What does the 95% CI refer to? Is the “mean.diff” value the same as shown in the plot?
    Comment: To take the last question first, the “mean.diff” value of 9.11 is in fact slightly larger than the difference between the horizontal bars on the boxplot. This is because boxplots typically show the median rather than the mean group value. There are a great many ways of representing summary information, and the boxplot is one popular format. If you are unfamiliar with boxplots or any other types of graph you encounter, we recommend Googling for more information.
    The first line of the annotation gives the standard output for a t-test, including the t-value, degrees of freedom and associated p-value. The 95% confidence interval expresses the uncertainty around the estimate of the mean difference. Note that with such a small sample, you get a very large confidence interval. The confidence interval spans zero, reflecting the fact that the p-value is greater than .05 and so does not reach the conventional level of statistical significance.
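    The gap between a difference in means and a difference in medians is easy to reproduce. The sketch below uses invented scores, not the data behind Figure 12.3, in which one high score in the intervention group pulls the mean upwards but leaves the median untouched.

```python
from statistics import mean, median

# Invented scores (an assumption for illustration, not the trial data).
intervention = [12, 14, 15, 16, 40]   # one unusually high score
control = [10, 11, 12, 13, 14]

# A t-test compares group means; a boxplot's central bar shows the median.
mean_diff = mean(intervention) - mean(control)
median_diff = median(intervention) - median(control)

print(f"Difference in means:   {mean_diff:.1f}")
print(f"Difference in medians: {median_diff:.1f}")
```

    With skewed data like these, the difference in means is noticeably larger than the distance between the median bars, which is the pattern described in the comment above.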

Chapter 13

  1. We have two studies of a spelling intervention, both using increase in number of words correct on a spelling test as the outcome measure. In both studies, two groups of 20 children were compared. In study A, the intervention group gained 10 words on average, with standard deviation of 2, and the control group gained 5 words, also with standard deviation of 2. In study B, the intervention group gained 15 words on average, with standard deviation of 5, and the control group gained 8 words with standard deviation of 5. Which study provides stronger evidence of the effectiveness of the intervention?
    Comment: Although study B leads to a larger average gain in words correct, (15-8) = 7, the standard deviation is high. Study A has a smaller absolute gain, but with a much smaller standard deviation. We can convert the results to effect sizes, which shows that for study A, Cohen’s d is (10-5)/2 = 2.5, whereas in study B, Cohen’s d is (15-8)/5 = 1.4. So the evidence is stronger for study A. Having said that, in this fictitious example, both effect sizes are very large. With real-life studies, we would typically get effect sizes in the range of .3 to .4.
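    The effect size calculation can be reproduced in a few lines. This is a minimal sketch that assumes Cohen’s d is computed as the difference in mean gains divided by the pooled standard deviation; because both groups in each study have the same size and the same standard deviation, the pooled value is simply that common standard deviation. The function names are our own.

```python
from math import sqrt


def pooled_sd(sd1, n1, sd2, n2):
    """Pooled standard deviation of two independent groups."""
    return sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))


def cohens_d(mean_intervention, mean_control, sd_pooled):
    """Standardized mean difference between intervention and control."""
    return (mean_intervention - mean_control) / sd_pooled


# Values taken from the exercise: two groups of 20 children per study.
d_a = cohens_d(10, 5, pooled_sd(2, 20, 2, 20))
d_b = cohens_d(15, 8, pooled_sd(5, 20, 5, 20))

print(f"Study A: d = {d_a:.1f}")
print(f"Study B: d = {d_b:.1f}")
```

    Study B has the larger raw difference (7 words against 5), but once each difference is scaled by its standard deviation, study A gives the larger effect size.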

Chapter 14

  1. Type the phrases “did not survive Bonferroni correction” and “intervention” into a search engine to find a study that corrected for multiple comparisons. Was the Bonferroni correction applied appropriately? Compare the account of the results in the Abstract and the Results section. Was the Bonferroni correction taken seriously?
    Comment: In terms of whether the Bonferroni correction was appropriate, you can check two things. First, was the alpha level adjusted appropriately for the number of tests? That is usually correct, as it is simple arithmetic: the alpha level is divided by the number of tests. The other question to consider is whether the correction was applied to correlated variables. This is not an issue when the correction is applied to subgroups (e.g. males vs females) but it can affect findings when there are multiple measures on individuals. As explained in Chapter 14, if a set of measures is strongly intercorrelated, then the Bonferroni correction may be over-conservative - i.e. it may lead to a type II error, where a true effect gets missed. However, it is not always possible to judge whether this is the case, as authors may not report the intercorrelations between measures. Finally, it is not uncommon to find authors apply a Bonferroni correction and then fail to take it seriously. This is often evident when the Abstract refers only to the uncorrected statistics.
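    The first check, whether alpha was adjusted for the number of tests, can be done by hand or with a short script. The sketch below assumes the standard Bonferroni rule, in which each p-value is compared against alpha divided by the number of tests. The four p-values are invented for illustration and do not come from any particular study.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests


def survives_bonferroni(p_values, alpha=0.05):
    """Flag which p-values fall below the corrected threshold."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return [p < threshold for p in p_values]


# Invented p-values for four outcome measures (an assumption for illustration).
p_values = [0.004, 0.020, 0.030, 0.450]

threshold = bonferroni_threshold(0.05, len(p_values))
flags = survives_bonferroni(p_values)

print(f"Corrected threshold: {threshold}")
for p, ok in zip(p_values, flags):
    print(f"p = {p:.3f}: {'survives' if ok else 'does not survive'} correction")
```

    In this made-up case, three results would count as significant at the uncorrected .05 level, but only one survives correction. An Abstract that reported all three as positive findings would not be taking the correction seriously.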

Chapter 16

Here are some characteristics of mediators and moderators. See if you can work out whether M refers to mediation or moderation.
1. In a M relationship, you can draw an arrow from intervention to M and then from M to the outcome. Mediation
2. M refers to the case when something acts upon the relationship between intervention and outcome, and changes its strength. Moderation
3. M can be thought of as a go-between for intervention and outcome: it is a variable that explains their relationship. Mediation
4. Figure 16.3 shows an intervention where both mediation and moderation are involved.
Comment: M1 is a moderator and M2 is a mediator.

Chapter 19

  1. Re Calder et al. (2021) study: What could you conclude if:

There was no difference between Groups 1 and 2 on past tense items at phase 4, with neither group showing any improvement over baseline?
Comment: This would suggest the intervention was ineffective, at least over the time scale considered here.
There was no difference between Groups 1 and 2 at phase 4, with both groups showing significant improvement over baseline?
Comment: This would suggest that improvement in both groups might be due to practice on the test items.
Group 1 improved at phase 4, but Group 2 did not improve between phase 4 and phase 6?
Comment: This would be a puzzling result that in effect is a failure to replicate the beneficial effect seen in Group 1. Failures to replicate can just reflect the play of chance - the improvement in Group 1 could be a fluke - or they could indicate that there are individual differences in response to intervention, and that, by chance, Group 2 contained more children who were unresponsive.

Is it realistic to consider a possible ‘nocebo’ effect during the extended waitlist for Group 2? Is there any way to check for this?
Comment: It is possible that children may get bored at repeatedly being assessed, especially if they are not receiving any help to master the test items. If so, then their performance might actually deteriorate across the baseline period.

Chapter 20

1. Re Koutsoftas study: Consider the following questions about this study.
a. What kind of design is this?
Comment: Multiple baseline across participants: comparison between 2 groups who are initially observed over a baseline period, and then have intervention introduced at different timepoints.
b. How well does this design guard against the biases shown in the first 5 rows of Table 20.3?
Comment: Inclusion of a baseline period establishes how much change we might see simply because of spontaneous improvement, practice effects and regression to the mean, and also helps establish how noisy the dependent variable is.
c. Could the fact that intervention was delivered in small groups affect study validity? (Clue: see Chapter 18).
Comment: This does make this more like a clustered trial, and introduces the possibility that those within a group might influence one another.
d. If you were designing a study to follow up on this result, what changes might you make to the study design?
Comment: It might be useful to have more baseline observations, to apply the intervention to individuals rather than groups, and/or to include another ‘control’ measure tapping into skills that were not trained.
e. What would be the logistical challenges in implementing these changes?
Comment: The methodological changes suggested in (d) would make the study harder to do; having more sessions could tax the patience of children and teachers, and might be difficult to accommodate with the school timetable; similarly, applying the intervention to individuals would require more personnel. It might also change the nature of the intervention if interactions between children are important. Adding another control measure would lengthen the amount of assessment, which could be difficult with young children.

3. Re Calder et al: Once you have studied Figure \(\ref{fig:Calderplus}\), consider whether you think the inclusion of the probes has strengthened your confidence in the conclusion that the intervention is effective.
Comment: The intervention effect was more pronounced for the trained items, and so cannot simply be attributed to practice or spontaneous improvement. This study illustrates how inclusion of control items that were not trained can add converging evidence to support the conclusion that an intervention is effective.

Chapter 24

  1. Read this preregistration of a study on Open Science Framework: https://osf.io/ecswy, and compare it with the article reporting results here: https://link.springer.com/article/10.1007/s11121-022-01455-4. Note any points where the report of the study diverges from the preregistration and consider why this might have happened. Do the changes from preregistration influence the conclusions you can draw from the study?

Comment: Compare your analysis with the evaluation of the study that was posted on Pubpeer here: https://pubpeer.com/publications/1AB6D8120F565F8660D2933A87AB87.

References

Brown, R. (1973). A First Language. Harvard University Press.
Calder, S. D., Claessen, M., Ebbels, S., & Leitão, S. (2021). The efficacy of an explicit intervention approach to improve past tense marking for early school-age children with Developmental Language Disorder. Journal of Speech, Language, and Hearing Research, 64(1), 91–104. https://doi.org/10.1044/2020_JSLHR-20-00132
Levitt, S. D., & List, J. A. (2011). Was there really a Hawthorne Effect at the Hawthorne Plant? An analysis of the original illumination experiments. American Economic Journal: Applied Economics, 3(1), 224–238. https://doi.org/10.1257/app.3.1.224