40 Reading and critiquing research
So far, you have learnt about the process of research: asking a RQ, designing a study, collecting data, describing and summarising the data, and analysing the data (confidence intervals; hypothesis tests). In this chapter, you will learn to:
 read and critique research.
40.1 Introduction
Scientific practice requires reading the research of others. Current practice and advances in every evidence-based discipline build on research, so being able to critique the research of others is important. (A critique evaluates: identifying what is good, and what can be improved.) Research is usually communicated in journal articles or, less formally, in presentations (Chap. 39).
At some time during your studies or employment, you will need to read research articles: to understand current practices in your discipline; to know why your discipline does things as it does; to critique the evidence for current or new practices; and to identify open or unresolved questions in your discipline. Understanding the language and concepts of research is important for understanding these articles, even if you will not be conducting your own research.
Reading research articles can be challenging. Rather than reading articles thoroughly from start to finish, first read the Abstract (also called a Summary, or Overview) to obtain a useful overview of the whole paper, without the details. Then, read the Conclusion, which highlights the important findings. Next, skim the rest of the article (perhaps focusing on graphs and tables of results). Finally, if necessary, read the paper in full for the details.
Terminology varies widely in research (Sect. 39.5.3). For example, explanatory variables can be called, among other terms, independent variables, predictors, regressors or covariates (P. K. Dunn et al. 2016). Check the terminology being used if unsure!
The six steps of the research process (Sect. 1.5) can be used as a guide to critiquing the research, though the expectations vary greatly between disciplines and between journals:

Asking the question:
 What research question is the paper answering?
 Why is this important?
 To what population will the results apply?
 Are there important inclusion and/or exclusion criteria that apply?
 What are the units of analysis and observation?
 Are the definitions clear and appropriate?

Designing the study:
 Is the study observational or experimental?
 Is the study well-designed?
 How many individuals are in the study?
 How was the sample obtained, and what are the implications for external validity?
 Is the study designed to maximize internal validity?
 What are the design limitations?
 Are there ethical concerns?
 What is the source of funding?

Collecting the data:
 How were the data collected?
 Are the necessary details provided so the study can be approximately replicated?

Summarising the data:
 Is the data summary appropriate, complete and clear?
 What do these summaries reveal about the data?
 What do the tables and graphs reveal about the data and relationships?

Analysing the data:
 What types of confidence intervals and/or hypothesis tests were used?
 Is the analysis appropriate, accurate, valid and clear?
 What do the results mean?
 Are the results statistically valid?

Reporting the results:
 What are the main conclusions, and how do they answer the RQ?
 Are the conclusions consistent with the results?
 Are the results accurate, appropriate and well-reported?
 Are the results of practical importance?
 Are the study limitations acknowledged, and their implications discussed?
 What other questions have emerged?
40.2 Example: blue light and sleep
The Abstract of a study of the impact of 'blue light' emitted by electronic devices on sleep (Randjelović et al. 2023), slightly edited for clarity, appears below:
The exposure of humans to artificial light at night... with predominant blue part of the visible spectrum is strongly influencing...sleep...
We hypothesized that reducing the amount of emitted blue light from screens of mobile phones during the night will increase sleep quality in our student population.
The aim [...] was to investigate the effect of reducing blue light from smartphone screen during the night on subjective quality of sleep among students of medicine.
The target population was students of medicine aged \(20\) to \(22\) years old of both sexes. The primary outcome of the study was subjective sleep quality, assessed by the Serbian version of the Pittsburgh Sleep Quality Index (PSQI). The mean total PSQI score before intervention was \(6.83\pm 2.73\) (bad), while after the intervention the same score was statistically significant reduced to \(3.93 \pm 1.68\) (good)...
The study has shown that a reduction of blue light emission from LED backlight screens of mobile phones during the night leads to improved subjective quality of sleep in students...
As this is the Abstract, many details will be absent (but explained in the article itself). Nonetheless, a lot can be learnt about the study from the Abstract:

Asking the RQ:
 This is a repeated-measures RQ: data are collected before and after the intervention from the same people.
 The population is 'students of medicine aged \(20\) to \(22\) years old'.
 The units of analysis and observation are the individual students in the study. There are two units of observation for each unit of analysis: each person has a before and after measurement.
 The outcome is the (average) subjective sleep quality, as measured by PSQI.

Designing the study:
 The sampling method is not stated, but is likely to be voluntary-response.
 The sample size is not given.

Analyse the data:
 A quantitative variable is being compared within individuals, so a paired \(t\)-test is the likely method of analysis (Sect. 33).

Report the results:
 Means are given before and after, but no information on the change is given.
 The numbers that follow the \(\pm\) are not explained: are they confidence interval limits, standard deviations, IQRs, ranges, standard errors?
 No \(P\)-values or confidence intervals for the change are given.
Reading more of the article obviously provides more information.
40.2.1 Further details
Further details revealed from reading the article include:
 The population comprises students of medicine from the University of Niš, Serbia, aged \(20\) to \(22\) (p. 336), so the results strictly apply only to these people. Other students of a similar age are probably not substantially different in this regard; however, the results may not apply to older people (whose phone use, for instance, may be quite different).
 The inclusion criteria are 'owning and daily usage of mobile phone with Android operating system and AMOLED screen' (p. 336). This suggests that results may not apply to users of iPhones.
 The exclusion criteria are 'sleep disorders, usage of sedative drugs, psychoactive substances, usage of phone apps or glasses that reduce blue light during the night, recent stressful situations' (p. 336).
 The response variable is the PSQI total score, which ranges from \(0\) to \(21\); the Abstract implies smaller scores are better.
 The intervention is the use of a 'free Android application Twilight [...] on mobile phones of participants [which] automatically decreases brightness and content of emitted blue color from the screen during the night time' (p. 336).
 Thirty students (p. 337) were used in the study (the study started with \(37\) students; seven dropped out).
 Results are 'presented as mean and standard deviation before and after intervention' (p. 337), so presumably the numbers in the Abstract after the \(\pm\) are standard deviations.
 The method of analysis was a paired \(t\)-test, as presumed (p. 337). The \(P\)-value for the \(t\)-test is \(P < 0.0001\) (p. 337), so there is very strong evidence to support a change in mean PSQI.
 Ethical approval was granted (p. 336).
 The funding was from the Ministry of Education, Science and Technological Development, Republic of Serbia (p. 341), suggesting no conflicts of interest.
 The data are not available, so the research is not completely reproducible.
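The summary numbers above allow a small check of what can, and cannot, be recovered from the reported results. The sketch below (Python; all variable and function names are illustrative, not from the article) computes the mean change in PSQI and the standard error of each mean, using \(n = 30\) and the reported standard deviations. It also shows why the standard deviation of the changes matters: for paired data, the standard error of the mean change depends on the correlation between the before and after scores, which the reported summaries do not provide. The correlation values used here are hypothetical, chosen only to show how much the \(t\)-score could vary.

```python
import math

# Values reported in the article (Abstract and p. 337)
n = 30
mean_before, sd_before = 6.83, 2.73
mean_after, sd_after = 3.93, 1.68

# The mean change CAN be recovered from the two means
mean_change = mean_before - mean_after

# Standard error of each mean separately: s / sqrt(n)
se_before = sd_before / math.sqrt(n)
se_after = sd_after / math.sqrt(n)


def se_mean_change(r):
    """Standard error of the mean change for paired data, given an
    ASSUMED correlation r between before and after scores.
    (r is not reported in the article; it is hypothetical here.)"""
    var_change = sd_before**2 + sd_after**2 - 2 * r * sd_before * sd_after
    return math.sqrt(var_change / n)


print(f"Mean change: {mean_change:.2f}")
print(f"s.e. (before): {se_before:.3f}; s.e. (after): {se_after:.3f}")

# Hypothetical correlations, to show the sensitivity of the t-score
for r in (0.0, 0.3, 0.6):
    se = se_mean_change(r)
    print(f"assumed r = {r:.1f}: s.e. = {se:.3f}, t = {mean_change / se:.2f}")
```

Because the \(t\)-score changes with the assumed correlation, the reported means and standard deviations alone do not determine the result of the paired \(t\)-test; the standard deviation of the changes (or the raw data) is needed.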
40.2.2 Strengths
Strengths of the study include:
 The researchers compared the students who remained in the study with those who dropped out of the study (Table 1); they found no evidence that those who dropped out and those who stayed were different on the variables studied (i.e., there is no evidence that the dropouts introduced a bias).
 The sample size of \(n = 30\) suggests the \(t\)-test is statistically valid. The required sample size was estimated using software (p. 336).
 An excellent case-profile plot of the data is shown (their Fig. 1), and the data are usefully summarised in their Table 2.
The authors identify strengths of the study as (p. 341):
... being specific to investigate the impact of mobile phones to sleep quality, natural setting of the intervention, precalculated sample size with appropriate achieved power and significant effect sizes reported (medium to large).
40.2.3 Limitations
Limitations of the study identified include:
 The study is conducted 'in complete dark room without additional light'; that is, a partially artificial environment, so the results may not be ecologically valid.
 Since 'each participant was informed about the detailed plan of the study...' due to ethics requirements (p. 336), the participants were not blinded to being in a study, nor to the purpose of the study. In addition, no control group was used (p. 337). That is, the Hawthorne effect may impact the results: the change in PSQI may have occurred because students knew they were in the study, and not because of the reduction in blue light. The use of a control group would have been useful.
 Participants for the study were chosen 'on voluntary basis' (p. 336), so the sample was not a random sample. The study may not be externally valid, but the students in the study probably would not be very different from students who did not volunteer for the study.
 Since the response (PSQI) is measured using a subjective questionnaire completed by the participants, the placebo effect may be of concern (using objective measures is better when possible).
 Participants completed the questionnaire pre-intervention, then 'used the app for one month period [and] at the end they completed PSQI once again'. This suggests that the carryover effect may be an issue, since no random allocation was used to decide which situation was evaluated first.
 Scores are only reported for before and after intervention, not for the changes themselves.
 No potential confounding variables are identified.
The authors identify limitations of the study as (p. 341):
... lack of generalization to other population groups, the lack of control group, the very nature of questionnaire as subjective instrument, duration of intervention, difference in devices used as well as usage time, confounding by other light sources at night.
These include the acknowledgement of potential confounding variables.
40.3 Chapter summary
The six steps of research can be used as a scaffold for critiquing research articles. Start by reading the Abstract (or Summary) for an overview, then read the Conclusion, and then skim the rest of the article (perhaps focusing on graphs and tables of results). If necessary, read the paper in full for the details.
40.4 Quick review questions
Are these statements true or false?
 The best way to read an article is to read the whole article thoroughly, from start to finish.
 The six steps of research are a useful scaffold for critiquing an article.
 Critiquing an article means to find all the problems.
40.5 Exercises
Answers to odd-numbered exercises are available in App. E.
Exercise 40.1 Duncan et al. (2018) examined the accuracy of step counts, as recorded on iPhones. The paper states that participants
... were recruited through word of mouth and posters displayed around the [researcher's] university. Participants were eligible if they were ambulatory, \(\ge 18\) years of age, and owned an iPhone 6 [...] or newer model.
 How would you describe the sampling method? What is the implication?
 How would you describe the information given about the subjects needing to be ambulatory and 18 years of age or over?
Although \(33\) participants were selected, the authors note some parts of the study used a smaller sample size because one subject lost their phone, while others chose to withdraw from the study.
 Why did the authors discuss the changes in sample size for some parts of the study?
The paper notes that previous studies have been able to:
[...] demonstrate the accuracy of the iPhone pedometer function in laboratory test conditions. However, no studies have attempted to evaluate evidence [...] in the field.
 Describe the issue that the authors raise with previous studies, using the language in this book.
 Among many other things, the researchers compared the mean difference between the number of step counts recorded by manually counting steps (mean: \(92.6\)) and the iPhone-recorded number of steps (mean: \(85.4\)). What statistical test would be appropriate?
 What hypotheses are being tested?
 While walking at \(2.5\) km.h\(^{-1}\), the above test resulted in \(t = 2.95\). What is the approximate \(P\)-value? Interpret the results.
 The sample size for the part of the study mentioned above was \(n = 32\). Is the test statistically valid?
Exercise 40.2 Mohammadpoorasl et al. (2018) studied the relationship between hearing loss, and headphone and earphone use among Iranian students, using a non-directional study. The article states that:
... \(890\) students were randomly selected from five schools at QUMS (Medicine, Dentistry, Nursing and Midwifery, Public Health, and Paramedical Sciences schools) using a proportional cluster sampling method...
Only \(866\) of the \(890\) students agreed to participate in the study; of these, \(745\) used earphones. The participants completed a hearing test and a Hearing Loss Questionnaire (HLQ; values between \(17\) and \(34\): higher scores indicating more severe hearing loss).
 What is the population?
 Is this an observational or experimental study?
 Critique the sampling method. What is the implication for interpreting the results of the study?
One question in the HLQ is:
Does a hearing problem cause you difficulty when listening to TV or radio?
 What is a potential problem with this question?
 Compute the \(95\)% confidence interval for the proportion of students who had used earphones.
Some of the results are presented in Table 40.1, including a comparison of the mean HLQ scores for females and males.
 What statistical test was appropriate for comparing the mean scores for males and females?
 What are the hypotheses being tested?
 What is the standard error for the difference between the means?
 Perform the hypothesis tests; what do the results mean?
 Compute the approximate \(95\)% confidence interval for the difference between the means.
 Are the test and the CI statistically valid?
Table 40.1 also compares the HLQ scores for the frequency of earphone use specifically.
 What are the hypotheses being tested?
 Why is the sample size for this comparison only \(791\) and not \(845\)?
 Interpret the \(P\)-value for this test; what do the results mean?
Table 40.1 also compares the HLQ scores for those who use and do not use earphones.
 Form an approximate \(95\)% CI for the mean hearing loss score for students who use earphones.
 Compute the standard error of the difference between the mean hearing loss score for students who use and do not use earphones.
 Perform a hypothesis test to compare the difference between the mean hearing loss score for students who use and do not use earphones, and confirm that the \(P\)-value is indeed very small.
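Table 40.1: HLQ scores by sex, by frequency of earphone use, and by earphone use

| Levels | Sample size | Mean | Std dev. | \(P\)-value |
|---|---|---|---|---|
| Sex | | | | |
| Female | \(543\) | \(19.37\) | \(2.91\) | \(0.009\) |
| Male | \(302\) | \(19.99\) | \(3.51\) | |
| Frequency of earphone use | | | | |
| \(0\), \(1\) times/day | \(194\) | \(19.20\) | \(2.87\) | \(0.001\) |
| \(2\) to \(3\) times/day | \(319\) | \(19.60\) | \(2.66\) | |
| More than \(3\) times/day | \(278\) | \(20.20\) | \(3.54\) | |
| Earphone use | | | | |
| Yes | \(745\) | \(19.80\) | \(3.08\) | \(< 0.001\) |
| No | \(100\) | \(19.00\) | \(1.71\) | |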
Exercise 40.3 Mesrkanlou et al. (2023) studied the effect of an earthquake on pregnant mothers in Varzaghan, Iran (p. 2), using:
... \(1000\) cases of pregnant women living in urban and rural areas of Varzaghan city that consisted of \(550\) pre-earthquake and \(450\) post-earthquake cases.
The researchers compared the mothers in the two groups (pre- and post-earthquake) on various measurements. For example, the mean age of mothers in the pre-group was \(25.82\) y, and in the post-group was \(26.71\) y; the difference has a \(P\)-value of \(0.084\).
 What does this result mean?
 Why did the researchers make this comparison?
 What type of hypothesis test was used to make this conclusion?
The researchers also compared the mean birth weights of the babies born to the mothers in the two groups. In the pre-group, the mean birth weight was \(3.25\) kg (\(s = 0.52\)) and in the post-group the mean birth weight was \(3.18\) kg (\(s = 0.54\)).
 Compute the standard error for comparing the difference between the two means.
 Perform a hypothesis test to compare the mean birthweights.
 What does this result mean?
 The researchers give the (two-tailed) \(P\)-value for this test as \(0.001\). Is this consistent with your calculations?
The researchers also compared the percentage of babies with a Low Birth Weight (LBW; less than \(2.5\) kg). For the pre-group, the percentage was \(6.01\)%; for the post-group, the percentage was \(8.92\)%.
 What type of definition is given for LBW?
 Construct the two-way table for displaying these data.
 What type of test was probably used for this comparison?
 For the test, \(\chi^2 = 3.052\). Deduce the equivalent \(z\)-score and the approximate \(P\)-value.
 What limitations can you identify for this study?
Exercise 40.4 Tracy, Oster, and Beaver (1990) studied the selenium (Se) concentration in irrigation and stock water sources in California. For drinking water, the maximum recommended concentration was \(10\) \(\mu\)g.L\(^{-1}\); for irrigation water, the maximum recommended concentration was \(20\) \(\mu\)g.L\(^{-1}\).
Part of the study examined the area within \(5\) km of wells. When Pliocene rocks were within this radius, the relationship between the Se concentration \(y\) in the water and the electrical conductivity of the water \(x\) (in deciSiemens per meter, dS.m\(^{-1}\)) was \[ \hat{y} = 3.1 + 7.0x, \] where \(R^2 = 27\)%.
 What is the value of the correlation coefficient?
 Interpret the meaning of \(R^2\).
 The \(P\)-value for testing the slope is given as \(<0.001\). Interpret what this means in this context.
 What are the units of the slope?
For the \(n = 151\) wells in the study, Table 40.2 shows the selenium concentration of the water and the geology within \(5\) km of the well.
 What hypotheses are being tested by the table?
 The article states that \(\chi^2 = 31.5\). What is the equivalent \(z\)-score for the test?
 What is the approximate \(P\)-value for the test?
 Interpret what this means in this context.
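Table 40.2: Selenium concentration of the water and the geology within \(5\) km of the well

| | No | Yes |
|---|---|---|
| Se concentration \(\le 2\) \(\mu\)g.L\(^{-1}\) | \(78\) | \(15\) |
| Se concentration \(> 2\) \(\mu\)g.L\(^{-1}\) | \(23\) | \(35\) |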
Exercise 40.5 M. C. Russell (2023) compared the larvae of two types of mosquitoes: Ae. albopictus (an invasive species) and Cx. pipiens (a native species). One study compared the survival rates of the larvae at two temperatures (p. 4):
The probability of survival among Ae. albopictus control larvae was \(86.8\)% at \(15^\circ\)C and \(86.1\)% at \(25^\circ\)C and did not differ significantly based on temperature (\(p\text{-value} = .8076\)).
 What type of test was probably used?
 Interpret what the \(P\)-value means in this context.
The researchers also compared the size of the surviving larvae (p. 4 and 5):
The results of a Welch two sample \(t\)-test showed that surviving Cx. pipiens control larvae were significantly larger than surviving Ae. albopictus control larvae (\(\text{mean}\pm\text{SD}\): Cx. pipiens \(= 1.64 \pm 0.18\) mm, Ae. albopictus \(= 1.36 \pm 0.13\) mm; \(p\text{-value} < .0001\)).
The two sample sizes are \(n = 410\) and \(n = 498\) respectively.
 How would these results be interpreted?
 What type of test would probably have been used?
 Compute the standard error for the difference between the two types of mosquitoes.
 Compute the \(t\)-score and approximate \(P\)-value for the test. What does this mean?
 Is the \(P\)-value in the article consistent with your calculations?
 Is the test statistically valid?
The lengths of the surviving larvae from both species were also compared for the two temperatures (p. 5; line break added):
Within Cx. pipiens, surviving control larvae were significantly larger from replicates held at \(15^\circ\)C, relative to those from \(25^\circ\)C (\(\text{mean} \pm \text{SD}\): \(15^\circ\)C \(= 1.66 \pm 0.01\) mm, \(25^\circ\)C \(= 1.60 \pm 0.02\) mm; \(p\text{-value} = .0065\)).
For Ae. albopictus surviving control larvae, there was no significant difference in length due to temperature (\(\text{mean}\pm\text{SD}\): \(15^\circ\)C \(= 1.35 \pm 0.01\) mm, \(25^\circ\)C \(= 1.36 \pm 0.01\) mm; \(p\text{-value} = .4343\)).
 How would these results be interpreted?
 What type of test would probably have been used?
The linear association between predation efficiency (\(y\); as a percentage) and predator-prey size ratio (\(x\); no units) was found (using \(n = 45\)) to be \[ \hat{y} = 19.56 + 31.64x. \] The standard errors of the two regression coefficients were \(17.92\) (intercept) and \(13.88\) (slope).
 Find an approximate \(95\)% confidence interval for each regression parameter.
 Estimate the \(P\)-value for testing if the population slope is zero or not. Interpret what this means.
 Is this test statistically valid?
 Interpret the meaning of the slope.
 The value of \(R^2\) was given as \(0.087\) (i.e., \(8.7\)%); interpret this value.
 Find the value of the correlation coefficient, \(r\).
Exercise 40.6 Li, Jia, and Zhang (2017) studied the maximum mouth opening (MMO; in mm) for \(452\) Chinese adults aged from \(20\) to \(35\).
 Would the individuals in the study likely have been blinded? Explain. What are the implications?
The correlation between height and MMO was given as \(r = 0.54\) with \(P < 0.001\).
 What does this mean?
 Compute and interpret the value of \(R^2\).
The regression equation relating the height \(x\) (in cm) and MMO \(y\) was given as \(\hat{y} = 0.36x - 10.15\).
 Interpret the estimates of the regression parameters.
 Use the regression equation to predict the MMO for a person \(179\) cm tall.
The paper states that the mean MMO of males was \(54.18\) mm (\(s = 5.21\)), and for females was \(49.62\) mm (\(s = 3.69\)).
 What type of hypothesis test would have been used to compare the mean MMO for males and females?
 The paper reports the \(t\)-score for comparing MMO for males and females as \(t = 10.63\). What would the \(P\)-value be?
 Is this result statistically valid?
 What is the meaning of this comparison?
 Is gender likely to be a confounding variable in this regression analysis? Explain carefully.
The authors state one of the limitations as:
First, participants were recruited from a pool of people who were undergoing regular medical examinations in our hospital [...]
 What does this mean? What are the implications? Are there other limitations?
Exercise 40.7 Drinkwater et al. (1995) compared tomatoes growing on conventional (CNV; \(n = 14\)) and organic (ORG; \(n = 17\)) farms. The researchers (p. 1100):
[...] sampled tomato fields between April and September in 1989 and 1990 [...] Each of these sampling areas was divided into \(20\) sections, and \(1\) m of the tomato row was selected randomly within each section...
 Explain what type of sampling is being used.
One important measure of soil health is the number of actinomycetes. Researchers found (p. 1103):
... total numbers of actinomycetes isolated from tomato rhizosphere soils were significantly larger in the ORG soils compared to CNV (Student's \(t\) test, \(t = 5.4\), \(P = 0.006\))...
 What type of test would probably have been used to reach this conclusion?
 Explain what the results mean.
 Are the results statistically valid?
The researchers also found that (p. 1103):
... starch hydrolyzing actinomycetes were more numerous in CNV than in ORG soil (Student's \(t\) test, \(t = 4.0\), \(P = 0.005\)).
 What type of test would probably have been used to reach this conclusion?
 Explain what these results mean.
They also found that (p. 1103):
Total actinomycete abundance [was] negatively correlated with corky root [a disease] severity (\(r = -0.76\), \(P = 0.08\);...).
 Explain what these results mean.
 Compute and interpret the value of \(R^2\).
Exercise 40.8 Teo et al. (2022) studied pregnant Malaysian women with sleeping disruptions in the last month of pregnancy. The \(56\) patients were (p. 1):
... randomized to the use of eye-mask and earplugs or "sham" headbands during night sleep (both introduced as sleep aids).
Thus, two groups were used: one using eye-masks and earplugs (treatment group, T; \(n = 29\)) and one using headbands (control or placebo group, P; \(n = 27\)).
 What was the purpose of using 'sham' headbands if it was an ineffective intervention?
 What type of study is this: experimental or observational? Explain.
Sleep duration was measured in Week 1 (no intervention) and again in Week 2 (with the allocated intervention) for each subject, using a 'wrist actigraphy monitor'.
 Why is using a 'wrist actigraphy monitor' better than asking subjects for self-reported sleep duration?
The women in the two groups were compared. For example, the mean age of the women was \(30.6\) y (\(s = 3.6\)) (T) and \(30.1\) y (\(s = 3.3\)) (P); the \(P\)-value for the comparison was given as \(P = 0.56\).
 Why was this comparison made?
 Compute the standard error for the difference between the two mean ages.
 Compute the \(t\)-score for the test.
 Is the quoted \(P\)-value consistent with your calculations?
 What do these results mean?
 Is the result statistically valid?
Another comparison was the room 'condition' where the women slept: in the treatment group, \(13\) had a room with a fan (\(16\) had air conditioning), while in the control group \(10\) women had a fan (and \(17\) air conditioning). The \(P\)-value for the comparison was given as \(P = 0.60\).
 Why was this comparison made?
 Construct the two-way table summarising the data.
 The \(\chi^2\)-score for the test is \(0.35064\). Compute the equivalent \(z\)-score.
 Is the quoted \(P\)-value consistent with your calculations?
 What do these results mean?
 Is the result statistically valid?
For the subjects in the treatment group, the mean sleep duration in Week 1 was \(279.0\) mins (\(s = 18.9\)) and in Week 2 was \(303.6\) mins (\(s = 18.8\)). The increase was \(24.7\) mins (\(s = 14.9\)).
 Conduct the appropriate statistical test to determine if sleep duration increased in the treatment group.
 What do these results mean?
For the subjects in the control group, the mean sleep duration in Week 1 was \(286.3\) mins (\(s = 20.9\)) and in Week 2 was \(301.9\) mins (\(s = 21.8\); \(n = 26\) since one observation was not available). The increase was \(18.1\) mins (\(s = 17.3\)).
 Conduct the appropriate statistical test to determine if sleep duration increased in the control group.
 What do these results mean?
 What could explain the increase in sleep duration, despite the control group using an ineffective intervention?
The increase in sleep duration can be compared for the two groups.
 Compute the standard error for the difference between the mean increases, \(\text{s.e.}(\bar{x}_T - \bar{x}_P)\).
 Construct an approximate \(95\)% confidence interval for the difference in the increase in sleep duration for the two groups.
 Conduct a hypothesis test to compare the increase in sleep duration for the two groups. Interpret what this means.
 Is the test statistically valid?