So far, you have learnt to ask an RQ, identify different ways of obtaining data, and design the study.
In this chapter, you will learn how to ensure that the conclusions drawn from a study are logical and sound. You will learn to:
- identify issues that might negatively impact internal validity.
- identify issues that might impact the values of the response variables.
- identify extraneous, confounding and lurking variables.
Consider the letter-typing RQ again (from Example 5.4), where this RQ was posed:
For students in this course this semester, is the average number of letters typed per minute on a keyboard the same for females and males?
For this study:
- P: Students in this course this semester.
- O: Average number of letters typed per minute (say, 'typing speed').
- C: Between females and males.
- I: None. (The values of C cannot be allocated to students).
After measuring the typing speed (the response variable) of many individuals, a lot of variation will be observed in the values collected: It is very unlikely that every student in the study will record the same typing speed.
The measured typing speeds can be influenced by many issues (Fig. 6.1):
- The explanatory variable (Sect. 6.3): The values of the explanatory variable may or may not influence the values of the response variable; finding out is the purpose of the study. In this example, the explanatory variable is the sex of the student.
- Other variables (Sect. 6.4): Other variables that aren't the focus of the study may influence the response variable (and perhaps have more influence than the explanatory variable of interest) such as 'age' or 'whether or not the person wears glasses'. We can work with these other variables if we are careful.
- Design issues (Sect. 6.5): The way in which the study is designed can also influence the values of the response variable. These can mean disaster if not handled properly.
- Chance, or randomness (Sect. 6.6): Even the same person doing the same thing repeatedly will not record exactly the same typing speed on every attempt. This is unavoidable, but we can live with it if we have some idea of how large this variation is.
The purpose of the study is to explore the relationship between the response variable and the explanatory variable... but, as we just saw, many other issues can obscure that relationship.
Recall that internal validity refers to the strength of the conclusions that support the relationship between the response and explanatory variables. High internal validity means that other explanations have been ruled out: changes in the response variable can be attributed to changes in the explanatory variables.
That is, an internally-valid study is one where the impacts of other possible explanations for that association (such as extraneous variables, design issues, and chance) have been accounted for, minimised, or are well-managed.
Internal validity is one of the most important properties of scientific studies, and is an important concept in reasoning about evidence more generally. Internal validity is determined by how well a study can rule out alternative explanations for its findings.
Anything that compromises the internal validity is called a bias. Bias refers to any misrepresentation that can lead to a false conclusion, and may occur intentionally or unintentionally.
Definition 6.1 (Bias) Bias compromises the results or inferences in a study, which may lead to inaccurate conclusions.
Bias can be introduced at any part of the research process, and may occur consciously or unconsciously.
The goal of study design is to maximise internal validity: to design a study to isolate the relationship of interest, by eliminating, as well as possible, all other possible explanations.
Bias may occur during:
The Catalogue of Bias lists over 60 different ways that studies can be biased, though in this book we will focus on only a small number of possible biases to consider when designing studies.
We have already met one type of bias: selection bias.
The explanatory variable may be associated with changes in the values of the response variable. However, it may not; after all, determining this (or the extent of this) is the purpose of the study.
If nothing else influenced the values of the response variable, life would be easy: Any change of a given size in the value of the explanatory variable would always result in a change of the same size in the value of the response variable.
Example 6.1 (Explanatory variable) In the typing-speed study (Sect. 6.1), the explanatory variable is the sex of the person. If nothing else influenced typing speed, all females would record the same typing speed every time, and all males would record the same typing speed every time. This is clearly unreasonable.
Other variables probably exist which are associated with changes in the value of the response variable; these are called extraneous variables.
Definition 6.2 (Extraneous variable) An extraneous variable is any variable that is (potentially) associated with the response variable, but is not the explanatory variable.
Example 6.2 In the typing-speed study (Sect. 6.1), potential extraneous variables may include age, the presence or absence of certain medical conditions, the level of familiarity with computers, etc.
All extraneous variables are, by definition, related to the response variable. An extraneous variable may or may not be associated with the explanatory variable as well. Extraneous variables may have other names too (Table 6.1), though these names are used inconsistently by researchers.147
When an extraneous variable is also related to the explanatory variable, the extraneous variable is called a confounding variable. A confounding variable can obscure the observed relationship between the response and explanatory variables (i.e., confounding variables can bias the results).
Definition 6.3 (Confounding variable) A confounding variable (or a confounder) is an extraneous variable associated with the response and explanatory variables (Fig. 6.3).
Definition 6.4 (Confounding) Confounding is when a third variable influences the observed relationship between the response and explanatory variable.
The problem with confounding is that a relationship between the response and explanatory variables may be evident, but only because both of these variables are related to the confounding variable (Fig. 6.3).
Example 6.3 (Confounding variables) A relationship exists between carrying cigarette lighters, and people contracting lung cancer: people who carry cigarette lighters are more likely to get lung cancer.
The only reason that this relationship exists is because of a confounding variable: whether or not the person is a smoker.
A smoker is more likely to carry a cigarette lighter than a non-smoker, and a smoker is also more likely to develop lung cancer than a non-smoker.
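The reasoning in this example can be checked with a small simulation. The sketch below is illustrative only: the function names are hypothetical, and every probability is a made-up value chosen for the demonstration, not real data. In the simulation, smoking status is the only thing that influences both carrying a lighter and developing lung cancer; carrying a lighter never affects lung cancer directly.

```python
import random

def simulate_people(n, seed=1):
    """Simulate n people as (smoker, lighter, cancer) tuples.

    All probabilities are made-up illustrative values (assumptions).
    """
    rng = random.Random(seed)
    people = []
    for _ in range(n):
        smoker = rng.random() < 0.25
        # Carrying a lighter depends only on smoking status.
        lighter = rng.random() < (0.80 if smoker else 0.05)
        # Lung cancer depends only on smoking status, never on the lighter.
        cancer = rng.random() < (0.15 if smoker else 0.01)
        people.append((smoker, lighter, cancer))
    return people

def cancer_rate(people, lighter, smoker=None):
    """Proportion with cancer among people with the given lighter status.

    If smoker is given, only people with that smoking status are counted.
    """
    group = [p for p in people
             if p[1] == lighter and (smoker is None or p[0] == smoker)]
    return sum(p[2] for p in group) / len(group)

people = simulate_people(200_000)

# Ignoring smoking status: lighter carriers appear to have a higher cancer rate.
print(f"Everyone:    lighter {cancer_rate(people, True):.3f}, "
      f"no lighter {cancer_rate(people, False):.3f}")

# Comparing within each smoking group: the apparent relationship largely disappears.
for smoker, label in ((True, "Smokers:    "), (False, "Non-smokers:")):
    print(f"{label} lighter {cancer_rate(people, True, smoker):.3f}, "
          f"no lighter {cancer_rate(people, False, smoker):.3f}")
```

When smokers and non-smokers are examined separately, the cancer rates for those with and without a lighter should be close to each other, which is what we expect if smoking status is the confounding variable.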
Managing confounding is very important, as confounding can completely change the relationship between the response and explanatory variables (see the example in Sect. 14.1) and hence can compromise internal validity.
Sometimes confounding variables are not measured, assessed, described or recorded; these confounding variables are then called lurking variables (Fig. 6.4). Failure to acknowledge lurking variables can lead to wrong conclusions (for example, see Sect. 14.1).
Definition 6.5 (Lurking variable) A lurking variable is an extraneous variable associated with the response and explanatory variables (that is, is a confounding variable), but whose values are not measured, assessed, described or recorded in the study.
Example 6.4 (Lurking variables) Consider the relationship between carrying cigarette lighters, and developing lung cancer (Example 6.3).
In this study, we could define:
- the response variable as "whether or not a person gets lung cancer"; and
- the explanatory variable as "whether or not a person carries a cigarette lighter".
Now consider the variable "whether or not a person is a smoker". This variable is associated with the response variable (people who smoke are more likely to get lung cancer than those who do not smoke) and with the explanatory variable (people who smoke are more likely to carry a cigarette lighter than those who do not smoke).
Hence, if that information was recorded by the researchers, it would be called a confounding variable.
In contrast, if it was not recorded by the researchers, it would be called a lurking variable (Fig. 6.5).
Now consider the variable "whether or not the person worked closely with someone who smoked". This variable is possibly associated with the response variable (someone who works closely with a smoker would be slightly more likely to get lung cancer ('passive smoking') than someone who does not),148 but is very unlikely to be associated with carrying a cigarette lighter (whether or not someone carries a cigarette lighter probably doesn't depend on whether or not they work closely with a smoker).
Hence, if that information was recorded, it would be an extraneous variable (but not a confounding variable).
If that variable was not recorded, the variation it produces in the response variable would just end up as part of the chance variation.
To clarify (Table 6.1):
- Extraneous variables are all related to the response variable, by definition.
- Some extraneous variables are also called confounding variables if they are also related to the explanatory variable.
- Some confounding variables are also called lurking variables if they are not measured, assessed, described or recorded.
Some unknown extraneous variables will be associated with the response variable only, and so become part of variation due to chance (i.e., unexplained). These terms are not always used consistently by all researchers.149
| Type | Associated with response only | Associated with response and explanatory |
| Measured or observed | No special name: extraneous | Confounding (not lurking) |
| Not measured or observed | Becomes part of 'chance' | Lurking |
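The naming rules in Table 6.1 can be written as a small decision procedure. The sketch below is only one way of encoding the table; the function name `classify_variable` and its arguments are hypothetical, and (as noted above) other researchers may use these terms differently.

```python
def classify_variable(related_to_response, related_to_explanatory, recorded):
    """Name a variable (other than the explanatory variable) using Table 6.1.

    related_to_response: is the variable associated with the response variable?
    related_to_explanatory: is it associated with the explanatory variable?
    recorded: was it measured, assessed, described or recorded in the study?
    """
    if not related_to_response:
        # Extraneous variables are, by definition, related to the response.
        return "not extraneous"
    if related_to_explanatory:
        # A lurking variable is a confounding variable that was not recorded.
        return "confounding (not lurking)" if recorded else "lurking"
    return "extraneous" if recorded else "part of chance variation"

# The variables from Example 6.4 (lighters and lung cancer):
print(classify_variable(True, True, recorded=True))    # smoker, recorded
print(classify_variable(True, True, recorded=False))   # smoker, not recorded
print(classify_variable(True, False, recorded=True))   # works with a smoker, recorded
print(classify_variable(True, False, recorded=False))  # works with a smoker, not recorded
```

In practice, whether a variable is associated with the response or explanatory variable is a judgement (or something to be checked with data), so the three inputs are the hard part; the function only records the naming convention.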
To avoid lurking variables, researchers generally collect lots of information about the individuals in the study (such as age and sex if the study involves people) and circumstances of the individuals in the study (such as the temperature at the time of data collection) that may be relevant, in case they are confounding variables.
Example 6.5 (Lurking variables) Consider the relationship between the number of fatalities in an incident, and number of paramedics sent to the incident.
'Severity of the incident' is the lurking variable, since more severe incidents would have more paramedics attending (in general), and also have more fatalities (in general).
Can you think of any other possible extraneous variables in the letter-typing study (Sect. 6.1)?
Many aspects of the study design can influence the observed relationship between the response and explanatory variable (i.e., can bias the results).
Good design principles can be used to minimise the impact of these as much as possible, so the focus is on the influence of the explanatory variable on the response variable.
Since aspects of the study design may be under the control of the researchers, study design is very important for reducing bias and improving internal validity. The study design principles are discussed at length soon (Chaps. 7 and 8).
Example 6.6 (Design) The typing-speed study (Sect. 6.1) could be poorly designed.
For example, if females were always asked to use their dominant hand, and males always asked to use their non-dominant hand, the comparison would not be equivalent for females and males.
Females would probably have a faster average typing speed, simply because they are using their dominant hands.
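A short simulation shows how such a design flaw can create a difference that has nothing to do with sex. The sketch below is hypothetical throughout: it assumes females and males have exactly the same distribution of typing speeds, and that using the non-dominant hand slows anyone down by a made-up 30 letters per minute.

```python
import random
import statistics

def typing_speed(rng, dominant_hand):
    """Simulated typing speed (letters per minute) for one person.

    Assumption: the same speed distribution applies to everyone, and the
    non-dominant hand is slower by a made-up 30 letters per minute.
    """
    base = rng.gauss(200.0, 30.0)
    return base if dominant_hand else base - 30.0

def mean_difference(females_dominant, males_dominant, n=2000, seed=3):
    """Mean female speed minus mean male speed under a given design."""
    rng = random.Random(seed)
    females = [typing_speed(rng, females_dominant) for _ in range(n)]
    males = [typing_speed(rng, males_dominant) for _ in range(n)]
    return statistics.mean(females) - statistics.mean(males)

# Poor design: females use their dominant hand, males their non-dominant hand.
biased = mean_difference(females_dominant=True, males_dominant=False)
# Better design: everyone uses their dominant hand.
fair = mean_difference(females_dominant=True, males_dominant=True)

print(f"Difference in means, poor design:   {biased:.1f}")
print(f"Difference in means, better design: {fair:.1f}")
```

Under the poor design, the difference in means reflects which hand was used rather than any difference between females and males, so a conclusion about sex would be biased.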
Natural (chance) variation refers to variation that cannot otherwise be explained: even repeating a study exactly the same way every time on the same individuals will not always produce the same values of the response variable. This is called natural variation, chance variation, or just chance.
Natural variation makes the influence of the explanatory variable (which we are wanting to study) hard to detect, so minimising chance variation is important. Minimising the amount of chance variation requires using good design principles, and measuring as many of the extraneous variables that may explain variation in the response variable as is reasonable.
Chance can affect the values of the response variable in two ways: each individual can produce different values of the response variable each time the individual repeats the study (within-individuals variation), and different individuals in the study can produce different values of the response variable from one another (between-individuals variation):
- To estimate the amount of variation within individuals: Many observations are needed from each unit of analysis (individual).
- To estimate the amount of variation between individuals: Many units of analysis (individuals) are needed.
Since between-individuals variation is usually larger than within-individuals variation, using many individuals is usually more important than using a smaller number of individuals many times each.
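The two sources of chance variation can be illustrated by simulating typing speeds. This is a minimal sketch under stated assumptions: the function names are hypothetical, and the mean (200 letters per minute) and the two standard deviations (40 between individuals, 10 within individuals) are invented values, not estimates from any study.

```python
import random
import statistics

def simulate_speeds(n_people, n_attempts, between_sd=40.0, within_sd=10.0, seed=2):
    """Return one list of typing speeds (one per attempt) for each person.

    Assumption: each person has their own typical speed, and each attempt
    varies randomly around that typical speed.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n_people):
        personal_mean = rng.gauss(200.0, between_sd)
        data.append([rng.gauss(personal_mean, within_sd)
                     for _ in range(n_attempts)])
    return data

def within_sd_estimate(data):
    """Pooled standard deviation of attempts around each person's own mean."""
    variances = [statistics.variance(attempts) for attempts in data]
    return statistics.mean(variances) ** 0.5

def between_sd_estimate(data):
    """Standard deviation of the per-person mean speeds.

    This is slightly inflated by within-person noise in each mean.
    """
    return statistics.stdev([statistics.mean(attempts) for attempts in data])

data = simulate_speeds(n_people=200, n_attempts=5)
print(f"Within-individuals SD estimate:  {within_sd_estimate(data):.1f}")
print(f"Between-individuals SD estimate: {between_sd_estimate(data):.1f}")
```

Note that `statistics.variance` needs at least two values, so the within-individuals estimate cannot be computed from one attempt per person, and `statistics.stdev` of the per-person means needs at least two people. This mirrors the two points listed above.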
Consider the letter-typing study (Sect. 6.1) again. What are the advantages and disadvantages of:
- measuring one female 30 times?
- measuring 30 different females once each?
- measuring 10 different females three times each?
In a research study, we are usually exploring relationships between a response variable and an explanatory variable. However, the values of the response variable can be influenced by things other than the explanatory variable, such as other variables that aren't really of interest (extraneous variables), the study design, and chance. Some extraneous variables are also related to the explanatory variable, and are called confounding variables (and are lurking variables if they are not measured, assessed, described or recorded).
If the study design makes it difficult to separate the relationship between the response and explanatory variable from other possible causes, the study has poor internal validity.
The Giant Mine in Yellowknife, Canada, ceased operation in 1999 after operating for 50 years, during which 237,000 tonnes of arsenic trioxide was released.
One study150 examined the arsenic concentration in lake water from 25 lakes within a 25 km radius of the mine (11 years after the mine closed), to determine if the arsenic concentration was related to the distance of the lake from the mine.
They also recorded the type of bedrock (volcanic; sedimentary; granodiorite), the ecology type (lowland; upland), the elevation of the lake (in metres), the lake area (in hectares), and the catchment area (in hectares).
- What is the response variable?
- What is the explanatory variable?
- Is the variable "Catchment area" likely to be a lurking variable?
- Is the variable "Type of bedrock" likely to be a confounding variable?
- What is the best description of the variable "Ecology type"?
- What type of study is this?
Selected answers are available in Sect. D.6.
Exercise 6.1 A study examined the relationship between diet quality and depression in Australian adolescents.151 The researchers used a sample of 7114 adolescents aged 10 to 14 years in their study, and also measured information about:
...age, gender, socioeconomic status, parental education, parental work status, family conflict, poor family management, dieting behaviours, body mass index, physical activity, and smoking...
After identifying the response and explanatory variables, which of these listed variables could reasonably be considered extraneous variables, confounding variables and lurking variables?
Exercise 6.2 A newspaper article153 reported on a study that found
Women who drank green tea at least three times a week were 14 per cent less likely to develop a cancer of the digestive system.
However, the final paragraph of the article notes that:
Nobody can say whether green tea itself is the reason, since green tea lovers are often more health-conscious in general.
Identify the explanatory and response variables, and explain that final sentence using terms introduced in this chapter.
Exercise 6.3 A study recorded the lung capacity (measured as Forced Expiratory Volume, or FEV, in litres) of children aged 3 to 19,154 and also recorded whether or not the children were smokers. One finding from the data155 is that children who smoke have a larger average FEV (i.e. larger average lung capacity) than children who do not smoke.
Name a confounding variable that may explain this surprising finding. Would it be likely that this variable is a lurking variable?