3.4 Translating a Question into a Data Problem

Another aspect to consider when you’re developing your question is what will happen when you translate it into a data problem. Every question must be operationalized as a data analysis that leads to a result. Pausing to think through what the results of the data analysis would look like and how they might be interpreted is important because it can keep you from wasting a lot of time on an analysis whose result is not interpretable. Although we will discuss many examples of questions that lead to interpretable and meaningful results throughout the book, it may be easiest to start by thinking about what sorts of questions don’t lead to interpretable answers.

The most common kind of question that does not meet this criterion is one that uses inappropriate data. For example, your question may be whether taking a vitamin D supplement is associated with fewer headaches, and you plan on answering that question by using the number of times a person took a pain reliever as a marker of the number of headaches they had. You may find an association between taking vitamin D supplements and taking less pain reliever medication, but it won’t be clear what the interpretation of this result is. In fact, it is possible that people who take vitamin D supplements also tend to be less likely to take other over-the-counter medications just because they are “medication avoidant,” and not because they are actually getting fewer headaches. It may also be that they are using less pain reliever medication because they have less joint pain, or other types of pain, but not fewer headaches. Another interpretation, of course, is that they are indeed having fewer headaches, but you can’t determine whether this interpretation or one of the others is correct. In essence, the problem with this question is that for a single possible answer, there are multiple interpretations. This scenario of multiple interpretations arises when at least one of the variables you use (in this case, pain reliever use) is not a good measure of the concept you are truly after (in this case, headaches). To head off this problem, you will want to make sure that the data available to answer your question provide reasonably specific measures of the factors required to answer your question.

A related problem that interferes with interpretation of results is confounding. Confounding is a potential problem when your question asks about the relationship between factors, such as taking vitamin D and frequency of headaches. Briefly, confounding is present when a factor that you were not necessarily considering in your question is related to both your exposure of interest (in the example, taking vitamin D supplements) and your outcome of interest (frequency of headaches). For example, income could be a confounder, because it may be related to both taking vitamin D supplements and frequency of headaches, since people with higher income may tend to be more likely to take a supplement and less likely to have chronic health problems, such as headaches. Generally, as long as you have income data available to you, you will be able to adjust for this confounder and reduce the number of possible interpretations of the answer to your question. As you refine your question, spend some time identifying the potential confounders and thinking about whether your dataset includes information about these potential confounders.
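
To make the idea of adjusting for a confounder concrete, here is a minimal simulation sketch in Python. Everything in it is an assumption made for illustration: the probabilities are invented, income is reduced to a high/low flag, and the function names (simulate, headache_rate, risk_difference) are hypothetical. In this simulated world the supplement has no effect on headaches at all, yet a crude comparison suggests that it does. Comparing people within the same income group, which is one simple way to adjust, makes most of that apparent effect disappear.

```python
import random


def simulate(n=20000, seed=0):
    """Simulate (high_income, takes_vitamin_d, has_headaches) for n people.

    Assumption for illustration: income influences both supplement use and
    headaches, but the supplement itself has no effect on headaches.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        high_income = rng.random() < 0.5
        takes_vitamin_d = rng.random() < (0.6 if high_income else 0.2)
        has_headaches = rng.random() < (0.2 if high_income else 0.5)
        rows.append((high_income, takes_vitamin_d, has_headaches))
    return rows


def headache_rate(rows, takes_vitamin_d, high_income=None):
    """Proportion with headaches among supplement users or non-users,
    optionally restricted to a single income group."""
    group = [
        r for r in rows
        if r[1] == takes_vitamin_d
        and (high_income is None or r[0] == high_income)
    ]
    return sum(r[2] for r in group) / len(group)


def risk_difference(rows, high_income=None):
    """Headache rate in supplement users minus the rate in non-users."""
    return (headache_rate(rows, True, high_income)
            - headache_rate(rows, False, high_income))


rows = simulate()
crude = risk_difference(rows)                       # ignores income
low_income = risk_difference(rows, high_income=False)
high_income = risk_difference(rows, high_income=True)

print("crude difference:         ", round(crude, 3))
print("within low-income group:  ", round(low_income, 3))
print("within high-income group: ", round(high_income, 3))
```

The crude difference is clearly negative, while the two within-group differences are close to zero. If income had not been recorded, there would be no way to separate these two pictures from the data alone.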

Another type of problem that can occur when inappropriate data are used is that the result is not interpretable because the underlying way in which the data were collected leads to a biased result. For example, imagine that you are using a dataset created from a survey of women who had had children. The survey includes information about whether their children had autism and whether they reported eating sushi while pregnant, and you see an association between report of eating sushi during pregnancy and having a child with autism. However, because women who have had a child with a health condition recall the exposures, such as raw fish, that occurred during pregnancy differently than those who have had healthy children, the observed association between sushi exposure and autism may just be the manifestation of a mother’s tendency to focus more on events during pregnancy when she has a child with a health condition. This is an example of recall bias, but there are many types of bias that can occur.
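
Recall bias can be sketched the same way. In the hypothetical simulation below, true sushi consumption is independent of the child's condition, but mothers of children with a health condition are assumed to remember and report past exposures more completely. All of the probabilities are invented for illustration, and the function name is hypothetical.

```python
import random


def reported_exposure_rates(n=100000, seed=3):
    """Return the proportion of mothers who REPORT eating sushi,
    keyed by whether their child has the condition (True or False)."""
    rng = random.Random(seed)
    reports = {True: [], False: []}
    for _ in range(n):
        # True exposure and outcome are generated independently.
        ate_sushi = rng.random() < 0.3
        child_has_condition = rng.random() < 0.1
        # Assumption: recall is more complete when the child has a condition.
        p_recall = 0.9 if child_has_condition else 0.6
        reports_sushi = ate_sushi and rng.random() < p_recall
        reports[child_has_condition].append(reports_sushi)
    return {k: sum(v) / len(v) for k, v in reports.items()}


rates = reported_exposure_rates()
print("reported sushi, child with condition:   ", round(rates[True], 3))
print("reported sushi, child without condition:", round(rates[False], 3))
```

The reported exposure is higher among mothers of affected children even though the true exposure is identical in both groups, so an analysis based on the survey responses would show an association that does not exist.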

The other major bias to understand and consider when refining your question is selection bias, which occurs when the data you are analyzing were collected in such a way as to inflate the proportion of people who have both characteristics above what exists in the general population. If a study advertised that it was a study about autism and diet during pregnancy, then it is quite possible that women who both ate raw fish and had a child with autism would be more likely to respond to the survey than those who had only one of these conditions or neither of them. This scenario would lead to a biased answer to your question about mothers’ sushi intake during pregnancy and risk of autism in their children. A good rule of thumb is that if you are examining relationships between two factors, bias may be a problem if you are more (or less) likely to observe individuals with both factors because of how the population was selected, or how a person might recall the past when responding to a survey. There will be more discussion about bias in subsequent chapters (Inference: A Primer and Interpreting Your Results), but the best time to consider its effects on your data analysis is when you are identifying the question you will answer and thinking about how you are going to answer the question with the data available to you.
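
The rule of thumb about observing too many individuals with both factors can be illustrated with a small simulation. This is only a sketch under stated assumptions: the exposure and outcome are generated independently, the prevalences and response probabilities are invented, and the function names are hypothetical. People with both factors are assumed to be three times as likely to respond to the survey as everyone else.

```python
import random


def odds_ratio(rows):
    """Odds ratio from a list of (exposed, outcome) boolean pairs."""
    a = sum(1 for e, o in rows if e and o)
    b = sum(1 for e, o in rows if e and not o)
    c = sum(1 for e, o in rows if not e and o)
    d = sum(1 for e, o in rows if not e and not o)
    return (a * d) / (b * c)


def simulate_population(n=200000, seed=1):
    """Exposure and outcome are independent, so the true odds ratio is 1."""
    rng = random.Random(seed)
    return [(rng.random() < 0.3, rng.random() < 0.1) for _ in range(n)]


def survey_respondents(population, seed=2):
    """Keep each person with a probability that depends on their factors.

    Assumption: people with BOTH factors respond with probability 0.6,
    everyone else with probability 0.2.
    """
    rng = random.Random(seed)
    sample = []
    for exposed, outcome in population:
        p_respond = 0.6 if (exposed and outcome) else 0.2
        if rng.random() < p_respond:
            sample.append((exposed, outcome))
    return sample


population = simulate_population()
sample = survey_respondents(population)

population_or = odds_ratio(population)
sample_or = odds_ratio(sample)
print("odds ratio in full population:", round(population_or, 2))
print("odds ratio among respondents: ", round(sample_or, 2))
```

The odds ratio in the full population is close to 1, as it should be for independent factors, while the odds ratio among respondents is far above 1. The association in the survey data is created entirely by who chose to respond.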