Chapter 5 Week 11 – Nov 18 2021 class meeting

This week, our goals are to…

Present findings of a qualitative research project.

Interpret the results of logistic regression models.
Review quantitative research skills that we did and did not practice.

5.1 Before class

5.1.1 Checklist – Complete by Nov 18

By our class meeting on Thursday November 18 2021, you should complete the following tasks:

Finish preparing your presentation of qualitative research project findings. Please refer to the recent in-class qualitative activity for details. Remember that you should send your presentation slides to Anshul by noon on Monday November 15 2021 so he can give you any feedback before you present.
Quantitative Assignment #4
See this list of required deliverables, if you would like, to make sure you have completed everything.

This is all you need to do before we meet for class. If anything is unclear or you have any questions, do not hesitate to email Anshul at akumar@mghihp.edu or contact him by phone. He is also available by appointment to go over anything.

It is fine to work with others on the assignments (and sometimes it may even be required), but make sure you state who you worked with at the top of your assignment.

5.1.2 Quantitative Assignment #4

Please do this assignment on paper or on the computer, whichever you prefer, by the start of our next class together. Please turn it in to the appropriate D2L dropbox and have a copy with you in class (either a paper or an electronic copy is fine).

We will focus on the method of logistic regression, about which you watched a video in a previous assignment. As a reminder, logistic regression is used when our outcome of interest (dependent variable) falls into two²⁹ categories. Below, we will go through an example together, still using the GSSvocab data that we have used before.

Research question: Is native born status associated with gender, vocabulary score, and/or age?

Dependent variable: nativeBorn

Independent variables: female, vocab, age

In our research design above, we are asking a question whose outcome of interest (native born status) is binary, meaning that there are only two options, yes and no. Therefore, logistic regression is well suited to answering this question.

Below, the output of a logistic regression model that addresses this question is presented.

Observations	1856
Dependent variable	nativeBorn
Type	Generalized linear model
Family	binomial
Link	logit

χ²(3)	44.67
Pseudo-R² (Cragg-Uhler)	0.04
Pseudo-R² (McFadden)	0.03
AIC	1361.90
BIC	1384.00

	exp(Est.)	2.5%	97.5%	z val.	p
(Intercept)	1.58	0.90	2.77	1.59	0.11
female	0.85	0.64	1.13	-1.10	0.27
vocab	1.26	1.17	1.35	6.30	0.00
age	1.01	1.00	1.01	1.40	0.16
Standard errors: MLE

Before we interpret the results of this logistic regression, let’s recall how we interpret the results of a linear regression. In linear regression, we estimate one slope for each independent variable (X). Then we write a sentence like this: For each one-unit increase in X, Y is predicted to change by [slope], controlling for all other independent variables. We are trying to understand how Y (the dependent variable) changes each time we add 1 to X. The estimated slope tells us this, and it is additive: for a one-unit increase in X, the estimated slope is added to the predicted Y.
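The additive reading of a slope can be checked with a few lines of arithmetic. This is only a sketch: the intercept and slope below are hypothetical numbers chosen for illustration and do not come from any model in this chapter.

```python
def predict_linear(intercept, slope, x):
    """Predicted Y from a linear model with a single independent variable."""
    return intercept + slope * x


# Hypothetical values, for illustration only (not estimated from any data).
intercept = 2.0
slope = 0.5

y_at_10 = predict_linear(intercept, slope, 10)
y_at_11 = predict_linear(intercept, slope, 11)

# Adding 1 to X adds exactly one slope to the predicted Y.
change = y_at_11 - y_at_10
print(change)
```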

In logistic regression, we are also trying to understand how Y changes for each one-unit change in X. The difference is that Y is now expressed as the odds of one thing (like being native born) being the case rather than the other (like not being native born). The odds are the probability that something is the case divided by the probability that it is not; for example, a probability of 0.80 corresponds to odds of 0.80/0.20 = 4. And instead of calling our estimate a slope, we will call it an odds ratio. This time, unlike in linear regression, the estimate is multiplicative. For a one-unit increase in X, the odds of Y are multiplied by the odds ratio. Some examples follow.

Here’s how we interpret some of the odds ratios from the output above, looking at the exp(Est.) column in the output:

For each one-unit increase in female gender (which in this case represents a “transition” from male to female), the odds of being native born are multiplied by 0.85, controlling for vocabulary score and age. Another way to write this is that the odds of being native born for women are 0.85 times the odds for men, controlling for vocabulary score and age. Since 0.85 is less than 1, and when you multiply something by a number less than 1 it gets smaller, women are less likely to be native born than men.
For each one-unit increase in vocabulary score, the odds of being native born are multiplied by 1.26, controlling for gender and age. Since 1.26 is more than 1, and when you multiply something by a number greater than 1 it gets bigger, those with higher vocabulary scores are more likely to be native born than those with lower vocabulary scores.
For each one-year increase in age, the odds of being native born are multiplied by 1.01, controlling for gender and vocabulary score. This means that the relationship between native born status and age is nearly nonexistent, because if you multiply something by 1.01, it stays almost the same.
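To make the multiplication concrete, here is a minimal sketch in Python. Strictly speaking, an odds ratio multiplies the odds (the probability of the outcome divided by the probability of no outcome), not the probability itself. The odds ratio of 1.26 comes from the vocab row of the output above; the starting probability of 0.80 is a hypothetical value chosen only for illustration.

```python
def probability_to_odds(p):
    """Convert a probability (between 0 and 1) into odds."""
    return p / (1 - p)


def odds_to_probability(odds):
    """Convert odds back into a probability."""
    return odds / (1 + odds)


ODDS_RATIO_VOCAB = 1.26  # exp(Est.) for vocab in the output above

# Hypothetical starting point: a person with an 80% chance of being native born.
start_probability = 0.80
start_odds = probability_to_odds(start_probability)   # 0.80 / 0.20 = 4

# A one-unit increase in vocabulary score multiplies the odds by 1.26.
new_odds = start_odds * ODDS_RATIO_VOCAB               # about 5.04
new_probability = odds_to_probability(new_odds)        # about 0.83

print(round(new_odds, 2), round(new_probability, 2))
```

Notice that the probability rises from 0.80 to about 0.83. It is the odds, not the probability, that are multiplied by 1.26.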

Note that the odds ratios above tell us about the average relationships in our sample of 1856 observations (people) in the GSSvocab dataset. They do not tell us about the population that this sample represents. We will get to that later.

In linear regression, when the estimated slope (of the relationship between the dependent variable and an independent variable) is less than 0, it is a negative association. When this slope is greater than 0, it is a positive association. And when the slope is equal to 0, there is no association. This is because when we add a number greater than 0 to something, it increases. When we add a number less than 0 to something, it decreases. When we add 0 to something, it stays the same. So, for linear regression, the cut-off point between a negative and positive relationship is a slope of 0.

But for logistic regression, where we multiply instead of add, when the estimated odds ratio (of the relationship between the likelihood of the dependent variable and an independent variable) is less than 1, it is a negative association. When this odds ratio is greater than 1, it is a positive association. And when the odds ratio is equal to 1, there is no association. This is because when we multiply something by a number greater than 1, it increases. When we multiply something by a number less than 1, it decreases. When we multiply something by 1, it stays the same. So, for logistic regression, the cut-off point between a negative and positive relationship is an odds ratio of 1.

This table summarizes what is written above:

Type of association	Linear regression slope	Logistic regression odds ratio
positive	>0	>1
none	0	1
negative	<0	<1
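The table above can be expressed as one small rule with two different cut-off values. The sketch below is illustrative only; the function name and the example slope are hypothetical, while 0.85 and 1.26 are odds ratios from the output above. It also shows that taking the natural logarithm of an odds ratio moves the cut-off from 1 back to 0.

```python
import math


def association_direction(estimate, no_association_value):
    """Classify an estimate relative to its 'no association' value.

    Use no_association_value=0 for a linear regression slope
    and no_association_value=1 for a logistic regression odds ratio.
    """
    if estimate > no_association_value:
        return "positive"
    if estimate < no_association_value:
        return "negative"
    return "none"


slope_direction = association_direction(-2.5, 0)   # hypothetical slope
female_direction = association_direction(0.85, 1)  # odds ratio for female
vocab_direction = association_direction(1.26, 1)   # odds ratio for vocab

# The log of an odds ratio below 1 is negative; above 1 it is positive.
log_female = math.log(0.85)
log_vocab = math.log(1.26)

print(slope_direction, female_direction, vocab_direction)
```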

Now it’s time to look at some of the other parts of the regression output table. You’ll see columns labeled 2.5% and 97.5%. These show the 95% confidence interval of the estimated odds ratio (which is in the exp(Est.) column) in the population that this sample represents. Below is an interpretation of these confidence intervals for each of our three independent variables:

female: 0.64–1.13. We are 95% certain that in the population from which this sample was drawn, the odds ratio quantifying the relationship between the likelihood of being native born and being female (compared to being male) is no lower than 0.64, which would be a negative relationship. We are also 95% certain that in the population, the very same odds ratio is no higher than 1.13. Since 0.64 is less than 1, that suggests a possible negative relationship between likelihood of being native born and being female (compared to being male). Since 1.13 is greater than 1, that suggests a possible positive relationship between likelihood of being native born and being female (compared to being male). So, which one is it? Positive or negative (or no association)? We don’t know. We cannot say. So we conclude that we did not find any evidence of an association between native born status and gender.
vocab: 1.17–1.35. We are 95% certain that in the population from which this sample was drawn, the odds ratio quantifying the relationship between the likelihood of being native born and a one-unit increase in vocabulary score is no lower than 1.17, which would be a positive relationship. We are also 95% certain that in the population, the very same odds ratio is no higher than 1.35. Since 1.17 is more than 1, that suggests a possible positive relationship between likelihood of being native born and a one-unit increase in vocabulary score. Since 1.35 is greater than 1, that also suggests a possible positive relationship between likelihood of being native born and a one-unit increase in vocabulary score. This time, our 95% confidence interval is all above 1. So we conclude, with 95% confidence, that a positive association exists between native born status and vocabulary score.
age: 1.00–1.01. We are 95% certain that in the population from which this sample was drawn, the odds ratio quantifying the relationship between the likelihood of being native born and a one-unit increase in age is no lower than 1.00, which would be no relationship. We are also 95% certain that in the population, the very same odds ratio is no higher than 1.01. Since 1.00 is equal to 1, that suggests no relationship between likelihood of being native born and a one-unit increase in age. Since 1.01 is greater than 1, that suggests an extremely small possible positive relationship between likelihood of being native born and a one-unit increase in age. Our 95% confidence interval again includes 1 and is all extremely close to 1. So we conclude that we did not find any evidence of an association between native born status and age.
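The reasoning in the three interpretations above follows one rule: check whether the 95% confidence interval lies entirely above 1, entirely below 1, or includes 1. A minimal sketch of that rule follows; the function name is hypothetical, and the interval endpoints are the rounded values from the 2.5% and 97.5% columns of the output.

```python
def interpret_odds_ratio_interval(lower, upper):
    """Summarize what a 95% confidence interval for an odds ratio suggests."""
    if lower > 1:
        return "positive association"
    if upper < 1:
        return "negative association"
    # The interval includes 1, so the direction cannot be determined.
    return "no evidence of an association"


# Rounded 95% confidence intervals from the native born model output.
intervals = {
    "female": (0.64, 1.13),
    "vocab": (1.17, 1.35),
    "age": (1.00, 1.01),
}

conclusions = {
    name: interpret_odds_ratio_interval(lower, upper)
    for name, (lower, upper) in intervals.items()
}

for name, conclusion in conclusions.items():
    print(name, "->", conclusion)
```

Because the output rounds to two decimal places, an endpoint shown as 1.00 (as for age) is treated here as including 1, which matches the conclusion drawn in the text.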

Let’s turn to analyzing the goodness-of-fit of this logistic regression model. Keep in mind that for linear regression, we learned about the $R^2$ metric, which tells us the proportion of variation in the dependent variable that is explained by the variation in the independent variables. $R^2$ can be between 0 and 1 and a higher $R^2$ is better than a lower one.

For logistic regression, we cannot calculate an $R^2$ metric the same way. Instead, we can calculate something called $\text{Pseudo-}R^2$, which approximates what the $R^2$ would have been if this were a linear regression model. There are different ways of calculating $\text{Pseudo-}R^2$. Two versions, called Cragg-Uhler and McFadden, are given in the logistic regression output above. Read as a rough analogue of $R^2$, they tell us that the proportion of variation in the dependent variable that can be explained by the variation in the independent variables is only about 0.04 (Cragg-Uhler) or 0.03 (McFadden). That is very low.
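For the curious, both values can be recovered approximately from other numbers in the same output. The sketch below rests on several assumptions that the output itself does not state: that AIC equals 2k minus 2 times the model log-likelihood with k = 4 estimated parameters (the intercept plus three independent variables), that the χ² value is the likelihood-ratio statistic comparing the model to an intercept-only model, and that the standard McFadden and Cragg-Uhler (Nagelkerke) formulas are the ones used. The function name is hypothetical.

```python
import math


def pseudo_r2_from_summary(n_obs, chi_sq, aic, n_params):
    """Approximate McFadden and Cragg-Uhler pseudo-R^2 from summary statistics.

    Assumes AIC = 2 * n_params - 2 * logLik and that chi_sq is the
    likelihood-ratio statistic against an intercept-only model.
    """
    neg2_loglik_model = aic - 2 * n_params           # -2 * logLik of the fitted model
    neg2_loglik_null = neg2_loglik_model + chi_sq    # -2 * logLik of the intercept-only model

    mcfadden = 1 - neg2_loglik_model / neg2_loglik_null

    cox_snell = 1 - math.exp(-chi_sq / n_obs)
    cox_snell_max = 1 - math.exp(-neg2_loglik_null / n_obs)
    cragg_uhler = cox_snell / cox_snell_max

    return mcfadden, cragg_uhler


# Values taken from the native born model output above.
mcfadden, cragg_uhler = pseudo_r2_from_summary(
    n_obs=1856, chi_sq=44.67, aic=1361.90, n_params=4
)
print(round(mcfadden, 2), round(cragg_uhler, 2))
```

Under these assumptions, the rounded results agree with the 0.03 (McFadden) and 0.04 (Cragg-Uhler) reported in the output.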

Now it’s your turn to interpret some logistic regression results! Here’s a new scenario: we have a dataset of 403 observations (people) who have been tested for diabetes and for whom we know many other characteristics. Here is a preview of the first few rows of this data:

diabetes_diagnosis	age	gender	BMI
no diabetes	46	female	22.12877
no diabetes	29	female	37.41553
no diabetes	58	female	48.36549
no diabetes	67	male	18.63600
diabetes	64	male	27.82202
no diabetes	34	male	26.49673
no diabetes	30	male	28.20269
no diabetes	37	male	34.33209
no diabetes	45	male	24.51124
no diabetes	55	female	35.77879
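Before a logistic regression can be run, text categories like these are typically represented as 0/1 indicator variables. The sketch below shows one way to do that for the first five preview rows. The function name and the coding choices (1 for diabetes, 1 for female) are assumptions made for illustration; the label genderfemale mirrors the row label that appears in the model output in this assignment.

```python
# First five rows of the data preview above: (diabetes_diagnosis, age, gender, BMI)
preview_rows = [
    ("no diabetes", 46, "female", 22.12877),
    ("no diabetes", 29, "female", 37.41553),
    ("no diabetes", 58, "female", 48.36549),
    ("no diabetes", 67, "male", 18.63600),
    ("diabetes", 64, "male", 27.82202),
]


def encode_row(diagnosis, age, gender, bmi):
    """Turn one row into numbers, using 0/1 indicators for the two categories."""
    return {
        "diabetes": 1 if diagnosis == "diabetes" else 0,    # assumed coding: 1 = diabetes
        "age": age,
        "genderfemale": 1 if gender == "female" else 0,     # assumed coding: 1 = female, 0 = male
        "BMI": bmi,
    }


encoded_rows = [encode_row(*row) for row in preview_rows]

for row in encoded_rows:
    print(row)
```

With this coding, a one-unit increase in genderfemale represents the difference between a male (0) and a female (1) participant.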

Research question: Is diabetes diagnosis associated with age, gender, and/or BMI?

Dependent variable: diabetes_diagnosis

Independent variables: age, gender, BMI

In our research design above, we are asking a question whose outcome of interest (diabetes diagnosis) is binary, meaning that there are only two options, no diabetes and diabetes. Therefore, logistic regression is well suited to answering this question.

Below, the output of a logistic regression model that addresses this question is presented.

Observations	384 (19 missing obs. deleted)
Dependent variable	diabetes_diagnosis
Type	Generalized linear model
Family	binomial
Link	logit

χ²(3)	44.40
Pseudo-R² (Cragg-Uhler)	0.19
Pseudo-R² (McFadden)	0.14
AIC	289.62
BIC	305.43

	exp(Est.)	2.5%	97.5%	z val.	p
(Intercept)	0.00	0.00	0.01	-6.91	0.00
age	1.06	1.04	1.08	5.61	0.00
genderfemale	0.78	0.42	1.47	-0.76	0.45
BMI	1.08	1.03	1.13	3.17	0.00
Standard errors: MLE

Please answer the following questions about the logistic regression output above, which addresses our research question about diabetes.

Task 1: For just our sample, interpret the odds ratio for age.

Task 2: What do we know about the relationship between the likelihood of having diabetes and age, in the population that our sample represents?

Task 3: For just our sample, interpret the odds ratio for genderfemale.

Task 4: What do we know about the relationship between the likelihood of having diabetes and gender, in the population that our sample represents?

Task 5: For just our sample, interpret the odds ratio for BMI.

Task 6: What do we know about the relationship between the likelihood of having diabetes and BMI, in the population that our sample represents?

Task 7: How well does our logistic regression model fit our data?

You have reached the end of this week’s quantitative homework assignment.

5.2 In class

5.2.1 Schedule

November 18 2021

1:00 p.m. – Brief introduction, start quantitative methods quiz
1:20 p.m. – Qualitative research project presentations (15 minutes per team)
2:20 p.m. – Break
2:25 p.m. – Quantitative activity (choice between review of methods so far or introduction to predictive analytics)
2:50 p.m. – End of class

5.2.2 Qualitative project presentations

Each team will present for 10 minutes and then spend 5 minutes answering questions, according to the following schedule.

Start time	Team
1:20 p.m.	Nucleo-tides
1:35	MJK
1:50	Genie Beanies
2:05	Locus Pocus

5.2.3 Quantitative Activity

In this activity, we will have the option to choose as a class between reviewing methods we have learned together so far or learning about the basics of predictive analytics.

5.2.3.1 Option 1 – Review of quantitative methods process

In this activity, we will spend 25 or 30 minutes together in a discussion, without splitting into groups, to answer any questions you have and synthesize some of the quantitative methodology that we have covered in the last few weeks.

So far, these are some of the quantitative methods we have learned to describe and/or analyze our data:

Histograms and bar charts
Scatterplots
Descriptive statistics
T-tests
Pivot tables
Linear regression in Excel and interpretation of resulting output
Logistic regression interpretation

And here are some important topics that we did not have time to cover:

Testing assumptions for t-tests, linear regression, and logistic regression.
Relating quantitative research questions to quantitative research methods.
Handling clustered data.
Detecting and accounting for bias in our quantitative results.

Together today, we will do as much of the following procedure as we can. You should bring up any questions you have at any point.

Step 1: Identify research questions that are relevant to genetic counseling research that might require quantitative methods to answer.

Step 2: Pick one or two research questions to examine carefully.

Step 3: Identify the outcome(s) of interest for the selected research question(s). At least one should be continuous numeric and one should be binary.

Step 4: Describe the data we would need to answer the research question(s). How would this data be structured?

Step 5: What type of quantitative analysis tool(s) would help us answer our selected research question(s)?

Step 6: What would be the capabilities and limitations of this approach?

5.2.3.2 Option 2 – Introduction to predictive analytics

Learning objectives

To the extent our time together permits, our goals are to…

Gain exposure to questions that can be answered using predictive analytics and machine learning.
Distinguish between questions answered using traditional statistics and predictive analytics.
Brainstorm about how to leverage data to predict outcomes.
Raise ethical concerns and questions about using analytics to solve problems.

Activity outline

Look at data together in Excel.
Identify a traditional statistics research question that can be answered with this data.
Identify a predictive analytics research question that can be answered with this data.
Simulate the analytics process in Excel.
Discuss results, applications of results, and ethical challenges.

²⁹ Sometimes more than two, but typically just two.