Chapter 5 Week 11 – Nov 18 2020 class meeting

This week, our goals are to…

Present findings of a qualitative research project.
Review the use of pivot tables in Excel.
Interpret the results of logistic regression models.
Relate quantitative research questions to quantitative research methods.

5.1 Before class

5.1.1 Checklist – Complete by Nov 18

By our class meeting on Wednesday, November 18, 2020, you should complete the following tasks:

Finish preparing your presentation of qualitative research project findings. Please refer to the Week 10 in-class Qualitative Activity for details. Remember that you should send your presentation slides to Anshul by the evening of Tuesday November 17 2020 so he can give you any feedback before you present.²¹
Quantitative Assignment #4
See this list of required deliverables to date, if you would like.

This is all you need to do before we meet for class. If anything is unclear or you have any questions, do not hesitate to email Anshul at akumar@mghihp.edu or contact me by phone. I am also available by appointment to meet to go over anything.

It is fine to work with others on the assignments (and sometimes it may even be required), but make sure you state who you worked with at the top of your assignment.

5.1.2 Quantitative Assignment #4

Please do this assignment on the computer and come to class on Wednesday, November 18, 2020, at 10 a.m., ready to turn it in.

5.1.2.1 Review of pivot tables

In our Week 10 Quantitative Activity, we practiced making pivot tables in Excel. In the first part of this quantitative assignment, we will again review this procedure, since it is a very important skill.

Please refer to the following two resources:

Pivot Tables. Easy Excel. Click here. This is the same resource that we used in class.
One-way and two-way pivot tables in Excel. Click here or see embedded below. This is a video that demonstrates the process of making pivot tables. I have attempted to embed it within this e-book, just below.

For this part of the assignment, we will again use the GSSvocab data that we used before. It is located in D2L in Course Materials -> Content -> Week 7, 8, 10, 11: Quantitative and Qualitative Methodology. Please open this file in Excel on your computer.

Task 1: Create a one-way pivot table (simple pivot table with just one variable) showing the counts for the variable vocab.

Task 2: How many people in the data have a vocab score of 7?

Task 3: Add a column to your one-way pivot table that shows percentages of the total sample for each level of vocab.

Task 4: Create a two-way pivot table (with one variable in rows and another in columns) showing the counts for the variables vocab and nativeBorn.²²

Task 5: Add both row and column percentages to the table.

Task 6: What percentage of native born people scored a 5 on the vocabulary test?

Task 7: What percentage of people who scored a 5 on the vocabulary test were native born?
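Tasks 6 and 7 look alike but use different denominators: one is a column percentage and the other is a row percentage. The sketch below is only an illustration of that difference, not the Excel procedure itself. It uses a tiny made-up set of records (hypothetical values, not the real GSSvocab data), with vocab in the rows and nativeBorn in the columns, and the helper names are invented for this example.

```python
from collections import Counter

# Hypothetical toy records of (vocab score, nativeBorn); NOT the real GSSvocab data.
records = [
    (5, "yes"), (5, "yes"), (5, "no"),
    (6, "yes"), (6, "yes"), (6, "yes"),
    (7, "yes"), (7, "no"),
]

# Two-way pivot table: one count per (row, column) combination.
cell_counts = Counter(records)
# One-way pivot tables: totals for each row (vocab) and each column (nativeBorn).
vocab_totals = Counter(vocab for vocab, _ in records)
native_totals = Counter(native for _, native in records)


def percent_of_total(vocab):
    """Share of the whole sample that has this vocab score."""
    return 100 * vocab_totals[vocab] / len(records)


def row_percent(vocab, native):
    """Of the people with this vocab score, the share with this nativeBorn value."""
    return 100 * cell_counts[(vocab, native)] / vocab_totals[vocab]


def column_percent(vocab, native):
    """Of the people with this nativeBorn value, the share with this vocab score."""
    return 100 * cell_counts[(vocab, native)] / native_totals[native]


print("Scored 5, as a share of everyone:", round(percent_of_total(5), 1))
print("Native born, among those who scored 5:", round(row_percent(5, "yes"), 1))
print("Scored 5, among the native born:", round(column_percent(5, "yes"), 1))
```

The same cell count (two native born people who scored 5) gives two different percentages because it is divided by a row total in one case and a column total in the other.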

5.1.2.2 Logistic regression

Now we will turn to the method of logistic regression, about which you watched a video in a previous assignment. As a reminder, logistic regression is used when our outcome of interest (dependent variable) falls into two²³ categories. Below, we will go through an example together, still using the GSSvocab data.

Research question: Is native born status associated with gender, vocabulary score, and/or age?

Dependent variable: nativeBorn

Independent variables: female, vocab, age

In our research design above, we are asking a question that has an outcome of interest—native born status—which is binary, meaning that there are only two options, yes and no. Therefore, logistic regression is well suited to answering this question.

Below, the output of a logistic regression model that addresses this question is presented.

Observations	1856
Dependent variable	nativeBorn
Type	Generalized linear model
Family	binomial
Link	logit

χ²(3)	44.67
Pseudo-R² (Cragg-Uhler)	0.04
Pseudo-R² (McFadden)	0.03
AIC	1361.90
BIC	1384.00

	exp(Est.)	2.5%	97.5%	z val.	p
(Intercept)	1.58	0.90	2.77	1.59	0.11
female	0.85	0.64	1.13	-1.10	0.27
vocab	1.26	1.17	1.35	6.30	0.00
age	1.01	1.00	1.01	1.40	0.16
Standard errors: MLE

Before we interpret the results of this logistic regression, let’s recall how we interpret the results of a linear regression. In linear regression, we estimate one slope for each independent variable (X). Then we write a sentence like this: For each one unit increase in X, Y is predicted to change by [slope], controlling for all other independent variables. We are trying to understand how Y (the dependent variable) changes each time we add 1 to X. The estimated slope is additive and tells us exactly this. For a one unit increase in X, the estimated slope is added to Y.

In logistic regression, we are also trying to understand how Y changes for each one unit change in X. The difference is that Y is now calculated as a likelihood of one thing (like being native born) being the case rather than the other (like not being native born). And instead of calling our estimate a slope, we will call it an odds ratio. This time, unlike in linear regression, the estimate is multiplicative rather than additive. For a one unit increase in X, the likelihood of Y (strictly speaking, the odds of Y) is multiplied by the odds ratio. Some examples follow.

Here’s how we interpret some of the odds ratios from the output above, looking at the exp(Est.) column in the output:

For each one-unit increase in female gender (which in this case represents a “transition” from male to female), the chance of being native born changes by 0.85 times, controlling for vocabulary score and age. Another way to write this is that women are 0.85 times as likely as men to be native born, controlling for vocabulary score and age. Since 0.85 is less than 1, and when you multiply something by a number less than 1 it gets smaller, women are less likely to be native born than men.
For each one-unit increase of vocabulary score, the chance of being native born changes by 1.26 times, controlling for gender and age. Since 1.26 is more than 1, and when you multiply something by a number greater than 1 it gets bigger, those with higher vocabulary scores are more likely to be native born than those with lower vocabulary scores.
For each one-year increase in age, the chance of being native born changes by 1.01 times, controlling for gender and vocabulary score. This means that the relationship between native born status and age is nearly nonexistent, because if you multiply something by 1.01, it stays almost the same.
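To make the multiplication concrete, here is a minimal sketch that applies the rounded odds ratios from the exp(Est.) column to one made-up person. It assumes the usual form of a logistic regression model, in which the predicted odds equal the intercept value multiplied by each odds ratio raised to the power of its variable. The function names and the example person are hypothetical, and because the inputs are rounded to two decimals the results are only illustrative.

```python
# Rounded values copied from the exp(Est.) column of the output above.
INTERCEPT_ODDS = 1.58
ODDS_RATIOS = {"female": 0.85, "vocab": 1.26, "age": 1.01}


def predicted_odds(female, vocab, age):
    """Predicted odds of being native born (assumed multiplicative model form)."""
    values = {"female": female, "vocab": vocab, "age": age}
    odds = INTERCEPT_ODDS
    for name, ratio in ODDS_RATIOS.items():
        # Each one-unit increase multiplies the odds by that variable's odds ratio.
        odds *= ratio ** values[name]
    return odds


def odds_to_probability(odds):
    """Convert odds into a probability between 0 and 1."""
    return odds / (1 + odds)


# Hypothetical person: male, 40 years old, with vocab score 5 and then 6.
odds_low = predicted_odds(female=0, vocab=5, age=40)
odds_high = predicted_odds(female=0, vocab=6, age=40)

print("Odds multiplier for +1 vocab:", round(odds_high / odds_low, 2))
print("Probability multiplier for +1 vocab:",
      round(odds_to_probability(odds_high) / odds_to_probability(odds_low), 3))
```

The odds are multiplied by exactly the vocab odds ratio. The probability also goes up, but by a smaller factor, which is why the odds ratio is best read as a statement about odds.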

Note that the odds ratios above tell us about the average relationships in our sample of 1856 observations (people) in the GSSvocab dataset. They do not tell us about the population that this sample represents. We will get to that later.

In linear regression, when the estimated slope (of the relationship between the dependent variable and an independent variable) is less than 0, it is a negative association. When this slope is greater than 0, it is a positive association. And when the slope is equal to 0, there is no association. This is because when we add a number greater than 0 to something, it increases. When we add a number less than 0 to something, it decreases. When we add 0 to something, it stays the same. So, for linear regression, the cut-off point between a negative and positive relationship is a slope of 0.

But for logistic regression, where we multiply instead of add, when the estimated odds ratio (of the relationship between the likelihood of the dependent variable and an independent variable) is less than 1, it is a negative association. When this odds ratio is greater than 1, it is a positive association. And when the odds ratio is equal to 1, there is no association. This is because when we multiply something positive (like odds) by a number greater than 1, it increases. When we multiply it by a number less than 1, it decreases. When we multiply it by 1, it stays the same. So, for logistic regression, the cut-off point between a negative and positive relationship is an odds ratio of 1.

This table summarizes what is written above:

Type of association	Linear regression slope	Logistic regression odds ratio
positive	>0	>1
none	0	1
negative	<0	<1

Now it’s time to look at some of the other parts of the regression output table. You’ll see columns labeled 2.5% and 97.5%. These show the 95% confidence interval of the estimated odds ratio (which is in the exp(Est.) column) in the population that this sample represents. Below is an interpretation of these confidence intervals for each of our three independent variables:

female: 0.64–1.13. We are 95% certain that in the population from which this sample was drawn, the odds ratio quantifying the relationship between the likelihood of being native born and being female (compared to being male) is no lower than 0.64, which would be a negative relationship. We are also 95% certain that in the population, the very same odds ratio is no higher than 1.13. Since 0.64 is less than 1, that suggests a possible negative relationship between likelihood of being native born and being female (compared to being male). Since 1.13 is greater than 1, that suggests a possible positive relationship between likelihood of being native born and being female (compared to being male). So, which one is it? Positive or negative (or no association)? We don’t know. We cannot say. So we conclude that we did not find any evidence of an association between native born status and gender.
vocab: 1.17–1.35. We are 95% certain that in the population from which this sample was drawn, the odds ratio quantifying the relationship between the likelihood of being native born and a one-unit increase in vocabulary score is no lower than 1.17, which would be a positive relationship. We are also 95% certain that in the population, the very same odds ratio is no higher than 1.35. Since 1.17 is more than 1, that suggests a possible positive relationship between likelihood of being native born and a one-unit increase in vocabulary score. Since 1.35 is greater than 1, that also suggests a possible positive relationship between likelihood of being native born and a one-unit increase in vocabulary score. This time, our 95% confidence interval is all above 1. So we conclude that we are more than 95% confident that a positive association exists between native born status and vocabulary score.
age: 1.00–1.01. We are 95% certain that in the population from which this sample was drawn, the odds ratio quantifying the relationship between the likelihood of being native born and a one-unit increase in age is no lower than 1.00, which would be no relationship. We are also 95% certain that in the population, the very same odds ratio is no higher than 1.01. Since 1.00 is equal to 1, that suggests no relationship between likelihood of being native born and a one-unit increase in age. Since 1.01 is greater than 1, that suggests an extremely small possible positive relationship between likelihood of being native born and a one-unit increase in age. Our 95% confidence interval again includes 1 and is all extremely close to 1. So we conclude that we did not find any evidence of an association between native born status and age.
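The reasoning in the three interpretations above follows one rule: check whether the 95% confidence interval lies entirely above 1, entirely below 1, or includes 1. The sketch below writes that rule as a small function and applies it to the three intervals from the native born model. The function name and the wording of the returned conclusions are made up for this illustration.

```python
def classify_interval(lower, upper):
    """Read a 95% confidence interval for an odds ratio against the cut-off of 1."""
    if lower > 1:
        # The whole interval is above 1.
        return "positive association"
    if upper < 1:
        # The whole interval is below 1.
        return "negative association"
    # The interval includes 1, so we cannot tell the direction.
    return "no evidence of an association"


# 2.5% and 97.5% columns from the native born model output above.
intervals = {
    "female": (0.64, 1.13),
    "vocab": (1.17, 1.35),
    "age": (1.00, 1.01),
}

conclusions = {
    name: classify_interval(lower, upper)
    for name, (lower, upper) in intervals.items()
}

for name, conclusion in conclusions.items():
    print(f"{name}: {conclusion}")
```

Note that the age interval starts at 1.00 as displayed, so it is treated as including 1, which matches the conclusion written above.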

Let’s turn to analyzing the goodness-of-fit of this logistic regression model. Keep in mind that for linear regression, we learned about the $R^2$ metric, which tells us the proportion of variation in the dependent variable that is explained by the variation in the independent variables. $R^2$ can be between 0 and 1 and a higher $R^2$ is better than a lower one.

For logistic regression, we cannot calculate an $R^2$ metric the same way. Instead, we can calculate something called $\text{Pseudo-}R^2$, which approximates what the $R^2$ would have been if this were a linear regression model. There are different ways of calculating $\text{Pseudo-}R^2$. Two versions of $\text{Pseudo-}R^2$—called Cragg-Uhler and McFadden—are given in the logistic regression output above. They tell us that the proportion of variation in the dependent variable that can be explained by the variation in the independent variables is only 0.04 (Cragg-Uhler) or 0.03 (McFadden). That is very low.
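For the curious, both values can be approximately reconstructed from other numbers in the same output. The sketch below rests on assumptions that the output itself does not state: that AIC equals the model deviance plus 2 times the number of estimated coefficients (4 here, counting the intercept), that χ²(3) is the difference between the null deviance and the model deviance, and that the standard textbook formulas for the McFadden and Cragg-Uhler (also called Nagelkerke) measures apply. You are not expected to do this calculation yourself.

```python
import math

# Numbers copied from the native born model output above.
n_observations = 1856
aic = 1361.90
chi_square = 44.67
n_coefficients = 4  # intercept, female, vocab, age

# Assumption: AIC = model deviance + 2 * (number of estimated coefficients).
model_deviance = aic - 2 * n_coefficients
# Assumption: the chi-square statistic is null deviance minus model deviance.
null_deviance = model_deviance + chi_square

# McFadden: proportional reduction in deviance relative to the null model.
mcfadden = 1 - model_deviance / null_deviance

# Cragg-Uhler (Nagelkerke): Cox-Snell value rescaled so its maximum is 1.
cox_snell = 1 - math.exp(-chi_square / n_observations)
cox_snell_max = 1 - math.exp(-null_deviance / n_observations)
cragg_uhler = cox_snell / cox_snell_max

print("McFadden:", round(mcfadden, 2))
print("Cragg-Uhler:", round(cragg_uhler, 2))
```

Under these assumptions, the rounded results agree with the 0.03 and 0.04 reported in the output.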

Now it’s your turn to interpret some logistic regression results! Here’s a new scenario: we have a dataset of 403 observations (people) who have been tested for diabetes and for whom we know many other characteristics. Here is a preview of the first few rows of this data:

diabetes_diagnosis	age	gender	BMI
no diabetes	46	female	22.12877
no diabetes	29	female	37.41553
no diabetes	58	female	48.36549
no diabetes	67	male	18.63600
diabetes	64	male	27.82202
no diabetes	34	male	26.49673
no diabetes	30	male	28.20269
no diabetes	37	male	34.33209
no diabetes	45	male	24.51124
no diabetes	55	female	35.77879

Research question: Is diabetes diagnosis associated with age, gender, and/or BMI?

Dependent variable: diabetes_diagnosis

Independent variables: age, gender, BMI

In our research design above, we are asking a question that has an outcome of interest—diabetes diagnosis—which is binary, meaning that there are only two options, no diabetes and diabetes. Therefore, logistic regression is well suited to answering this question.

Below, the output of a logistic regression model that addresses this question is presented.

Observations	384 (19 missing obs. deleted)
Dependent variable	diabetes_diagnosis
Type	Generalized linear model
Family	binomial
Link	logit

χ²(3)	44.40
Pseudo-R² (Cragg-Uhler)	0.19
Pseudo-R² (McFadden)	0.14
AIC	289.62
BIC	305.43

	exp(Est.)	2.5%	97.5%	z val.	p
(Intercept)	0.00	0.00	0.01	-6.91	0.00
age	1.06	1.04	1.08	5.61	0.00
genderfemale	0.78	0.42	1.47	-0.76	0.45
BMI	1.08	1.03	1.13	3.17	0.00
Standard errors: MLE

Please answer the following questions about the logistic regression output above, which addresses our research question about diabetes.

Task 8: For just our sample, interpret the odds ratio for age.

Task 9: What do we know about the relationship between the likelihood of having diabetes and age, in the population that our sample represents?

Task 10: For just our sample, interpret the odds ratio for genderfemale.

Task 11: What do we know about the relationship between the likelihood of having diabetes and gender, in the population that our sample represents?

Task 12: For just our sample, interpret the odds ratio for BMI.

Task 13: What do we know about the relationship between the likelihood of having diabetes and BMI, in the population that our sample represents?

Task 14: How well does our logistic regression model fit our data?

You have reached the end of this week’s quantitative homework assignment.

5.2 In class

5.2.1 Schedule

November 18 2020

5.2.2 Qualitative project presentations

Each team will present for 10 minutes and then spend 5 minutes answering questions, according to the following schedule.

Start time	Team
10:00 a.m.	Coffee Genes
10:15	Rice Crispr Treats
10:30	The Heterozygoats
10:45	The Gene Team

We can take a five-minute break from approximately 11:00–11:05 a.m.

5.2.3 Quantitative Activity

Today, we will spend 25 or 30 minutes together in a discussion, without splitting into groups, to answer any questions you have and synthesize some of the quantitative methodology that we have covered in the last few weeks.

This will hopefully make up for any confusion I caused with last week’s in-class quantitative activity.

So far, these are some of the quantitative methods we have learned to describe and/or analyze our data:

Histograms and bar charts
Scatterplots
Descriptive statistics
T-tests
Pivot tables
Linear regression in Excel and interpretation of resulting output
Logistic regression interpretation

And here are some important topics that we did not have time to cover:

Testing assumptions for t-tests, linear regression, and logistic regression.
Relating quantitative research questions to quantitative research methods.

Together, we will do as much of the following procedure as we can. You should bring up any questions you have at any point, either verbally or in the Zoom chat.

Step 1: In the Zoom chat, identify research questions that are relevant to genetic counseling research that might require quantitative methods to answer.

Step 2: Pick one or two research questions to examine carefully.

Step 3: Identify the outcome(s) of interest for the selected research question(s).

Step 4: Describe the data we would need to answer the research question(s). How would this data be structured?

Step 5: What type of quantitative analysis tool(s) would help us answer our selected research question(s)?

Step 6: What would be the capabilities and limitations of this approach?

As you start to develop your own research project plans, you can meet with Danielle and/or Anshul and we can go through the procedure above (and do much more) for your specific question of interest. We can discuss both qualitative and quantitative projects. Danielle will be the go-to person for quantitative projects and Anshul will be the go-to person for qualitative projects.

5.2.4 Quantitative Assessment

We will end our class with a brief quantitative methods assessment, at a separate link which Anshul will provide.

²¹ This deadline was originally Monday, November 16, 2020, but it has been postponed.
²² When this question was initially posted on Nov 12 2020, nativeBorn accidentally said female, which makes the assignment less streamlined (though still possible to complete), as one of you pointed out. I fixed this mistake on Nov 15 2020. It is fine for you to answer the question either in its previous or updated form. The goal is to practice using pivot tables.
²³ Sometimes more than two, but typically just two.