Chapter 2 Week 7 – Oct 21 2021 class meeting

This week, our goals are to…

  1. Become familiar with and practice writing research questions that qualitative research can answer.

  2. Become familiar with and practice writing interview questions that will help you use qualitative analysis to answer a research question.

  3. Review basic descriptive statistics.

  4. Review basic linear relationships and linear equations, including the meanings of slope and intercept.

  5. Set up Microsoft Excel so that you can do basic quantitative data analysis.

2.1 Before class

2.1.1 Checklist – Complete by Oct 21 2021

By our class meeting on Thursday, October 21, 2021, you should complete the following tasks:

This is all you need to do before we meet for class. If anything is unclear or you have any questions, do not hesitate to email Anshul at or contact me by phone.

It is fine to work with others on the assignments, but make sure you state who you worked with at the top of your assignment.

Please adhere carefully to all word/sentence limit requirements. In general, your answers should be very short and precise (often just one or two sentences). I do not want you to write a lot in these assignments. If you do feel like writing more (perhaps to brainstorm or get feedback about an idea), please send that extra writing to me in an email so that we can more easily discuss it.

2.1.2 Qualitative Assignment #1

Due by the start of class on October 21 2021.

Please type your answers in a word processor on your computer. Do not print them out, but be ready to share them on your computer screen or by email during class. Also submit your completed assignment document in the appropriate D2L dropbox.

2.1.2.1 Part 1 – Research Question

Task 1: Imagine that you are going to conduct qualitative research on genetic counselors. Write two research questions that lend themselves to qualitative analysis. These research questions should be one sentence each and end in a question mark.

2.1.2.2 Part 2 – Interview Questions

Task 2: Choose one of your two research questions from above. Draft three open-ended questions that you could ask in an interview that would elicit rich, detailed information from your interviewee that would help you answer your selected research question. For each of these three questions, write no more than two sentences about why this question will help you answer your research question.

Task 3: Draft one question that would help you answer your research question but is not open-ended.

2.1.3 Quantitative Assignment #1

Due by the start of class on October 21, 2021.

It is highly recommended hand-write your answers and bring your completed, hand-written assignment to class on paper. If you can find a way to do all of the required tasks in the assignment electronically, that is also acceptable. Also please scan or photograph your assignment and submit it to the appropriate D2L dropbox.

2.1.3.1 Part 1 – Basic Descriptive Statistics and Tests

First, we need to review some basic math and statistics. Consider the height and sex of 20 people who were surveyed:

sex heightcm
Male 174
Male 189
Female 185
Female 195
Male 149
Male 189
Male 147
Male 154
Male 174
Female 169
Male 195
Female 159
Female 192
Male 155
Male 191
Female 153
Female 157
Male 140
Male 144
Male 172

Task 4: How many men and how many women are in this sample of 20 people?

Task 5: Separately for women and men, calculate “by hand” the (arithmetic) mean, median, mode, range,6 and standard deviation of height. Show all of your work.7 This video might be helpful for mean, median and mode; and this video might be helpful for standard deviation.

Task 6: The shortest man in the sample is 140 cm tall. How many standard deviations below the mean is his height?8 Remember to use the correct standard deviation out of the two you calculated.

Task 7: The tallest woman is 195 cm. How many standard deviations above the mean is she?

Task 8: List a set of five numbers in which the mean is greater than the median. List a set of five numbers in which the median is greater than the mode. List a set of five numbers for which the mean and median are equal. Show work to prove it.

Now have a look at this histogram made with 50 people’s heights, broken up into 10-cm ranges or “buckets”:

Task 9: Does this distribution of heights above appear to be normal, bimodal, uniform, or something else? Figure 1 on this page or searching here may help you answer this question.

If our sample contained hundreds or thousands of people instead of just tens, it is likely that our histogram would look like this:9

This is called a normal distribution. Normal distributions can be spread out wide or be very compact, but they all are tallest in the middle and shortest at the ends (the tails). They can all be characterized by a mean and standard deviation. Some examples are below.

Below is one with 10000 samples, mean = 50, and standard deviation = 5. You could pretend this is data on the number of questions that 10000 people got correct on a test. The average score was 50, the average deviation from that score was 5. The minimum score appears to be about 30 and the highest around 70 or 80.

Here below is another with 10000 samples, mean = 50, standard deviation = 1. You can see that this one is much more compact (which I have emphasized by keeping the x-axis range the same as above). You could pretend that these are the lengths of hand-manufactured walking sticks that are meant to be 50 inches in length but aren’t always perfect.

And finally here is another normal distribution with 10000 samples, mean = 50, and standard deviation = 50. The next two histograms both show the same distributions, but with different x-axis ranges and buckets. I’m not sure what this could be an example of!

All three of these are normal distributions, each characterized by a different mean and standard deviation. The mean of a normal distribution that is balanced on both sides, like these ones, will often be the same as the mode of that distribution.

2.1.3.2 Part 2 – Comparing two distributions

For this part of the assignment, we will use the 2 Sample T-Test tool (which I will call the tool throughout this section of the assignment).

Imagine that we, some researchers, are trying to answer the following research question: How does fertilizer affect plant growth?

So we conduct a randomized controlled trial in which some plants are given a fixed amount of fertilizer (treatment group) and other plants are given no fertilizer (control group). Then we measure how much each plant grows over the course of a month. Let’s say we have ten plants in each group and we find the following amounts of growth.

The 10 plants in the control group each grew this much (each number corresponds to one plant’s growth):

3.8641111
4.1185322
2.6598828
0.3559656
2.8639095
0.9020122
5.0527020
2.3293899
3.5117162
4.3417785

The 10 plants in the treatment group each grew this much:

7.532292
1.445972
6.875600
6.518691
1.193905
4.659153
3.512655
4.578366
8.791810
4.891557

Delete the numbers that are pre-populated in the tool. Copy and paste our control data in as Sample 1 and our treatment data in as Sample 2.

Task 10: What is the mean and standard deviation of the control data? What is the mean and standard deviation of the treatment data? Do not calculate these by hand. The tool will tell these to you in the sample summary section.

You’ll see that the tool has drawn the distributions of the data for our treatment and control groups. That’s how you can visualize the effect size (impact) of an RCT. It has also given us a verdict at the bottom that the “Sample 2 mean is greater.” This means that this particular statistical test (a t-test) concludes that we are more than 95% certain that sample 1 (the control group) and sample 2 (the treatment group) are drawn from separate populations. In this case, the control group is sampled from the “population” of plants that didn’t get fertilizer and the treatment group is sampled from the “population” of those that did.

This process is called inference. We are making the inference, based on our 20-plant study, that in the broader population of plants, fertilizer is associated with more growth. The typical statistical threshold for inference is 95% certainty. In the difference of means section of the tool, you’ll see p = 0.0468 written. This is called a p-value. The following formula gives us the percentage of certainty we have in a statistical estimate, based on the p-value (which is written as p): \(\text{Level of Certainty} = (1-p)*100\). To be 95% certain or higher, the p-value must be equal to 0.05 or lower. That’s why you will often see p<0.05 written in studies and/or results tables.

With these particular results, our experiment found statistically significant evidence that fertilizer is associated with plant growth.

Now, click on the radio buttons next to ‘Sample 1 summary’ and ‘Sample 2 summary.’ This will allow you to compare different distributions to each other quickly, without having to change the numbers above. Let’s imagine that the control group had not had as much growth as it did. Change the Sample 2 mean from 5 to 4.5.

Task 11: What is the new p-value of this t-test, with the new mean for Sample 2? What is the conclusion of our experiment, with these new numbers?

Task 12: Gradually reduce the standard deviation of Sample 2 until the results are statistically significant at the 95% certainty level. What is the relationship between the standard deviation of your samples and our ability to distinguish them from each other statistically?10

2.1.3.3 Part 3 – Basic Linear Relationships

Remember this?

\[ y = mx + b \] This is a linear equation in which

\[m = slope\] \[b = intercept\]

Task 13: Draw a small coordinate plane (graph) on your paper. Graph the equation \(y = 2x + 1\). When x = 7, what is y? Plug 7 into the equation in place of x to figure it out! Show all of your work.

Task 14: On the same coordinate plane, graph the equation \(y = 0.5x + 3\). As x increases by one unit, what happens to y? When x = 7, what is y?

Now I’m going to rewrite the formula for a linear equation:

\[ y = b_1 x+b_0\] In this new version, \(b_1\) is the slope and \(b_0\) is the intercept. Statistical results and formulas are often written with these \(b_{whatever}\) coefficients rather than \(m\).

Linear equations allow us to figure out the relationship between two variables in a survey data set. Let’s look at this data on cars:

mpg cyl disp hp drat wt qsec vs
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1
Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1
Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0
Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0
Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0
Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0
Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0
Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1
Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1
Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0
AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0
Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0
Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1
Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0
Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0
Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1

Survey data is arranged in a spreadsheet, with each row corresponding to an observation and each column corresponding to a characteristic or variable. In this case, the unit of observation is the car, so each row in this data is a car. There are 32 cars in total in the data. A survey-taker surveyed these 32 cars and found out a number of characteristics about them.

Task 15: What was the unit of observation in the data on height that you saw earlier in this assignment?

Consider this research question: Is a car’s gas efficiency influenced by the number of cylinders it has?

This question is very hard to answer, because we are asking if a car’s cylinders cause its gas efficiency. This question is too hard to answer, so we are going to tackle a slightly easier research question: Is gas efficiency, as measured by miles per gallon (mpg) associated with the number of cylinders (cyl) that a car has?

Since mpg is our outcome of interest, it is our dependent variable.

Task 16: What was the dependent variable in our pretend RCT about plants?

Since cyl is our “input” of interest, it is the independent variable. We want to know how mpg varies as a function of cyl. In other words: as we vary cyl, what happens to mpg?

Task 17: What was the independent variable in our pretend RCT about plants?

Remember: the dependent variable depends upon the independent variable. The independent variable doesn’t depend on anything; it’s independent and can do whatever it wants.

Now it’s time to see what the statistical relationship is between mpg and cyl, or mpg vs cyl, we could say. We always write [dependent variable] vs [independent variable]. Let’s start with a simple scatterplot:

We always put the dependent variable on the y-axis (the vertical axis) and the independent variable on the x-axis (the horizontal axis). Clearly, this plot suggests that there is a noteworthy relationship between mpg and cyl.

Next, we run a linear regression on these two variables in this data:

mpg vs cyl linear regression result, from Excel

This is where we get back to the linear equation. This regression analysis output is just a linear equation:

\[mpg = -2.9cyl+37.9\]

We can plot this line along with our data:

Remember:

\[ y = b_1 x+b_0 \] In this case, \(y\) is the dependent variable, which is mpg. \(x\) is the independent variable, which is cyl. \(b_1 = -2.9\) and \(b_0 = 37.9\). \(b_1\) is the slope and \(b_0\) is the intercept. Notice that we now used \(b_1\) to represent slope (instead of \(m\)) and \(b_0\) to represent intercept (instead of \(b\)); both ways are fine.

This is how we phrase the results of this regression analysis: For each additional cylinder, a car is predicted to have 2.9 fewer miles per gallon of gas efficiency. It is not a certainty. It is just a prediction. It is a prediction about the relationship between two variables.

Now, let’s make some more predictions, but these ones are about a specific outcome, rather than about a relationship.

Task 18: If a car has 8 cylinders, what is its predicted gas efficiency? Show your work.11

Task 19: If a car has 4 cylinders, what is its predicted gas efficiency? Show your work.

2.1.3.4 Part 4 – Set up Excel

To analyze data in this course, we will use Microsoft Excel. While more powerful data analysis platforms might have been desirable to learn instead, we don’t have the time right now to learn one of those platforms to a satisfactory extent. In this section of the assignment, you will set up Excel such that you can easily perform some of the statistical tests and procedures that we will use during our time together.

Task 20: Do you have Excel on your computer? If yes, proceed to the next item. If not, please contact Anshul right away at so that this can be resolved.

Task 21: Open Excel on your computer. Click on the Data tab and look for the Data Analysis button. The image below shows what this might look like. If you do see the Data Analysis button as an option, proceed to the next item. If you do not see the Data Analysis button, please click here to find instructions to add the Data Analysis option to your Excel. Contact Anshul at right away if you are not able to get the Data Analysis button to appear.

You will need to use the Data Analysis tool shown at the far right

Task 22: Click on the Data Analysis button (pictured above on the right). You should see a new window like the one pictured below. Please make sure that you have the option to choose Regression from the list of available tools. Contact Anshul at right away if you are not able to see the Regression option.

Make sure you have the Regression option in Excel as pictured here

We’ll pick up from this point in class.

2.2 In class

2.2.1 Schedule

October 21 2021

2.2.2 Qualitative activity

Each of you will work in your existing project teams for the next four classes. With your team, you will work on a small qualitative “research project.” This project is just for practice. This is not real research and we will never publish or disseminate what we learn in these projects.

You will present your results to the entire class in a short presentation during our Week 11 meeting. The research subjects/participants for your project will be the members of one of the other three teams in the class:

  • Locus Pocus and Nucleo-tides will study each other.
  • Genie Beanies and MJK: Master Jenetic Kounselors will study each other.

These are the research questions that were decided upon during class:

Team Research question Team being studied
Locus Pocus What are certain coping methods used to combat the challenges faced by first-year genetic counseling students? Nucleo-tides
Nucleo-tides What are some of the factors prospective GC students consider when ranking different programs? Locus Pocus
Genie Beanies What is the impact of childhood experience of genetic counseling students on wanting to become a genetic counselor? MJK: Master Jenetic Kounselors
MJK: Master Jenetic Kounselors What was your experience as a prospective student interviewing for genetic counseling graduate programs that use a holistic admission method? Genie Beanies

As homework this upcoming week, you will interview one person on the team you are studying, asking the questions that your team has prepared. That same person who you interviewed will interview you, asking the (different) questions that their team has prepared.12 Each person in the class will be the interviewer once and the interviewee once. If the sizes of teams that are paired together to study each other are not the same, please alert me so that we can make adjustments.

Today in class, you will work with your teammates to decide the research question you will study and draft the interview questions that you will ask the members of the other team. All of you in your group will ask the same questions to the people you interview.

Please complete the following tasks/questions, working with your team. It might be useful to work together in a Google Doc. Note that you may also be able to take advantage of screen sharing features.

Task 24: Decide on a research question to investigate. It can be anything that you are interested in knowing about the team you are “studying.” Show your research question to Anshul as soon as you have decided.13 This should not take long. You can share your answers to your Qualitative Assignment #1 with each other at this time, as a starting point to your rapid brainstorming.

Important: Do not choose a research question that will require you to ask overly personal or privacy-invading questions to your interviewees.

Examples of acceptable questions include:

  1. Why do genetic counseling students want to be genetic counselors?
  2. What are the main challenges faced by first-year students pursuing a genetic counseling degree?
  3. What is the impact of childhood experiences on sub-specialty interest among genetic counseling students?
  4. What are the long-term career goals of aspiring genetic counselors? What are the root causes of these goals?

Once again, don’t forget to tell Anshul your selected research question as soon as you decide on it!

Task 25: Draft in-depth interview questions that each member of your team will ask to their interviewee over the coming week. This list of questions is also called an interview schedule. Remember that your goal is to elicit long, detailed, in-depth answers from your interviewees. Therefore, you should ask mostly open-ended questions. Also, you are not only trying to figure out what the answer to your research question is, but also why it is the case. You want to uncover the underlying mechanisms that drive the phenomenon you are studying.

Task 26: Draft a list of follow-up questions that you can ask or prompts that you can say to the interviewee to get more information or details from them. Some of these follow-up questions can be generic (meaning you can use them at any time during the interview) and others can be tied to particular questions that you have drafted.

Task 27: Discuss any logistics related to conducting the interviews. Be sure to take the following into consideration:

  • When you will do the interviews (early enough so that you have time to transcribe, which is part of the homework due by the start of our next class meeting).
  • Privacy and confidentiality: Since this is just practice research, when you are being interviewed, make sure that you do not share any information that you want to keep private. Anything you say in an interview may be shared with the whole class, unlike in real research where confidentiality is required.
  • Plan to meet with the person you are interviewing (in the other team, the team that your team is studying) for one hour, during which time you will interview them for 30 minutes with your team’s interview questions and then they will interview you for 30 minutes with their team’s interview questions.
  • Ideally you will audio record the interview, so that you can later listen to that recording and transcribe.
  • Anything else you can think of.

Please alert Anshul when you have completed this in-class qualitative activity. After doing this, if you finish early, you can take a break or begin the quantitative activity below.

2.2.3 Quantitative activity

Please complete the following quantitative activity. If you need more time to finish, please do so as homework within the next seven days. I am also available to meet by appointment to assist with finishing this activity. Each student should try to complete this activity on their own computer, but it is fine to work together with a partner.

In this activity, we will use a dataset called GSSVocab. We will be trying to answer the following research question: Are age, gender, and years of education associated with vocabulary?

The names of the four variables you will use are age, female, educ, and vocab.

Task 28: Download the file GSSvocab.xlsx from D2L to your own computer. It is located in Course Materials -> Content -> Week 7- Oct 21- Methodology 1. Open this file in Excel in your computer. Note that I modified a few things in the original data before putting it in D2L for you.

Task 29: How many observations are in this dataset?

Task 30: How many rows of data are in this dataset?

Task 31: How many people were surveyed in this dataset?

Task 32: What is the sample size of this dataset?

Task 33: How many variables are in this dataset?

Task 34: How many columns are in this dataset?

Task 35: What is (are) our dependent variable(s) of interest?

Task 36: What is (are) our independent variable(s) of interest?

Task 37: Present relevant basic descriptive statistics for the four variables age, female, educ, and vocab.14 This resource might be helpful. You can focus on just mean and standard deviation for this. Do not calculate by hand; the computer will calculate for you.

Task 38: Create a histogram for each of our four variables of interest. This resource might be helpful. How do each of the variables appear to be distributed? Do these histograms approximately match your calculated basic descriptive statistics from earlier?

Task 39: Make a scatterplot showing the relationship between vocab and educ. Be sure to put the dependent variable on the vertical axis (y-axis). This resource might be helpful.

Task 40: Add a trendline to the scatterplot you just made. This resource might be helpful.

Task 41: What is the relationship between vocab and educ, as suggested by the trendline you just drew on your scatterplot? Does this trend make sense to you or not?

Task 42: Run a linear regression to answer your research question. This resource might be helpful.

Task 43: Write the equation for the regression results that you got above. Remember that you already have seen an example of this before, in the quantitative assignment you did before today’s class. You will need to look at the Coefficients section of the linear regression output.

Task 44: Write a single sentence each with the interpretation of the coefficients for age, gender, and educ.

Task 45: How much vocabulary does our regression model predict someone with the following characteristics will have? 18 years old, female, 16 years of education.

Let Anshul know when you finish this in-class quantitative activity.


  1. Range is the difference between the smallest and largest value in a sample of data.↩︎

  2. I know this is annoying but it’s important to do once.↩︎

  3. Hint intended to help you but still make you think: Imagine a new unit of length called a blah. One blah is 3 cm. If you are 27 cm tall and I am 18 cm tall, we use subtraction to figure out that you are 9 cm taller than me. How many blahs shorter than you am I? Well, we have to convert 9 cm into blahs: (9 cm)/(3 cm/blah) = 3 blahs. In the previous question, you calculated how many cm are in a standard deviation. Just replace blah with standard deviation in this example to get the answer to the question (and replace my height with 140 cm and your height with the mean).↩︎

  4. Image source: https://i.stack.imgur.com/hvTdo.png↩︎

  5. Remember, when we are analyzing the data in an RCT, we are trying to figure out if the treatment and control groups had different or similar results. We are seeing if we can distinguish the two groups from each other in any way. The mean and standard deviation of the data in the two groups are the key parameters that help us tell the treatment and control groups apart, which is why you need to play around with the t-test tool to understand these relationships.↩︎

  6. This is just like the earlier question in which you plugged values of x into the linear equation.↩︎

  7. A more concrete example: Let’s say teams A and B have been assigned to study each other. All members of team A will decide on a research question and interview questions to ask. All members of team B will do the same, deciding on a different research question, with unique interview questions to study it. Let’s say Allie is a member of team A and Brody is a member of team B. Allie will interview Brody about team A’s research question, and Brody will interview Allie about team B’s research question.↩︎

  8. In a virtual class setting, this may require one of you to briefly exit your breakout room and go back to the main room↩︎

  9. Mean and standard deviation are two examples of descriptive statistics.↩︎