# Chapter 2 Jan 13–19: Descriptive Statistics and Data Distributions

Please read everything in this chapter and then complete (and submit!) the assignment at the end.

Begin by opening the following PowerPoint document and reading the first 20 slides:

Follow along with the remaining slides (slide 21 onward) as needed while you read the material below.

## 2.1 Descriptive Statistics

Please learn or review how to calculate the following descriptive statistics, which help us describe numeric data (a set of numbers), using the linked resources or descriptions:

These videos might be helpful to learn how to calculate these statistics:
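As a complement to those resources, here is a minimal sketch in R of a few common descriptive statistics. The choice of statistics (mean, median, variance, standard deviation, range) and the `scores` values are illustrative assumptions, not a list taken from the course materials.

```r
# Hypothetical data: eight quiz scores (made up for illustration)
scores <- c(4, 8, 6, 5, 3, 7, 9, 6)

score_mean   <- mean(scores)    # sum of the values divided by how many there are
score_median <- median(scores)  # middle value once the data are sorted
score_var    <- var(scores)     # sample variance (divides by n - 1)
score_sd     <- sd(scores)      # sample standard deviation = sqrt(variance)
score_range  <- range(scores)   # smallest and largest values

# The standard deviation computed by hand, step by step
deviations <- scores - score_mean
manual_sd  <- sqrt(sum(deviations^2) / (length(scores) - 1))

print(c(mean = score_mean, median = score_median, sd = score_sd))
```

The hand calculation is there only to show what `sd()` does; in practice the built-in function is enough.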

## 2.2 Data Distributions

Histograms can be used to describe data. A histogram counts how many values in your data fall within each of a set of selected intervals (bins). Read this page to see how they work:
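To make the counting concrete, here is a minimal sketch in R. The `values` and `bin_edges` below are made up for illustration; `hist()` with `plot = FALSE` returns the counts for each interval instead of drawing them.

```r
# Hypothetical data: ten made-up measurements
values <- c(1, 2, 2, 3, 5, 6, 6, 7, 9, 10)

# Interval edges: 0-2, 2-4, 4-6, 6-8, 8-10
# (by default a value on an edge is counted in the interval to its left)
bin_edges <- c(0, 2, 4, 6, 8, 10)

# plot = FALSE returns the counts without drawing anything
h <- hist(values, breaks = bin_edges, plot = FALSE)
print(data.frame(from = head(h$breaks, -1), to = h$breaks[-1], count = h$counts))

# Drawing the histogram shows the same counts as bar heights
hist(values, breaks = bin_edges, main = "", xlab = "Value")
```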

There are different ways that data can be distributed. Read about some here:

Imagine that we measured the heights of hundreds or thousands of people. It is likely that our histogram of all the heights would look like this (see note 1 at the end of this chapter):

This is called a normal distribution. Normal distributions can be spread out wide or very compact, but they all are tallest in the middle and shortest at the ends (the tails). They can all be characterized by a mean and standard deviation. Some examples are below.

Below is a normal distribution with 10000 samples (10000 measurements of something), mean = 50, and standard deviation = 5. You could pretend this is data on the number of questions that 10000 people got correct on a test. The average score was 50, and scores typically deviated from that average by about 5. The minimum score appears to be about 30 and the maximum about 70.

```r
# 10,000 random draws from a normal distribution with mean 50 and SD 5
hist(rnorm(10000, mean = 50, sd = 5), breaks = 20, main = "", xlab = "Test Score", xlim = c(20, 80))
```

Below is another with 10000 samples, mean = 50, and standard deviation = 1. You can see that this one is much more compact (which I have emphasized by keeping the x-axis range the same as above). You could pretend that these are the lengths of hand-manufactured walking sticks that are meant to be 50 inches long but aren’t always perfect.

```r
# Same mean as before, but a much smaller standard deviation (1 instead of 5)
hist(rnorm(10000, mean = 50, sd = 1), breaks = 20, main = "", xlim = c(20, 80), xlab = "Walking Stick Lengths (in)")
```

And finally, here is another normal distribution with 10000 samples, mean = 50, and standard deviation = 50. The next two histograms both show the same distribution, but with different x-axis ranges: the first keeps the range used above, and the second zooms out to show the whole distribution. I’m not sure what this could be an example of!

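```r
# Draw one sample so that both panels show exactly the same data
x <- rnorm(10000, mean = 50, sd = 50)
par(mfrow = c(1, 2))  # place the two plots side by side
hist(x, breaks = 20, main = "", xlim = c(20, 80))     # same x-axis range as above
hist(x, breaks = 20, main = "", xlim = c(-200, 300))  # zoomed out to show the tails
```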

All three of these are normal distributions, each characterized by a mean and a standard deviation (here the means are the same and only the standard deviations differ). Because a normal distribution is symmetric (balanced on both sides), its mean, median, and mode are all the same value. In sampled data like these, they will be approximately equal.
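One way to check these descriptions is to compute the statistics directly instead of reading them off the histograms. The sketch below is an illustration only: the seed and the variable names are arbitrary choices, and your numbers will differ slightly with a different seed.

```r
set.seed(802)  # arbitrary seed so the results are reproducible

# Three samples matching the three examples above
samples <- list(
  sd5  = rnorm(10000, mean = 50, sd = 5),
  sd1  = rnorm(10000, mean = 50, sd = 1),
  sd50 = rnorm(10000, mean = 50, sd = 50)
)

# One row per sample: center, spread, and extremes
summary_table <- data.frame(
  mean   = sapply(samples, mean),
  median = sapply(samples, median),
  sd     = sapply(samples, sd),
  min    = sapply(samples, min),
  max    = sapply(samples, max)
)
print(round(summary_table, 2))
```

In each row the mean and median should be close to 50 and the standard deviation close to the value passed to `rnorm()`, while the minimum and maximum move further apart as the standard deviation grows.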

Next, go through these two pages for some more information on normal distributions. The UConn page is a short tutorial on the normal distribution. It links to the David Lane page, which is essentially a z-table. Enter values for z to see how much area under the curve is associated with the z-value you selected.
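If you would like to reproduce a z-table lookup in R, `pnorm()` returns the area under the standard normal curve to the left of a given z-value. This is a small sketch, not part of the linked pages; the raw score of 60 is a made-up example that reuses the test-score distribution from above (mean 50, standard deviation 5).

```r
# Area under the standard normal curve to the left of a z-value
area_below <- pnorm(1.96)                    # roughly 0.975

# Area between two z-values: subtract the two left-hand areas
area_within_1sd <- pnorm(1) - pnorm(-1)      # roughly 0.68
area_within_2sd <- pnorm(2) - pnorm(-2)      # roughly 0.95

# Converting a raw score to a z-score: subtract the mean, divide by the SD
raw_score <- 60
z <- (raw_score - 50) / 5                    # 2 standard deviations above the mean
share_below_raw <- pnorm(z)                  # roughly 0.977

print(round(c(area_below, area_within_1sd, area_within_2sd, share_below_raw), 3))
```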

These optional videos may also be useful:

## 2.4 Visualizing and Inspecting Your Data

The first step of data analysis should be to become familiar with your data through descriptive statistics and graphing (often using visualizations such as histograms, scatterplots, boxplots, and more).

1. Make sure that the values for each of your variables are valid (this includes checking for data entry errors and for values outside the possible range for a variable).
2. Check to see if variables are normally distributed (often important for dependent variables).
3. Get a feel for how much variability you have in your variables. The descriptive statistics/characteristics we looked at above can be useful for this (especially mean, standard deviation, and shape of distribution).
4. Check for floor or ceiling effects. For example, were questions so easy (or so hard) that people got them all correct (or all wrong)?
5. If you have demographic variables, look at the characteristics of your sample (this affects generalizability, or how representative of a larger population your data is or isn’t).
6. Identify outliers and make a preliminary determination of how you plan to handle them (more below).
7. Once you are confident you have a clean dataset, score and/or code any variables. Then double-check that you have done those calculations correctly.
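A few of these checks can be sketched in R as follows. Everything about the `survey` data frame is hypothetical (the column names, the values, and the assumed valid age range of 18 to 100); it only illustrates the kind of inspection described in the list.

```r
# Hypothetical dataset: names, ranges, and values are made up for illustration
survey <- data.frame(
  age   = c(34, 29, 41, 250, 38, 45),  # 250 is a deliberate data entry error
  score = c(10, 10, 9, 10, 10, 10)     # quiz scored from 0 to 10
)

# Overview of each variable: minimum, quartiles, mean, maximum
print(summary(survey))

# Valid values: flag ages outside the assumed plausible range (18 to 100)
invalid_age <- survey$age < 18 | survey$age > 100
print(survey[invalid_age, ])

# Variability and shape: standard deviations and a histogram
print(sapply(survey, sd))
hist(survey$score, main = "", xlab = "Score")

# Ceiling effect: share of respondents who got the maximum score
ceiling_share <- mean(survey$score == 10)
print(ceiling_share)
```

Here most respondents are at the maximum score, which is the pattern a ceiling effect would produce.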

We will be practicing all of these guidelines throughout this course. Again, the first step of data analysis should be to become familiar with the data. Often, this is not the first step people take. They jump right into scoring their variables and running analyses. Later, when they encounter problems or things don’t work as they expect, they go back and look at the descriptives/frequencies/histograms. Sometimes people never go back, and they miss errors in the data (and those errors get published!). Definitely spend some time looking at your data before you jump to the analysis!

### 2.4.1 Outliers

Strategies for dealing with outliers are:

1. Remove the observation (see note 2 at the end of this chapter) from the analysis.
2. Remove that particular data point.
3. Transform the data (this can help, but not all the time).
4. Change nothing and run your analyses as planned.

The strategy you choose to deal with outliers will depend on a lot of factors, and you need to think carefully about how you plan to handle extreme values in your analysis (and you will need to justify this in any manuscript). This will vary from dataset to dataset. You need to figure out what makes the most sense in the context of the research question you are trying to address.
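The four strategies listed above could look like this in R. This is a sketch under stated assumptions: the `dat` data frame is made up, and the 1.5 × IQR rule used to flag values (via `boxplot.stats()`) is one common convention, not the only defensible definition of an outlier.

```r
# Hypothetical data: one row per participant (an observation)
dat <- data.frame(id = 1:7, score = c(10, 12, 11, 13, 12, 11, 95))

# Flag values more than 1.5 x IQR beyond the box, as boxplot() does by default
flagged <- dat$score %in% boxplot.stats(dat$score)$out

# Strategy 1: remove the whole observation (row)
dat_rows_removed <- dat[!flagged, ]

# Strategy 2: remove only that value, keeping the rest of the row
dat_value_removed <- dat
dat_value_removed$score[flagged] <- NA

# Strategy 3: transform the variable (a log transform pulls in large values)
dat_transformed <- transform(dat, log_score = log(score))

# Strategy 4: change nothing and analyze dat as it is
print(dat[flagged, ])
```

Whichever option you choose, keeping the original data frame untouched (as done here) makes it easy to compare results with and without the flagged values.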

## 2.5 Assignment

Please complete and submit all of the tasks, which are presented in separate sections below. You can put your responses into a Word document or similar type of document.

Do not hesitate to contact me with any questions or concerns, big or small. You can e-mail me or message/call me on my cell phone, which is the number at the bottom of my e-mails.

### 2.5.1 Conceptual Questions

Task 2: Describe one situation where it would be more useful to use z-scores than raw scores when analyzing data.

Task 3: Using your own words, what does the standard deviation tell us, and how is it calculated?

### 2.5.2 Exploring Data

Here is some background on the dataset: A team of researchers has developed an intervention to reduce mental illness stigma among primary care physicians. The intervention is an online training program designed to address common misconceptions and stereotypes about mental illness. They used a pre-post design and were interested in changes in both mental illness stigma (operationalized as attitudes towards people with mental illness) and knowledge (operationalized as mental health literacy). They also collected demographic information from all participants.

Task 4: Familiarize yourself with the dataset. Look at the variable names, types of variables, labels, value labels, etc. How many variables (columns) are there in this data?

Task 5: What is the primary research question the researchers are trying to address with this study?

Task 6: What is the main independent variable the researchers are interested in?

Task 7: What are the primary dependent variable(s) the researchers are interested in?

Task 8: What are the levels of measurement of each of the variables in the dataset?

For the next few questions on outliers, in addition to visually inspecting the data in a spreadsheet, you might find it useful to make a box plot. The following resources might be useful:

Task 9: Are there any outliers or data entry errors in the dataset? Also explain how you made that determination. Present at least one box plot as part of your answer.

Task 10: How did you decide whether a value was an outlier?

Task 11: For any data entry errors: What kinds of error(s) did you find? How did you decide to “fix” the error?

Before moving forward to the next questions, make sure you are now working with a “clean” dataset in which you have handled any outliers.

Task 12: Report the mean and standard deviation for the pre and post stigma and knowledge variables.

Task 13: Are the stigma and knowledge variables normally distributed? Explain how you made that determination (it should involve multiple methods). Include histograms in your response.

Now, split the data by gender. You can do this by sorting the data by the gender column. Then, copy all of the rows for women into a new spreadsheet and all of the rows for men into yet another empty spreadsheet. Now you are ready to analyze the data separately for men and women.
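If you prefer to work in R rather than a spreadsheet, the same split can be sketched as below. The `study` data frame, its column names, and its values are placeholders invented for illustration; they are not the course dataset, so you would need to substitute the real variable names.

```r
# Hypothetical data: column names and values are placeholders
study <- data.frame(
  gender      = c("woman", "man", "woman", "man", "woman", "man"),
  stigma_pre  = c(30, 28, 35, 31, 33, 27),
  stigma_post = c(25, 27, 29, 30, 26, 24)
)

# One data frame per gender, similar to copying rows into separate spreadsheets
by_gender <- split(study, study$gender)
women <- by_gender$woman
men   <- by_gender$man

# Mean and standard deviation of one variable within each group
group_means <- tapply(study$stigma_pre, study$gender, mean)
group_sds   <- tapply(study$stigma_pre, study$gender, sd)
print(group_means)
print(group_sds)
```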

Task 14: What are the mean and standard deviation for men and women for each of the stigma and knowledge variables? Are these variables normally distributed? Show the histograms for men and women for each of the stigma and knowledge variables.

Task 15: Based on your preliminary examination of the data, what do you think the next steps should be to answer the research question?

Task 16: You are required to meet with me (Anshul) on Zoom sometime between January 22–31, 2020. If you have not done so already, e-mail me a few times that would work for you to meet for up to an hour. If any students are available at the same time, we can all meet together.

Android mobile app

Apple (iOS) mobile app

Task 18: Write a short biographical statement about yourself, which will be shared with the others in the class (see note 4 at the end of this chapter). This does not have to be long. Also explain what you hope to get out of this class and/or any projects you may be involved with that this class might help with.

Task 19: You are required to name the file that you submit for this assignment according to the following convention: KumarAnshul-WeekJan13-Homework-HE802-MGHIHP. Replace “KumarAnshul” with your last name and then your first name. Then submit the assignment as explained below.

Task 20: This semester, we will use our Dropbox accounts provided through MGHIHP to share files, especially homework assignments. Sometime during the first week of the course, I will create a shared folder for each student (see note 5 at the end of this chapter). Please submit your homework assignment to this shared folder. Did you encounter any problems accessing the shared Dropbox folder that I shared with you? If you did encounter problems, we will troubleshoot when we meet on Zoom between January 22–31.

Task 21: Since this is the first week, also e-mail your assignment to me as a backup.

1. Image source: https://i.stack.imgur.com/hvTdo.png

2. An observation is a row of your data when it is in a spreadsheet. A row of data can be a person, an organization, a group, a car, or anything else really about which data has been collected. An observation is also sometimes called a data point.

3. You can download and then open the file in Excel or a similar program on your own computer. You might also be able to click on the “Add to My Drive” button to copy it into your own Google Drive account and then open it in Google Sheets.

4. I will circulate everyone’s statements by e-mail or D2L. It will not be made public.

5. Each folder will have just two people who have access to it: one student in the course and me.