Chapter 2 Jan 13–19: Linear Relationships and RStudio Basics

Please read and follow the instructions in this chapter and then complete (and submit) the assignment at the end. Since this is an advanced, PhD-level course, the first few weeks will include a review of basic statistical and math concepts at a fast pace.

This week, our goals are to…

1. Review basic descriptive statistics and calculate them by hand.

2. Review basic linear relationships and linear equations, including the meaning of slope and intercept.

3. Set up R, RStudio, and/or RStudio Cloud and execute basic tasks using R code.

2.1 Basic Concepts

Please skim through all of the content in this section called Basic Concepts. You do not need to read it all in detail if you feel comfortable with the material. But you should read carefully anything that you do not understand. Normally you will have to read everything carefully in this course, but this is an exception.

Begin by opening the following PowerPoint document and reading the first 20 slides:

Follow along with the remaining slides (page #21 and up) as needed as you read the material below.

2.1.1 Descriptive Statistics

Please learn or review how to calculate the following important descriptive statistics that help us describe numeric data (a set of numbers) that we might have, using the linked resources or descriptions:

These videos might be helpful to learn how to calculate these statistics:

2.1.2 Data Distributions

Histograms can be used to describe data. Histograms simply count the number of values that are in your data within selected intervals. Read this page to see how they work:

There are different ways that data can be distributed. Read about some here:

Imagine that we measured the heights of hundreds or thousands of people. It is likely that our histogram of all the heights would look like this:[6]

This is called a normal distribution. Normal distributions can be spread out wide or very compact, but they all are tallest in the middle and shortest at the ends (the tails). They can all be characterized by a mean and standard deviation. Some examples are below.

Below is a normal distribution with 10000 samples (10000 measurements of something), mean = 50, and standard deviation = 5. You could pretend this is data on the number of questions that 10000 people got correct on a test. The average score was 50, and the standard deviation (roughly, the typical distance of a score from that average) was 5. The lowest score appears to be about 30 and the highest about 70.

hist(rnorm(10000, mean = 50, sd = 5), breaks=20, main ="", xlab = "Test Score",xlim = c(20,80))

Here below is another with 10000 samples, mean = 50, standard deviation = 1. You can see that this one is much more compact (which I have emphasized by keeping the x-axis range the same as above). You could pretend that these are the lengths of hand-manufactured walking sticks that are meant to be 50 inches in length but aren’t always perfect.

hist(rnorm(10000, mean = 50, sd = 1), breaks=20, main ="", xlim = c(20,80), xlab = "Walking Stick Lengths (in)")

And finally, here is another normal distribution with 10000 samples, mean = 50, and standard deviation = 50. The next two histograms both show the same data, but with different x-axis ranges. I’m not sure what this could be an example of!

par(mfrow = c(1, 2))                   # show two plots side by side
x <- rnorm(10000, mean = 50, sd = 50)  # draw one sample so both plots show the same data
hist(x, breaks = 20, main = "", xlim = c(20, 80))     # zoomed in on 20 to 80
hist(x, breaks = 20, main = "", xlim = c(-200, 300))  # wide enough to show the tails

All three of these are normal distributions, each characterized by a different mean and standard deviation. Because a normal distribution is symmetric (balanced on both sides), its mean, median, and mode are all the same value. In a sample drawn from a normal distribution, like the ones above, the three will be close to each other but usually not exactly equal.
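As a rough check on the point above, the sketch below draws one simulated sample and compares its mean, median, and standard deviation with the values we asked rnorm() for. The seed value and the object names are my own assumptions, chosen only so that the result can be reproduced.

```r
# Illustrative sketch; the seed and object names are assumptions.
set.seed(123)
scores <- rnorm(10000, mean = 50, sd = 5)

sample_mean   <- mean(scores)    # should be close to 50
sample_median <- median(scores)  # close to the mean because the distribution is symmetric
sample_sd     <- sd(scores)      # should be close to 5

round(c(mean = sample_mean, median = sample_median, sd = sample_sd), 2)
```

The sample values will not equal 50 and 5 exactly, because each run of rnorm() produces a different random sample.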

Next, go through these two pages for some more information on normal distributions. The UConn page is a short tutorial on the normal distribution. It links to the David Lane link, which is essentially a z-table. Enter values for z to see how much area under the curve is associated with the z-value you selected.
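If you would like to look up the same kind of areas in R instead of a z-table, the minimal sketch below uses pnorm(), which returns the area under the standard normal curve to the left of a z-value. The object names are my own and are only for illustration.

```r
# pnorm(z) gives the area under the standard normal curve to the left of z.
area_below_1  <- pnorm(1)               # about 0.84
area_within_1 <- pnorm(1) - pnorm(-1)   # about 0.68 of values lie within 1 SD of the mean
area_within_2 <- pnorm(2) - pnorm(-2)   # about 0.95 of values lie within 2 SDs of the mean

round(c(area_below_1, area_within_1, area_within_2), 3)
```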

These optional videos may also be useful:

2.1.4 Visualizing and Inspecting Your Data

The first step of data analysis should be to become familiar with your data through descriptive statistics and graphing (often using visualizations such as histograms, scatterplots, boxplots, and more).

1. Make sure that the values for each of your variables are valid (this includes checking for data entry errors and values outside the possible range for a variable).
2. Check to see if variables are normally distributed (often important for dependent variables).
3. Get a feel for how much variability you have in your variables. The descriptive statistics/characteristics we looked at above can be useful for this (especially mean, standard deviation, and shape of distribution).
4. Check for floor or ceiling effects. For example, were questions so easy (or so hard) that people got them all correct (or all wrong)?
5. If you have demographic variables, look at the characteristics of your sample (this affects generalizability, or how representative of a larger population your data is or isn’t).
6. Identify outliers and make a preliminary determination of how you plan to handle them (more below).
7. Once you are confident you have a clean dataset, score and/or code any variables. Then double-check you have done those calculations correctly.

We will be practicing all of these guidelines throughout this course. Again, the first step of data analysis should be to become familiar with the data. Often, this is not the first step people take. They jump right into scoring their variables and running analyses. Later, when they encounter problems or things don’t work as they expect, they go back and look at the descriptives/frequencies/histograms. Sometimes people never go back, and they miss errors in the data (and those errors get published!). Definitely spend some time looking at your data before you jump to the analysis!
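Here is a minimal sketch of what a first look at a dataset might involve in R. The quiz data frame, its values, and the plausible age range of 18 to 100 are all made up for illustration; they are not course data.

```r
# Hypothetical quiz data; every name and value here is made up for illustration.
quiz <- data.frame(
  id    = 1:8,
  age   = c(34, 29, 41, 230, 38, 27, 45, 31),  # 230 is a likely data entry error
  score = c(10, 10, 9, 10, 10, 10, 8, 10)      # maximum possible score is 10
)

summary(quiz)  # descriptive statistics for every column

# Validity check: flag ages outside an assumed plausible range of 18 to 100
invalid_age <- quiz$age < 18 | quiz$age > 100
quiz[invalid_age, ]

# Ceiling check: share of people who got the maximum score
ceiling_share <- mean(quiz$score == 10)
ceiling_share

hist(quiz$score, main = "", xlab = "Quiz score")  # look at the shape of the distribution
```

In this made-up example, one age is clearly invalid and most people scored the maximum, which are both things you would want to know before running any analysis.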

2.1.4.1 Outliers

Strategies for dealing with outliers are:

1. Remove the observation[7] from the analysis.
2. Remove that particular data point.
3. Transform the data (this can help, but not all the time).
4. Change nothing and run your analyses as planned.

The strategy you choose to deal with outliers will depend on a lot of factors, and you need to think carefully about how you plan to handle extreme values in your analysis (and you will need to justify this in any manuscript). This will vary from dataset to dataset. You need to figure out what makes the most sense in the context of the research question you are trying to address.
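To make these strategies concrete, here is a small sketch using made-up reaction times. The 1.5 × IQR rule used to flag the extreme value is one common convention that I am assuming for illustration; it is not a rule for this course, and the data are hypothetical.

```r
# Hypothetical reaction times in milliseconds; 2500 is an extreme value.
rt <- c(410, 395, 450, 430, 405, 2500, 420, 415, 440, 400)

# Assumed convention: flag values more than 1.5 * IQR beyond the quartiles.
q <- quantile(rt, c(0.25, 0.75))
fence <- 1.5 * IQR(rt)
is_outlier <- rt < q[1] - fence | rt > q[2] + fence

mean(rt)               # change nothing: the extreme value pulls the mean up
mean(rt[!is_outlier])  # remove the flagged value
mean(log(rt))          # transform the data (here, a log transformation)
median(rt)             # for comparison: the median is barely affected
```

Comparing these results side by side can help you judge how much your conclusions depend on the strategy you choose.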

2.2 Regression Analysis

This is a new section of the chapter and you must read it carefully, unlike above where you were allowed to skim.

The primary focus of this course is regression analysis techniques. Please read/watch the following materials as an introduction to regression.

2.3 Assignment

Please complete all of the tasks below and submit them by the due date (which you can find in the course calendar in Chapter 1). You should answer in a Word file or any similar type of file that you prefer to use. Additionally, you will need to separately submit an R code file that shows all of the R commands you entered in RStudio Cloud.

2.3.1 Descriptive Statistics Review

Consider the height and sex of 20 people who were surveyed:

head(height[c("sex","heightcm")],20)
sex     heightcm
Male    174
Male    189
Female  185
Female  195
Male    149
Male    189
Male    147
Male    154
Male    174
Female  169
Male    195
Female  159
Female  192
Male    155
Male    191
Female  153
Female  157
Male    140
Male    144
Male    172

Question 1: How many men and how many women are in this sample of 20 people?

Question 2: Separately for women and men, calculate “by hand” the (arithmetic) mean, median, mode, range,[8] and standard deviation of height. Show all of your work.[9] This video might be helpful for mean, median and mode; and this video might be helpful for standard deviation.
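Once you have done the calculations by hand, you can check your method in R. The sketch below uses made-up numbers that are unrelated to the height data, and it assumes the sample (n − 1) version of the standard deviation, which is what R’s sd() computes. Note that base R has no built-in function for the statistical mode, so the sketch counts values with table() instead.

```r
# Hypothetical values, unrelated to the height data in the assignment.
x <- c(2, 4, 4, 5, 7, 9)
n <- length(x)

x_mean   <- sum(x) / n      # arithmetic mean: add everything up, divide by n
x_median <- median(x)       # middle value (here, the average of the two middle values)
counts   <- table(x)        # how many times each value occurs
x_mode   <- as.numeric(names(counts)[counts == max(counts)])  # most frequent value(s)
x_range  <- max(x) - min(x) # largest value minus smallest value
x_sd     <- sqrt(sum((x - x_mean)^2) / (n - 1))  # sample standard deviation

all.equal(x_sd, sd(x))      # the by-hand formula agrees with R's sd()
```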

Question 3: The shortest man in the sample is 140 cm tall. How many standard deviations below the mean is his height?[10] Remember to use the correct standard deviation out of the two you calculated.

Question 4: The tallest woman is 195 cm. How many standard deviations above the mean is she?
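The general idea behind Questions 3 and 4 can be sketched in R as follows. The numbers below are hypothetical and deliberately different from the height data, so you will still need to do the assignment calculations yourself.

```r
# Hypothetical numbers, deliberately different from the assignment data.
group_mean <- 100
group_sd   <- 15
value      <- 130

# Distance from the mean, expressed in standard deviation units
z <- (value - group_mean) / group_sd
z
```

A positive result means the value is above the mean, and a negative result means it is below the mean.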

Question 5: List five numbers for which the mean is greater than the median. List five numbers for which the median is greater than the mode. List five numbers for which the mean and median are equal. Show work to prove it.

Now have a look at this histogram made with 50 people’s heights, broken up into 10-cm ranges or “buckets”:

hist(height$heightcm, main="Height of sampled people", xlab="Height (cm)", border="black", col="skyblue", xlim=c(140,200), las=1, breaks=5)

Question 6: Does this distribution of heights above appear to be normal, bimodal, uniform, or something else? Figure 1 on this page or searching here may help you answer this question.

2.3.2 Linear Relationships Review

$$y = mx + b$$

Remember this? This is a linear equation in which

$m$ = slope

$b$ = intercept

Question 7: Draw a small coordinate plane (graph) on your paper. Graph the equation $y = 2x + 1$. When x = 7, what is y? Plug 7 into the equation in place of x to figure it out! Show all of your work.

Question 8: On the same coordinate plane, graph the equation $y = 0.5x + 3$. As x increases by one unit, what happens to y? When x = 7, what is y?

Now I’m going to rewrite the formula for a linear equation:

$$y = b_1 x + b_0$$

In this new version, $b_1$ is the slope and $b_0$ is the intercept. Statistical results and formulas are often written with these $b_{whatever}$ coefficients rather than $m$.

Linear equations allow us to figure out the relationship between two variables in a survey data set.
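If it helps to see slope and intercept in action, here is a minimal sketch in R. The line used here (slope 3, intercept 2) and the function name line_y are made up for illustration and are different from the equations in Questions 7 and 8.

```r
# Hypothetical line, different from the ones in Questions 7 and 8.
slope     <- 3
intercept <- 2
line_y <- function(x) slope * x + intercept

line_y(0)              # the intercept: the value of y when x is 0
line_y(5) - line_y(4)  # the slope: the change in y when x increases by one unit

curve(line_y, from = 0, to = 10, xlab = "x", ylab = "y")  # draw the line
```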
Let’s look at this data on cars:

mtcars
 mpg cyl disp  hp drat   wt qsec vs am gear carb
  21   6  160 110  3.9 2.62 16.5  0  1    4    4
  21   6  160 110  3.9 2.88   17  0  1    4    4
22.8   4  108  93 3.85 2.32 18.6  1  1    4    1
21.4   6  258 110 3.08 3.21 19.4  1  0    3    1
18.7   8  360 175 3.15 3.44   17  0  0    3    2
18.1   6  225 105 2.76 3.46 20.2  1  0    3    1
14.3   8  360 245 3.21 3.57 15.8  0  0    3    4
24.4   4  147  62 3.69 3.19   20  1  0    4    2
22.8   4  141  95 3.92 3.15 22.9  1  0    4    2
19.2   6  168 123 3.92 3.44 18.3  1  0    4    4
17.8   6  168 123 3.92 3.44 18.9  1  0    4    4
16.4   8  276 180 3.07 4.07 17.4  0  0    3    3
17.3   8  276 180 3.07 3.73 17.6  0  0    3    3
15.2   8  276 180 3.07 3.78   18  0  0    3    3
10.4   8  472 205 2.93 5.25   18  0  0    3    4
10.4   8  460 215    3 5.42 17.8  0  0    3    4
14.7   8  440 230 3.23 5.34 17.4  0  0    3    4
32.4   4 78.7  66 4.08  2.2 19.5  1  1    4    1
30.4   4 75.7  52 4.93 1.61 18.5  1  1    4    2
33.9   4 71.1  65 4.22 1.83 19.9  1  1    4    1
21.5   4  120  97  3.7 2.46   20  1  0    3    1
15.5   8  318 150 2.76 3.52 16.9  0  0    3    2
15.2   8  304 150 3.15 3.44 17.3  0  0    3    2
13.3   8  350 245 3.73 3.84 15.4  0  0    3    4
19.2   8  400 175 3.08 3.85 17.1  0  0    3    2
27.3   4   79  66 4.08 1.94 18.9  1  1    4    1
  26   4  120  91 4.43 2.14 16.7  0  1    5    2
30.4   4 95.1 113 3.77 1.51 16.9  1  1    5    2
15.8   8  351 264 4.22 3.17 14.5  0  1    5    4
19.7   6  145 175 3.62 2.77 15.5  0  1    5    6
  15   8  301 335 3.54 3.57 14.6  0  1    5    8
21.4   4  121 109 4.11 2.78 18.6  1  1    4    2

Survey data is arranged in a spreadsheet, with each row corresponding to an observation and each column corresponding to a characteristic or variable. In this case, the unit of observation is the car, so each row in this data is a car. There are 32 cars in total in the data. A survey-taker surveyed these 32 cars and found out a number of characteristics about them.

Question 9: What was the unit of observation in the data on height that you saw earlier in this assignment?

Consider this research question: Is a car’s gas efficiency influenced by the number of cylinders it has?

This question is very hard to answer, because we are asking if a car’s cylinders cause its gas efficiency.
This question is too hard to answer, so we are going to tackle a slightly easier research question: Is gas efficiency, as measured by miles per gallon (mpg), associated with the number of cylinders (cyl) that a car has?

Question 10: What is the dependent variable in this research question?

Question 11: What is the independent variable in this research question?

Remember: the dependent variable depends upon the independent variable. The independent variable doesn’t depend on anything; it’s independent and can do whatever it “wants.”

Now it’s time to see what the statistical relationship is between mpg and cyl, or mpg vs cyl, we could say. We always write [dependent variable] vs [independent variable]. Let’s start with a simple scatterplot:

plot(mtcars$cyl,mtcars$mpg)

We always put the dependent variable on the y-axis (the vertical axis) and the independent variable on the x-axis (the horizontal axis). Clearly, this plot suggests that there is a noteworthy relationship between mpg and cyl.

Next, we run a linear regression on these two variables in this data:

summary(lm(mpg~cyl,data=mtcars))
##
## Call:
## lm(formula = mpg ~ cyl, data = mtcars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -4.9814 -2.1185  0.2217  1.0717  7.5186
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  37.8846     2.0738   18.27  < 2e-16 ***
## cyl          -2.8758     0.3224   -8.92 6.11e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.206 on 30 degrees of freedom
## Multiple R-squared:  0.7262, Adjusted R-squared:  0.7171
## F-statistic: 79.56 on 1 and 30 DF,  p-value: 6.113e-10

This is where we get back to the linear equation. This regression analysis output is just a linear equation:

$$mpg = -2.9 \cdot cyl + 37.9$$

Remember:

$$y = b_1 x + b_0$$

In this case, $y$ is the dependent variable, which is mpg. $x$ is the independent variable, which is cyl. $b_1 = -2.9$ and $b_0 = 37.9$. $b_1$ is the slope and $b_0$ is the intercept.
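If you want to pull the slope and intercept out of the regression yourself, the sketch below shows one way to do it. The object names (fit, b0, b1, by_hand, by_predict) are my own choices, and the 6-cylinder car is just an illustrative value.

```r
# Same model as above, stored in an object so we can reuse it
fit <- lm(mpg ~ cyl, data = mtcars)

b  <- coef(fit)            # named vector with "(Intercept)" and "cyl"
b0 <- b[["(Intercept)"]]   # intercept, about 37.9
b1 <- b[["cyl"]]           # slope, about -2.9

# Predicted mpg for a 6-cylinder car, computed two equivalent ways
by_hand    <- b1 * 6 + b0                                   # plug x = 6 into y = b1*x + b0
by_predict <- predict(fit, newdata = data.frame(cyl = 6))   # let R do the same calculation

c(by_hand = by_hand, by_predict = unname(by_predict))
```

Both approaches give the same number because predict() is simply plugging the value of cyl into the linear equation.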
This is how we phrase the results of this regression analysis: For each additional cylinder, a car is predicted to have 2.9 fewer miles per gallon of gas efficiency. It is not a certainty. It is just a prediction.

Now, let’s make some more predictions.

Question 12: If a car has 8 cylinders, what is its predicted gas efficiency? Show each step of your work.[11]

Question 13: If a car has 4 cylinders, what is its predicted gas efficiency? Show your work.

2.3.3 Set Up and Explore RStudio Cloud

To analyze data in this course, we will use a free, online platform called RStudio Cloud. It’s kind of like Google Drive but for R.[12] This section of the homework will help you set it up and use some of its basic functions. Those of you who participated in HE-942 have already had some exposure to it.

Have a look at this short video and try to follow along to set it up for yourself. Go to http://rstudio.cloud to access RStudio Cloud for free.

Here are some tips and notes:

• I personally log into RStudio Cloud using my Google account. You can do that or make a new account if you prefer.
• The video says that each group just needs one workspace to share, but I want each individual one of you in the course to make your own workspace and new project.

At the 2:00 mark of the video, you’ll see that RStudio has been loaded into the browser and the cursor is blinking in the Console, ready for you to type something. Try typing some commands into this. First, just type 2+2 and hit enter. Below is the command and the result you should get:[13]

2+2
## [1] 4

IMPORTANT INSTRUCTION: Now let’s make a code file, called a script, in which you can save all of the commands you run as you do your work. Please put all of your remaining commands into an R script file so that I can see what you did! To make such a file, click on File -> New File -> R Script. Next, save the file by clicking on File -> Save or on the little picture of a disk, just like you might in your favorite word processor.
Make sure to save your work often. For the rest of this assignment, put all R commands into this new file. Each command should be on its own line within the file.

Now let’s run some more commands. Just copy and paste what you see below into your R script file and you should get the result that’s displayed below as well. Put the cursor on the first of these three lines in your script file and then push the Run button. You’ll have to do this three times, once for each of the three lines of code below. When you push the Run button, RStudio sends the command from your script file to the console. The benefit of putting our commands into a script file is that they are saved for us forever, in case we want to go back and look at what commands we ran. If we instead just enter commands directly into the console, we will lose that record of what we did as soon as we close R.

apple = 4
banana = 7
apple+banana
## [1] 11

Above, apple and banana are both variables that have been stored in R as 4 and 7, respectively.

Question 14: Create two new variables with different names than the ones above. Then subtract one from the other and show the result. You should accomplish this using three lines of code, just like in the apple and banana example above. Make sure you put this into your R script file and not directly into the console.
mtcars
 mpg cyl disp  hp drat   wt qsec vs am gear carb
  21   6  160 110  3.9 2.62 16.5  0  1    4    4
  21   6  160 110  3.9 2.88   17  0  1    4    4
22.8   4  108  93 3.85 2.32 18.6  1  1    4    1
21.4   6  258 110 3.08 3.21 19.4  1  0    3    1
18.7   8  360 175 3.15 3.44   17  0  0    3    2
18.1   6  225 105 2.76 3.46 20.2  1  0    3    1
14.3   8  360 245 3.21 3.57 15.8  0  0    3    4
24.4   4  147  62 3.69 3.19   20  1  0    4    2
22.8   4  141  95 3.92 3.15 22.9  1  0    4    2
19.2   6  168 123 3.92 3.44 18.3  1  0    4    4
17.8   6  168 123 3.92 3.44 18.9  1  0    4    4
16.4   8  276 180 3.07 4.07 17.4  0  0    3    3
17.3   8  276 180 3.07 3.73 17.6  0  0    3    3
15.2   8  276 180 3.07 3.78   18  0  0    3    3
10.4   8  472 205 2.93 5.25   18  0  0    3    4
10.4   8  460 215    3 5.42 17.8  0  0    3    4
14.7   8  440 230 3.23 5.34 17.4  0  0    3    4
32.4   4 78.7  66 4.08  2.2 19.5  1  1    4    1
30.4   4 75.7  52 4.93 1.61 18.5  1  1    4    2
33.9   4 71.1  65 4.22 1.83 19.9  1  1    4    1
21.5   4  120  97  3.7 2.46   20  1  0    3    1
15.5   8  318 150 2.76 3.52 16.9  0  0    3    2
15.2   8  304 150 3.15 3.44 17.3  0  0    3    2
13.3   8  350 245 3.73 3.84 15.4  0  0    3    4
19.2   8  400 175 3.08 3.85 17.1  0  0    3    2
27.3   4   79  66 4.08 1.94 18.9  1  1    4    1
  26   4  120  91 4.43 2.14 16.7  0  1    5    2
30.4   4 95.1 113 3.77 1.51 16.9  1  1    5    2
15.8   8  351 264 4.22 3.17 14.5  0  1    5    4
19.7   6  145 175 3.62 2.77 15.5  0  1    5    6
  15   8  301 335 3.54 3.57 14.6  0  1    5    8
21.4   4  121 109 4.11 2.78 18.6  1  1    4    2

mtcars, shown above, is a dataset that is built into R for us to use whenever we want. That’s why it was easy for me to use it as an example earlier in this assignment. Now you can try some of the same commands. Run each line one at a time:

plot(mtcars$cyl,mtcars$mpg)
abline(lm(mpg ~ cyl, data=mtcars))

Let’s calculate the mean and standard deviation of the variable mpg:

mean(mtcars$mpg)
## [1] 20.09062
sd(mtcars$mpg)
## [1] 6.026948

Above, we are telling R that we want the mean and sd of the variable mpg. But we need to tell R where that variable is. It is located in the mtcars dataset. So we type mtcars$ which tells R to look inside of mtcars. Then a list of variables will appear. You can then either type mpg or use the mouse and/or arrow keys to find and select it. I like to partially type the variable and then hit enter when the correct one is highlighted.
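As a small illustration of how R finds a variable inside a dataset, here are a few equivalent ways to point R at a column. I use wt (weight) here rather than cyl so that Question 15 is still left for you to do; the object names are my own.

```r
# wt (weight) is used here so that Question 15 is left for you to do.
mean_dollar  <- mean(mtcars$wt)       # dollar-sign notation, as used above
mean_bracket <- mean(mtcars[["wt"]])  # equivalent: double-bracket notation
sd_with      <- with(mtcars, sd(wt))  # equivalent idea: evaluate sd(wt) inside mtcars

names(mtcars)  # lists every variable you can put after mtcars$

c(mean = mean_dollar, sd = sd_with)
```

All three approaches look inside the same dataset, so you can use whichever one you find easiest to read.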

Make sure you save your R script file as you continue to work. You should get in the habit of doing this regularly.

Question 15: Use R to calculate the mean and standard deviation of cyl.

Question 16: You are required to meet with me (Anshul) on Zoom sometime between January 22–31, 2020. If you have not done so already, e-mail me a few times that would work for you to meet for up to an hour. If any students are available at the same time, we can all meet together.

Android mobile app

Apple (iOS) mobile app

Question 18: Write a short biographical statement about yourself which will be shared with the others in the class.[14] This does not have to be long. Also explain what you hope to get out of this class and/or any projects you may be involved with that this class might help with.

Question 19: You are required to name the file that you submit for this assignment according to the following naming convention: KumarAnshul-WeekJan13-Homework-HE902-MGHIHP. Replace “KumarAnshul” with your last name and then your first name. Then submit the assignment as explained below. You will submit two files: your file with all of your work and answers and another with your R code (the script file). You can add the suffix -RCode to your code file if you want, but both files should start with the same name that follows the naming convention. To download your R script file from RStudio Cloud, find your saved file in the file viewer within RStudio Cloud, select the file by using the checkbox to the left of the file, and then click on More -> Export....

Question 20: This semester, we will use our Dropbox accounts provided through MGHIHP to share files, especially homework assignments. Sometime during the first week of the course, I will create a shared folder for each student.[15] Please submit your homework assignment to this shared folder. Did you encounter any problems accessing the shared Dropbox folder that I shared with you? If you did encounter problems, we will troubleshoot when we meet on Zoom between January 22–31.

Question 21: Since this is the first week, also e-mail your assignment to me as a backup.

6. Image source: https://i.stack.imgur.com/hvTdo.png

7. An observation is a row of your data when it is in a spreadsheet. A row of data can be a person, an organization, a group, a car, or anything else really about which data has been collected. An observation is also sometimes called a data point.

8. Range is the difference between the smallest and largest value in a sample of data.

9. I know this is annoying but it’s important to do once.

10. Hint intended to help you but still make you think: Imagine a new unit of length called a blah. One blah is 3 cm. If you are 27 cm tall and I am 18 cm tall, we use subtraction to figure out that you are 9 cm taller than me. How many blahs shorter than you am I? Well, we have to convert 9 cm into blahs: (9 cm)/(3 cm/blah) = 3 blahs. In the previous question, you calculated how many cm are in a standard deviation. Just replace blah with standard deviation in this example to get the answer to the question (and replace my height with 140 cm and your height with the mean).

11. This is just like the earlier question in which you plugged values of x into the linear equation.

12. R is a data analysis platform that is open source and free to use.

13. As you may have guessed and/or noticed, this entire document is written using R, which allows me to show you commands and results easily.

14. I will circulate everyone’s statements by e-mail or D2L. It will not be made public.

15. Each folder will have just two people who have access to it: one student in the course and me.