# Responsible applied statistics in R for behavioral and health data (working title)

*Version – 15 August 2022*

# Information and Reference

This textbook accompanies the course HE-902, *Advanced Statistical Modeling for Health Professions Education*, in the PhD in HPEd program at MGH Institute of Health Professions. HE-902 is a statistics course that equips students to analyze healthcare and/or behavioral data in R. This online textbook is the main resource to guide you through the course HE-902 in the Spring 2023 semester.

The purpose of this textbook is to prepare users to *responsibly* apply data analysis methods to their work. Beyond just making a regression model and interpreting its results, it is important to know when we can and cannot *trust* these results. It is critical to know which claims you can and cannot make based on results that the computer gives you. The goal of this textbook is to equip you with all of the tools you need to responsibly apply and interpret regression models.

Each chapter in this textbook contains reading (or links to reading/videos) that you should learn as well as an assignment that you should complete and submit by the deadline in the course calendar.

My name is Anshul Kumar and I am the author/preparer of this textbook. You can reach me at akumar@mghihp.edu with any questions, comments, and/or suggestions for modifications to this textbook.

All of the materials here are available online for anybody to use. Those who are not part of the course HE-902 are welcome to use this textbook if it is useful. Please e-mail me any feedback you have.

I use a lot of footnotes like the one after this sentence.^{1} You can read the footnote by clicking on the small-but-tall number in between this sentence and the previous one. Footnotes contain comments from me or extra information that might be helpful. But footnotes in this textbook are never required for you to read. **It is fine for you to skip the footnotes and not read them.**

Please note that many of the examples of data and research that you will encounter in this textbook use a binary, inappropriately narrow, and/or potentially problematic conceptualization of sex, gender, and other individual-level characteristics. My personal view is that this is often not the best way to organize data or present examples. Furthermore, our collective understanding of sex, gender, race, ethnicity, and other individual characteristics is constantly changing, and it is critical for all of us as data analysts and researchers to engage in related discussions and initiatives. I am always looking for materials that represent a more inclusive framework and I do update materials when possible. I welcome your input and suggested alternatives.^{2}

At IHP, we recognize that students who observe religious and cultural holidays may not be able to complete their assignments or study for exams during these holidays. We will do our best to be flexible and accommodate your schedule related to your observation of holidays. Please just let us know about your expected schedule.

## 0.1 Course calendar 2023

The calendar below shows which chapter to use each week, when weekly assignments are due, and when additional required activities must happen (oral exams and final project). Keep in mind that this calendar might change during the semester.

Each week starts on a Monday and ends on a Sunday.

**Each week, please read and follow all instructions in the corresponding chapter. Then complete the assignment at the end of the chapter and submit it in D2L. Also complete any additional tasks such as oral exams or your final project.**

Week of | Week/Chapter # | Assignment/Activities |
---|---|---|
Jan 2 | 1 | Assignment in R. |
Jan 9 | 2 | Assignment in R. |
Jan 16 | 3 | Assignment in R. |
Jan 23 | 4 | Assignment in R. |
Jan 30 | 5 | Assignment in R. Schedule oral exam #1. |
Feb 6 | 6 | Assignment in R. Do oral exam #1. |
Feb 13 | 7 | Assignment in R. Do oral exam #1 (if not already complete). |
Feb 20 | 8 | Assignment in R. Schedule oral exam #2. |
Feb 27 | 9 | Assignment in R. Do oral exam #2. |
Mar 6 | 10 | Assignment in R. Do oral exam #2 (if not already complete). |
Mar 13 | 11 | Assignment in R. |
Mar 20 | Spring Break | No new content or assignment. |
Mar 27 | 12 | Assignment in R. Schedule oral exam #3. |
Apr 3 | 13 | Assignment in R. Do oral exam #3. |
Apr 10 | 14 | Assignment in R. Do oral exam #3 (if not already complete). |
Apr 17 | None | Final project is due April 24 2023. |

TBA = to be announced

Note that the week of April 10 2023 (Week 14) will be the final week in which there is new material given.

The final project is due on Monday, April 24 2023.

Students are responsible for scheduling their three oral exams on or near the scheduled weeks in the calendar, by e-mailing all course instructors.

All assignments are due on the Sunday at the end of each week. For example, Week 3 of the class starts on Monday January 16 2023; the assignment for that week is due on Sunday January 22 2023. Assignments can be submitted at any time on the day they are due.

## 0.2 Useful links

Link to this HE-902 course textbook: https://bookdown.org/anshul302/HE902-MGHIHP-Spring2020/ or http://tinyurl.com/he902stats.

HE-802 at MGHIHP – Elective statistics course in our M.S. in HPEd program

HE-942 onsite seminar Quantitative Methods Workshop: 2019, 2020, 2021

MGHIHP Academic Calendar PDF (available via the academic calendars page)

Submit a help desk ticket to MGHIHP (for technical support related to Zoom, your IHP e-mail or other accounts, and similar issues): submit ticket, instructional video

D2L site for HE-902 spring 2022 at MGH Institute of Health Professions

- You can find assignment submission drop boxes by clicking on: Assessments -> Assignments

## 0.3 Assignments, grading, and curriculum

### 0.3.1 Grade calculation

Type of Work | % of Grade |
---|---|
Weekly assignments | 25 |
Participation | 5 |
Oral exams | 45 |
Final project | 25 |

Your grade will be calculated as shown in the table above.
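As a rough illustration of how the weights in the table combine, here is a minimal sketch in R. The weights come from the table; the component scores are made up for the example and are not real data, and the assumption that each component is scored on a 0 to 100 scale is mine.

```r
# Weights from the grade table (percent of the final grade)
weights <- c(weekly = 25, participation = 5, oral_exams = 45, final_project = 25)

# Hypothetical component scores on a 0-100 scale (invented for illustration)
scores <- c(weekly = 95, participation = 100, oral_exams = 90, final_project = 85)

# Weighted average: each score counts in proportion to its weight
final_grade <- sum(weights * scores) / sum(weights)
final_grade
## [1] 90.5
```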

### 0.3.2 Description of curriculum

Here are descriptions of the activities you will do in this course:

**Weekly assignments**: Homework assignments will involve applying/practicing the statistical technique(s) that is the focus of the week using a provided dataset. Because of the cumulative nature of the course, these assignments may also involve applying knowledge from previous weeks’ material. Some assignments will help you prepare for your final project. Assignments will be posted within the online textbook and students will submit completed assignments in D2L. Note that if you have data of your own that you are interested in analyzing, you can often use your own data instead of the provided data for the weekly assignments. Please discuss this with the instructors as desired. As long as you are adequately practicing the new skills each week, it doesn’t matter which data you use. If you find yourself spending too much time on an assignment or spending more than 30 minutes on a single task in an assignment, please do not hesitate to contact us so that we can assist you in finishing each task and moving on to the next one.

**The purpose of weekly assignments is to discuss or practice skills. Coding assignments are mostly graded based on your completion of the work. An instructor will look at your submitted work for each assignment, but might not give you any feedback on it if you have fulfilled the requirements. If you ask a question or we notice any omissions or errors, then we will give you feedback accordingly.**

**Participation**: This mostly relates to your participation in mandatory video calls over Zoom. These will include welcome/orientation sessions, oral exams, and occasional group meetings with classmates. Optional office hours or scheduled meetings with instructors will also count towards participation.

**Oral exams**: You must take three oral exams, each occurring approximately during the weeks specified in the calendar. Each exam will be a one-on-one Zoom (online video) meeting between a student and an instructor. During the exam, you will be asked to show your understanding of the concepts we cover and you will be asked to demonstrate data analysis tasks on your own computer while sharing your screen on Zoom. The exams are “open-book,” meaning that you can refer to any notes or course materials during the exam. You are allowed to re-take an exam as many times as you would like, if you are not satisfied with your initial performance. If you want to get your exam dates on the calendar early, just e-mail the instructors and we can definitely do that.

**Final project**: You will develop and execute an abbreviated version of a full quantitative research study and submit it as a final project at the end of the term. This project will include research questions, study hypotheses, sample size/power, gathering or finding the necessary data (this can be instructor-provided if you prefer), and identifying analyses necessary to test hypotheses. Students will then conduct and interpret a full data analysis and write up results, including tables or figures, as appropriate. Some of the homework assignments will help you complete this project. Students who turn in their final project early can receive feedback and then resubmit the project after incorporating the feedback (with the grade of the resubmission being recorded as the final project grade). Details about this project will be posted within the online textbook and students will submit completed projects in D2L.

## 0.4 Oral exams information

Note that information for each exam might be modified or updated in the few weeks prior to when it occurs in our course schedule.

### 0.4.1 Skills covered in the exams

Here are some notes about what to focus on as you prepare for each Oral Exam in this course:

Oral Exam #1: Oral Exam #1 relates to content up to and including Chapter 5. Please come to Oral Exam #1 prepared to **run** (in R) and **interpret** the results of OLS linear regression models, as well as run (in R) and interpret the results of diagnostic tests of OLS assumptions. This is the primary focus of Oral Exam #1. A few other questions will also be asked, but they are secondary to the primary focus. You should also be ready to create and interpret a two-way table and import data into R from an Excel file. **If you do not feel ready to demonstrate these skills by the scheduled day of your exam, you should postpone your exam to a later date, which is completely acceptable** (and use the additional time to ask any questions and/or review content). I recommend that you practice these skills on your own prior to the exam.

Oral Exam #2: Oral Exam #2 relates to content up to and including Chapter 8, with a primary focus on skills that were not covered in Oral Exam #1. Please come to Oral Exam #2 prepared to **run** (in R) and **interpret** the results of logistic regression models, regression models that model nonlinear data, and regression models that model interactions. This is the primary focus of Oral Exam #2. Questions about causal inference and model specification will also be asked, but they are secondary to the primary focus. You should also review and be prepared to interpret a two-way table. **If you do not feel ready to demonstrate these skills by the scheduled day of your exam, you should postpone your exam to a later date, which is completely acceptable** (and use the additional time to ask any questions and/or review content). I recommend that you practice these skills on your own prior to the exam.

Oral Exam #3: Oral Exam #3 relates to content up to and including Chapter 12, with a primary focus on skills that were not covered in Oral Exams #1 and #2. Please come to Oral Exam #3 prepared to **run** (in R) and **interpret** analyses related to multilevel and/or clustered data (this includes intra-class correlation, fixed effects, and random effects), and **interpret** the results of ordinal and multinomial logistic regression models. This is the primary focus of Oral Exam #3. A few other questions will also be asked, but they are secondary to the primary focus. **If you do not feel ready to demonstrate these skills by the scheduled day of your exam, you should postpone your exam to a later date, which is completely acceptable** (and use the additional time to ask any questions and/or review content). I recommend that you practice these skills on your own prior to the exam.

### 0.4.2 Details and tips about the exams

**How should I prepare for an oral exam? What will it test?**

- The exam will test your knowledge of and skills related to the quantitative methods we have learned in the chapters corresponding to each exam.
- The exam will test your ability to conduct some but not all of these methods in R.
- You can refer to any materials you would like (course book, your notes and assignments, online sources, anything else) during the exam. Therefore, I suggest that you refresh your memory about *where* you can find R code that you need, so that you can copy and paste it easily. You can also copy and paste from your homework assignments during the exam. Beyond that, I recommend reviewing the statistical concepts and their **interpretations** more than R code.
- More weight will be given to your knowledge and interpretation of statistical concepts than your ability to write R code.

**What will happen during the exam?**

- The exam will likely take 1–2 hours,^{3} depending on your level of familiarity with the content.
- We will ask you a few theoretical questions.
- We will ask you to open a dataset in R and run a few data preparation commands in it.
- We will ask you to conduct a series of statistical tests in R and interpret the results.
- We will tell you after each task/question if you were correct or not. You can re-try some tasks on the spot if you wish. Other items can be flagged for re-testing later.
- If you need to have anything re-tested, we can schedule a separate time for that or you can re-test some topics during a subsequent exam (for example, if you miss any questions on Oral Exam #1, we can re-test you on those during your Oral Exam #2).
- We will try to be efficient during the exam so that we don’t run out of time.

**Why do we do oral exams in this course?**

- Since ours is an asynchronous course in which we do not have regular class meetings, we instructors do not have an opportunity to interact with each of you to make sure that you are comfortable with all of the content and skills. If this were an in-person course with regular class meetings, we would use a flipped classroom model in which you would read this textbook and watch videos before coming to class and then we would do lab-style activities during the class meeting in which you would apply skills in R and interpret results, together as a group. The three Oral Exams in this course are used to replace these in-person interactions and make sure that each student has at least 3–6 hours of interaction with an instructor while doing quantitative analysis during the semester.

**How and when do I schedule my oral exams?**

- Please e-mail ALL course instructors with times that work for your schedule, a week or two before you want to do each exam.
- The recommended timings of the oral exams are given in the course calendar.

## 0.5 Final project details and requirements

This section contains specific expectations and requirements for the final project in this class.

These requirements are final for spring 2022.

### 0.5.1 Description

The final project is *not* meant to be even close to a full quantitative research study. Another way to think about the project is that you will be writing an extended methods section and a condensed results section of an empirical research article, without writing a literature review or anything else.

Note that you can ask for help from instructors as you do the project. It is not like an exam in which you have to do each skill alone. You can send us drafts of your project at intermediate stages and we can give you feedback. We can also meet individually to discuss and advise about your project.

The requirements for the project can be changed for individual students on a case-by-case basis and only with permission from Anshul. If you think that a modification to the goals and/or requirements below may be more productive for your own professional goals and/or your goals in the PhD program, please discuss this with Anshul.

**The due date for the final project can be found in the course calendar.**

### 0.5.2 Project goals

**The goals of this final project are to…**

Present and interpret the results of *one* regression model that answers a clear and specific research question.

Run, interpret, and appropriately respond to *all* required diagnostic tests for this regression model and present the results of all tests.

### 0.5.3 Project requirements

**Here are the items you must present and tasks you must complete:**

Write a clear research question (RQ) that can be solved using regression analysis techniques. This research question should be a *single sentence* with a question mark at the end.^{4}

To answer your RQ, relevant concepts will have to be first measured, recorded in a dataset as variables, and then related to each other quantitatively. Present a DAG (directed acyclic graph) that shows the relationship between all of the variables that will be involved in answering your RQ.^{5} The DAG you present does not need to be aesthetically pleasing.^{6}

Identify a dataset that you will use to answer your research question. Clearly describe the dataset, including: a) population from which the data sample was drawn, b) unit of observation, c) all variables that you will use in your analysis and the unit of measurement of each variable, d) background information about the data.

Given the structure of the data and the RQ of interest, explain which type of regression model is most appropriate to answer your RQ and why. Also identify at least one other type of regression model that could also be used and explain why you instead chose the type of model that you did.

Present basic descriptive statistics that are relevant to your RQ. You should include at least one table and at least one figure/chart. All variables from your DAG must be included in these descriptive statistics.

Show the R code and results of *one* regression model that answers your RQ.^{7}

Run and present the results of *all* diagnostic tests that pertain to the type of regression model you ran. Your regression model *must pass* all of the tests. If your diagnostics show that your model specification violates any of the assumptions of your regression model, you should fix the problem and try again until your model no longer violates the assumptions.^{8}

Interpret the results of your regression model that are relevant to your RQ.

Briefly explain any limitations in your analysis.

Include all R code and results in your final submission.

Present all writing in well-written English.

Present everything in an aesthetically pleasing manner (with the exception of your DAG). It is recommended that you use an RMarkdown document, but this is not required.
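To make the "one regression model plus diagnostics" requirement more concrete, here is a minimal sketch using R's built-in `mtcars` data. The choice of variables (`mpg`, `wt`, `hp`) is purely an assumption for illustration and is not a suggested research question. The variance inflation factor (VIF) is computed by hand in base R so that no add-on package is needed; the full set of diagnostics required for your model type is covered in the relevant chapters, not here.

```r
d <- mtcars

# One OLS regression model: fuel efficiency (mpg) predicted by
# weight (wt) and horsepower (hp). Variables chosen only for illustration.
fit <- lm(mpg ~ wt + hp, data = d)
summary(fit)

# VIF by hand: regress each predictor on the other predictor(s),
# then compute 1 / (1 - R^2) from that auxiliary regression.
r2_wt <- summary(lm(wt ~ hp, data = d))$r.squared
r2_hp <- summary(lm(hp ~ wt, data = d))$r.squared
vif <- c(wt = 1 / (1 - r2_wt), hp = 1 / (1 - r2_hp))
vif
```

With only two predictors, the two VIF values are identical, because each auxiliary regression has the same R-squared (the squared correlation between `wt` and `hp`).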

### 0.5.4 Grading rubric

The final project will be graded according to the rubric below. Each criterion is worth a maximum score of two points unless otherwise noted.

Criterion | Score = 0 | Score = 1 | Score = 2 |
---|---|---|---|
Clear RQ | Unclear, more than one sentence, not a question. | Confusingly presented but understandable. | Clear, simply written, single sentence ending with a question mark. |
DAG | Not all relevant variables are shown, items that are not variables in the data are shown, arrows do not make sense, unblocked backdoor paths are present, graph is cyclical. | Minor errors. | DAG clearly shows hypothesized relationship among all variables, including any confounding, mediating, separately-acting, or unmeasurable variables. |
Population and sample | Relationship between sample and population is unclear, details about population are omitted. | Minor omissions, but overall description of the population is understandable. | It is very clear what the population is and how many observations from this population were sampled and then included in the dataset used in the project. |
Unit of observation | The meaning of each row in the data is not understandable from what is written. | Reader can figure out based on context, but a clear explanation is missing. | It is very clear what each row of the data means/represents. This is explicitly stated with no ambiguity or confusion. |
Variables used | The variables used in the analysis are not addressed. | Some variables are mentioned but not all. How each variable is measured is not clear. | Dependent variable and all independent variables are described in one sentence each. Unit of measure (and any relevant explanation of how a variable is coded in the data) is given for each variable. |
Background on data | It is not understandable where the data came from and from what context. | Few details are given about the data. | Clear explanation of where the data came from, when it was collected, who collected it, etc. |
Choose model | No explanation of why the presented regression model was chosen. No comparison to another regression model. Incorrect selection of model type. | An explanation may be there but it might be incorrect, or a comparison to another model is missing. | Logical explanation of the way the data is structured and how the selected regression model is best suited to that data structure. Clear explanation of why at least one alternative regression model was not used. |
Descriptive statistics | No or very few statistics presented. Statistics for irrelevant variables or information are presented. | Descriptive statistics do not cover all variables and observations relevant to the RQ. Only one of two required charts is included. | Descriptive statistics are presented for all variables listed in the DAG and used in the regression model. One well-made figure is presented. One well-made table is presented. |
Regression model | Code and/or summary is not shown for regression. Code does not accomplish the type of regression that was supposed to be used. | Only partial work or result is shown. Type of regression model is unclear. | Correct regression model result is shown along with appropriate R code to execute it. |
Multicollinearity test | Test not presented. | Results do not satisfy assumption. Results presented but interpreted incorrectly. | Regression model passes the VIF test of multicollinearity. A correlation matrix should be presented if necessary, but this is not required. |
Independence assumption | Assumption not considered. | Assumption is mentioned but incorrectly interpreted. | The issue of independence of sampled observations is carefully discussed. Non-independent data should be accounted for appropriately in the regression model. |
Other assumption 1 | Assumption not considered. | Assumption is mentioned but incorrectly interpreted. | Assumption is tested correctly and interpreted correctly. |
Other assumption 2 | Assumption not considered. | Assumption is mentioned but incorrectly interpreted. | Assumption is tested correctly and interpreted correctly. |
Other assumption 3 | Assumption not considered. | Assumption is mentioned but incorrectly interpreted. | Assumption is tested correctly and interpreted correctly. |
Interpret results | Many irrelevant details are given. Research question is not clearly answered. | Research question is answered but interpretation of results is not exactly correct. | Succinct interpretation of the portion of the regression output that pertains to the RQ. |
Limitations | Limitations are not addressed or are completely incorrect given the regression model used. | Limitations are partially addressed. | Multiple plausible limitations to the analysis and the conclusions we can draw from it are addressed. |
R code included | No R code is included. | Only partial R code for the results presented is given. | R code is included (displayed in final document) for all results that were generated using R. |
Writing quality (+) | Sentences and paragraphs are not formatted according to convention. Full sentences are not used much or well. | Minor grammar and/or spelling errors occur throughout, but the main points are understandable. | Writing is clear and succinct. It is easy to read quickly and understand the analysis and the results. No grammar or spelling errors. |
Aesthetics | Project is presented in a confusing manner. Order and flow of requested items is not logical. Unnecessary fonts, symbols, and formatting layout appear. | Minor blemishes and errors are visible in the submitted project. | Order of all content is clear and logical. Sections and sub-sections are logically and clearly marked. The write-up is easy to read. |

Items marked with a (+) in the table above will carry more weight than just two points. All other items have a maximum score of two points.

Your grade on the project will be the number of points achieved divided by the total number of points possible.

**If you are not satisfied with your grade on this project, you do have the option of taking an INCOMPLETE grade for the course.** Then, you will improve and re-submit your project in the weeks that follow the end of the course. We will re-grade the project and then put your improved final grade for the course into the grading system. Please be sure to communicate with course instructors about this option right away, if you think you might choose it.

## 0.6 Required materials

*No purchases are necessary (as long as you already have a computer that can run the necessary software). All necessary materials are either a) free and available online or b) provided to you by the Institute for free.*

**Texts/Readings**: There is no requirement for students to purchase any texts or readings for this class. All materials are available freely online.

**Statistical software**: All students are required to install R and RStudio on their own computer. R is a free and open-source statistical computing platform and RStudio is free software that makes R easier to use. Instructions for installing R and RStudio will be provided by the instructor. R and RStudio should run well on computers running Mac, Windows, and Linux operating systems.

**Videoconferencing software**: All students will have access to Zoom Pro based on enrollment in the PhD-HPEd program at MGHIHP; it should be installed on the student’s computer for group and/or individual meetings with the instructors and classmates.

**Spreadsheet software**: Some exercises or data manipulation in this course may need to be done in Excel or similar spreadsheet software. Students can complete this work in any spreadsheet software such as Microsoft Excel, Google Sheets within Google Drive (free), or Open Office (free).

## 0.7 Always remember

Below is a list of items that it is critical to always remember, once you have finished learning many of the topics in this textbook.

Every p-value is the result of a hypothesis test. Every time you see a p-value, figure out the null and alternate hypotheses. \(1-p\) is the level of support for the alternate hypothesis. Click here to read more about it.

Magic words to interpret linear regression results: *For every one unit increase in \(x\) (an independent variable), \(y\) (dependent variable) is predicted to change on average by \(b_x\) (slope) units, controlling for all other independent variables.* **DO NOT** reverse the positions of X and Y in these words. Read more.

Magic words to interpret odds ratios: For every one unit increase in \(x\) (an independent variable), the *odds* of \(y\) (event or outcome) are predicted to change on average by \(OR\) (odds ratio) *times*, controlling for all other independent variables. **DO NOT** reverse the positions of X and Y in these words. Read more.

All quantitative results are only telling us associations between variables, **NOT** causal relationships. Read more.

No regression model conclusions can be trusted for inference until the model passes **ALL** required diagnostic tests of model assumptions.

Use estimated effect sizes and confidence intervals, **NOT** p-values, to determine if quantitative results are meaningful and relevant. Read about effect size, read about confidence intervals.

Unmeasured confounding,^{9} selection bias, measurement error,^{10} and model misspecification are four possible problems that can make quantitative results un-trustworthy. Read more.

You must know the structure of your data very well. Otherwise, clustering, non-independence, Simpson’s paradox, and other problems could cause you to draw false conclusions. Read more.

All quantitative analyses have limitations. Be sure to report what you did transparently and responsibly.
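The "magic words" in the list above map onto specific numbers in R output. The sketch below shows where those numbers come from, using the built-in `mtcars` data. The variables are chosen only for illustration, the models have not been checked against any diagnostic tests, and the intervals for the logistic model are simple Wald intervals from `confint.default()`, which is one of several possible choices.

```r
d <- mtcars

# Linear regression: mpg (dependent variable) on wt and hp (independent variables)
fit_lm <- lm(mpg ~ wt + hp, data = d)
slopes <- coef(fit_lm)     # b_x: predicted average change in mpg per one-unit increase in x
slopes
confint(fit_lm)            # 95% confidence intervals for the intercept and slopes

# Logistic regression: am (0 = automatic, 1 = manual) on wt
fit_glm <- glm(am ~ wt, data = d, family = binomial)
odds_ratios <- exp(coef(fit_glm))   # exponentiated coefficients are odds ratios
odds_ratios
exp(confint.default(fit_glm))       # Wald 95% confidence intervals on the odds ratio scale
```

For example, the `wt` entry of `slopes` would be read as: for every one unit increase in `wt`, `mpg` is predicted to change on average by that many units, controlling for `hp`. The `wt` entry of `odds_ratios` would be read as: for every one unit increase in `wt`, the odds of a manual transmission are predicted to change on average by that many times.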

## 0.8 Orientation session

All students have been asked to attend an orientation/welcome session. This section contains some reference materials for use during that session. We will go through the items below together during our meeting. A series of videos that approximately demonstrate this orientation procedure are at the end of this section, in case you happen to be doing this procedure on your own instead of in a group.

Make sure you do the following orientation procedure on the same computer that you plan to use during the course.

### 0.8.1 Basic use

We’ll start with some basics to get everyone set up and navigate within RStudio.

**Task 1**: Download and install the appropriate version of R from https://cran.r-project.org/mirrors.html. Choose any link in your country or a nearby country.

You do NOT need to open R, which you just installed, on your computer. R just needs to be there. We won’t actually use it directly. We will use RStudio instead.

**Task 2**: Download and install the appropriate version of RStudio from https://rstudio.com/products/rstudio/download/. Click on the `Download` button for the free version of RStudio Desktop.

Now that you have RStudio downloaded and installed on your computer, you can open RStudio (not R, which you installed previously) on your computer and proceed to the next task.

**We will use RStudio (not R) for all of our work in this course.**

**Task 3**: Create a new R file.

To create the new R file, click on `File` -> `New File` -> `R Script`.

You can think of an R file as a Microsoft Word file but for data analysis.

**Task 4**: Save the new R file you just created. You could also create a new folder for all your work in our class, if you want, and save your R file in there.

**Task 5**: Find your new file within the `Files` tab within RStudio.

**Task 6**: Type `2+3` into a line of your new R file and then click the `Run` button. Locate the console and see what happened there. It should look approximately like this:

`2+3`

`## [1] 5`

**Task 7**: In a new line of your R file, type `2*3` and click on `Run`. You should see something like this in the console:

`2*3`

`## [1] 6`

As you can see, R can help us do basic arithmetic. You could also type `2-3` or `2/3` for subtraction and division. And you can of course change the numbers `2` and `3` to other numbers.
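If you want to try the other operators mentioned above, here is a small sketch. The last line is an extra illustration that is not part of the orientation tasks: parentheses control the order of operations, just as in ordinary arithmetic.

```r
2 - 3         # subtraction
## [1] -1

2 / 3         # division
## [1] 0.6666667

(2 + 3) * 4   # parentheses are evaluated first
## [1] 20
```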

### 0.8.2 Data loading and viewing

Now it’s time to load some data and look at it together!

**Task 8**: Let’s make a new dataset called `d`, using the code below.

Type this code into a new line of your R file and run it:

`d <- mtcars`

Now look in your `Environment` tab in RStudio. You should see a new data item called `d`.

The command above created a new data object called `d` and assigned it the data within `mtcars`. `mtcars` is a dataset that comes along with R. It is built-in for us to practice with, like we are doing now!
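If you would like to confirm from the console that the assignment worked, the following sketch is one way to do it. These checks are optional extras and are not part of the orientation tasks.

```r
d <- mtcars           # copy the built-in dataset into a new object named d

class(d)              # what kind of object d is
## [1] "data.frame"

dim(d)                # number of rows, then number of columns
## [1] 32 11

identical(d, mtcars)  # TRUE means d is an exact copy of mtcars
## [1] TRUE
```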

**Task 9**: You can run the command `?mtcars` to pull up some information about this data, if you would like.

**Task 10**: Type `View(d)` (with a capital `V`) into your R code file (in a new line, of course!) and run it.

A data viewer should pop up in a new tab, showing you your dataset in spreadsheet view.

Close the data viewer.

**Task 11**: Double click on `d` in the `Environment` tab. This should also bring up the data viewer.

**Task 12**: Now let’s inspect our data in the data viewer. What are some of your observations?

**Task 13**: What does each row in the data represent?

**Task 14**: What does each column in the data represent?
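
While you think about Tasks 13 and 14, the following optional sketch shows one way to count and list the rows and columns from the console instead of the data viewer. These are base R functions that the tasks themselves do not require.

```r
d <- mtcars

nrow(d)            # how many rows there are
ncol(d)            # how many columns there are
head(rownames(d))  # labels of the first six rows
colnames(d)        # labels of all of the columns
```

Comparing the row labels with the help page from `?mtcars` is one way to work out what a single row stands for.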

### 0.8.3 Descriptive statistics

Let’s go back to the R code file tab and do some simple analysis on our data.

**Task 15**: Type `d$` into your code file (in a new line, of course). A list of columns (variables) should pop up. Select `mpg`.

Now your line of code should say `d$mpg`.

Go ahead and run that line. You should get a result like this:

`d$mpg`

```
## [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
## [16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
## [31] 15.0 21.4
```

Above, the computer gave you all of the values in the `mpg` column of our data.
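
Here is a brief sketch of what you can do with a column once it has been pulled out with `$`. The name `mpg_values` is hypothetical and chosen only for this illustration.

```r
d <- mtcars

mpg_values <- d$mpg            # store the extracted column under its own name
length(mpg_values)             # how many values the column holds
mpg_values[1:5]                # just the first five values
identical(d$mpg, d[["mpg"]])   # double square brackets pull out the same column
```

The numbers in square brackets at the start of each output line, such as `[16]`, only tell you the position of the first value printed on that line.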

**Task 16**: Now let’s calculate the mean of the column (variable) `mpg`.

`mean(d$mpg)`

`## [1] 20.09062`

Remember, put the command above into a new line of your code file and then run it. You should then see the result above in your console.

**Task 17**: How about the standard deviation of `mpg`? See below.

`sd(d$mpg)`

`## [1] 6.026948`

We have now calculated the mean and standard deviation of the `mpg` variable.

**Task 18**: We can also get a summary of the `mpg` variable as follows. Please add it to your code file and try it out!

`summary(d$mpg)`

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.40 15.43 19.20 20.09 22.80 33.90
```

The output above gives us a number of metrics, although it does not include standard deviation.
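
To make the standard deviation a little less mysterious, the sketch below recomputes it by hand from the usual sample formula (the square root of the sum of squared deviations from the mean, divided by n - 1) and compares it with `sd()`. The names `x`, `n`, and `sd_by_hand` are made up for this illustration.

```r
d <- mtcars

x <- d$mpg
n <- length(x)                       # number of observations

# Sample standard deviation: square root of the sum of squared
# deviations from the mean, divided by n - 1
sd_by_hand <- sqrt(sum((x - mean(x))^2) / (n - 1))

sd_by_hand
sd(x)                                # should agree with the line above
```

In everyday work you would simply call `sd()`; the manual version is only there to show what that command calculates.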

**Task 19**: Try running the `mean`, `sd`, and `summary` commands for a different variable that you choose.
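
Once you have tried a single variable yourself, you may find it handy to see every column at once. The sketch below is one possible approach using the base R function `sapply()`, which this orientation does not otherwise cover; `col_means` and `col_sds` are hypothetical names.

```r
d <- mtcars

# Mean and standard deviation of every column at once
col_means <- sapply(d, mean)
col_sds   <- sapply(d, sd)

col_means
col_sds

# summary() also accepts the whole dataset
summary(d)
```

This works here because every column in `mtcars` is numeric; a dataset with text columns would need more care.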

### 0.8.4 Simple visualizations

We will continue with our brief exploration of our data by making some visualizations.

**Task 20**: Run the command `hist(d$mpg)` (again, make sure you’re adding the command in a new line of your code file). You should get a histogram in the `Plots` tab in RStudio.

`hist(d$mpg)`
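
The default histogram can be labeled more clearly. Below is a minimal sketch using the `main`, `xlab`, and `breaks` arguments of `hist()`; the title and label text are my own wording, and `breaks = 10` is treated by R as a suggestion rather than an exact number of bins.

```r
d <- mtcars

# Same histogram, with a title, an x-axis label, and a requested number of bins
h <- hist(d$mpg,
          main = "Fuel efficiency of cars in mtcars",
          xlab = "Miles per gallon (mpg)",
          breaks = 10)

# hist() also returns the bin counts, which add up to the number of cars
sum(h$counts)
```

Storing the result in `h` is optional; running `hist()` on its own still draws the plot.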

**Task 21**: And now let’s make a scatterplot using `plot(d$wt,d$mpg)`.

`plot(d$wt,d$mpg)`

Above, we see that the more a car weighs (`wt`), the worse its fuel efficiency (`mpg`) tends to be.
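
One way to put a number on the pattern in the scatterplot is the correlation coefficient. The sketch below adds axis labels to the plot and then calls the base R function `cor()`; the label text is my own wording (the `?mtcars` help page lists `wt` in units of 1000 lbs), and `r` is a hypothetical name.

```r
d <- mtcars

# Scatterplot with axis labels
plot(d$wt, d$mpg,
     xlab = "Weight (wt, in 1000 lbs)",
     ylab = "Miles per gallon (mpg)")

# Correlation between weight and fuel efficiency
r <- cor(d$wt, d$mpg)
r    # negative: heavier cars tend to have lower mpg
```

A correlation describes how two variables move together in this dataset; on its own it does not tell us that weight causes lower fuel efficiency.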

**Task 22**: Try making some other histograms and scatterplots!

### 0.8.5 Installing and loading packages

Finally, we will practice installing and loading packages to make sure this functionality is working correctly on your computer.

As always, make sure that you put all new lines of code into your R code file.

**Task 23**: Run the command `install.packages("car")`. You should then see a bunch of output in the console. The process might take a while.

Once the process is over, the `car` package should be installed on your computer. You can see a list of installed packages in the `Packages` tab, if you want. There will be many packages there.

**Task 24**: Finally, run the command `library(car)` to load the `car` package into your current R session.
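
If you are ever unsure whether a package is already installed, the following sketch shows one possible check. It uses the base R function `requireNamespace()` and only prints a message; it does not install or attach anything itself. The names `pkg` and `installed` are hypothetical.

```r
pkg <- "car"

# TRUE if the package can be found on this computer, FALSE otherwise
installed <- requireNamespace(pkg, quietly = TRUE)

if (installed) {
  message(pkg, " is installed. Load it with library(", pkg, ").")
} else {
  message(pkg, " is not installed yet. Run install.packages(\"", pkg, "\") first.")
}
```

Installing a package only needs to happen once per computer, while `library()` needs to be run again in every new R session.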

You have reached the end of the R and RStudio orientation procedure!

### 0.8.6 Video demonstration of orientation procedure

Below is a series of videos that approximately demonstrate the orientation procedure above.

#### 0.8.6.1 Basic use

The following video will walk you through the first steps of installing and using R and RStudio.

The video above can be viewed externally at https://youtu.be/_FVq-Vyyyfs.

#### 0.8.6.2 Data loading and viewing

This video demonstrates the process of loading and viewing data in R and RStudio:

The video above can be viewed externally at https://youtu.be/A9mQSqbx2hE.

#### 0.8.6.3 Descriptive statistics

In the video below, you can see how to calculate some basic descriptive statistics in R and RStudio.

The video above can be viewed externally at https://youtu.be/SNF01zoJ42I.

#### 0.8.6.4 Simple visualizations

The video below demonstrates how to make a few simple visualizations—histograms and scatterplots—in R and RStudio. It also shows one possible way to close RStudio on your computer.

The video above can be viewed externally at https://youtu.be/qS3Fh2S7Kd4.

#### 0.8.6.5 Installing and loading packages

The following video shows how to install a package and then load that package in R and RStudio:

The video above can be viewed externally at https://youtu.be/66HUNn4-92k.

## 0.9 Acknowledgments

The building blocks for this textbook are taken from A Minimal Book Example by Yihui Xie. This work would not be possible without this excellent guide from Yihui.

Much of the content for this book is influenced by the teaching and research conducted by my colleagues and students.

The various efforts of Roger Edwards, Nicole Danaher-Garcia, Grace Ming, Valay Maskey, Tony Sindelar, and all students who have taken the course HE-902 at MGHIHP in the past have been particularly instrumental in the development of this course and online textbook.

This is a footnote. You can go back to where you were reading by clicking on the little squiggle arrow right here: ↩︎

This paragraph includes input from experts at the JEDI office and instructional design team at MGHIHP, who were open to engaging in discussions about this topic.↩︎

But it could take less than an hour and it could take more than two hours. Both have happened before and are perfectly fine. The goal is not to be fast. The goal is to do everything at a high quality level.↩︎

There are no exceptions to this requirement.↩︎

You SHOULD include variables that you theoretically believe are related to your RQ, even if you don’t have those variables in your dataset.↩︎

You can even submit a photo of a hand-drawn DAG if you would like. All that matters is the content of your DAG, not the presentation.↩︎

In reality, you will likely run many regression models on your own to arrive at the one that fits your data the best. But you do not need to show all of this work in your final submission. If you do wish to show all of this additional work, you can include it in an appendix to your assignment, but this is not required.↩︎

It is common to take advice from an instructor during this process. Feel free to contact us during this or any other part of your final project work.↩︎

Unmeasured confounding can cause omitted variable bias.↩︎

Measurement error is also called information bias.↩︎