Responsible applied statistics in R for behavioral and health data
Version – 25 July 2024
Information and Reference
Welcome to HE-902 at MGH Institute of Health Professions! Please watch the following welcome video:
This video can be viewed externally at https://youtu.be/o_DVZ8GplGY.
Please additionally read the following items:
This textbook accompanies the course HE-902—Advanced Statistical Modeling for Health Professions Education—in the PhD in HPEd program at MGH Institute of Health Professions. HE-902 is a statistics course that equips students to analyze healthcare and/or behavioral data in R. This online textbook is the main resource to guide you through the course.
The purpose of this textbook is to prepare users to responsibly apply data analysis methods to their work. Beyond just making a regression model and interpreting its results, it is important to know when we can and cannot trust these results. It is critical to know which claims you can and cannot make based on results that the computer gives you. The goal of this textbook is to equip you with all of the tools you need to responsibly apply and interpret regression models.
My name is Anshul Kumar and I am the author/preparer of this textbook. You can reach me at akumar@mghihp.edu with any questions, comments, and/or suggestions for modifications to this textbook.
All of the materials here are available online for anybody to use. Those who are not part of the course HE-902 are welcome to use this textbook if it is useful. Please e-mail me any feedback you have.
I use a lot of footnotes like the one after this sentence.1 You can read the footnote by clicking on the small-but-tall number in between this sentence and the previous one. Footnotes contain comments from me or extra information that might be helpful. But footnotes in this textbook are never required for you to read. It is fine for you to skip the footnotes and not read them.
Please note that many of the examples of data and research that you will encounter in this textbook use a binary, inappropriately narrow, and/or potentially problematic conceptualization of sex, gender, and other individual-level characteristics. My personal view is that this is often not the best way to organize data or present examples. Furthermore, our collective understanding of sex, gender, race, ethnicity, and other individual characteristics is constantly changing, and it is critical for all of us as data analysts and researchers to engage in related discussions and initiatives. I am always looking for materials that represent a more inclusive framework and I do update materials when possible. I welcome your input and suggested alternatives.2
At IHP, we recognize that students who observe religious and cultural holidays may not be able to complete their assignments or study for exams during these holidays. We will do our best to be flexible and accommodate your schedule related to your observation of holidays. Please just let us know about your expected schedule as soon as possible.
0.1 Course calendar 2024
Everything you need to do in each week of this course can be found a) in the calendar below and b) in the chapter of this textbook corresponding to each week. Keep in mind that this calendar might change during the semester.
Each week starts on a Monday and ends on a Sunday.
Week of | Week/Chapter # | Assignment/Activities |
---|---|---|
Jan 1 | 0 | Complete orientation procedure (no submission) |
Jan 8 | 1 | Complete orientation procedure (if not already). Assignment in R. 15 flashcards. |
Jan 15 | 2 | Assignment in R. 15 flashcards. |
Jan 22 | 3 | Assignment in R. 15 flashcards. |
Jan 29 | 4 | Assignment in R. 15 flashcards. |
Feb 5 | 5 | Assignment in R. 15 flashcards. Schedule oral exam #1. |
Feb 12 | 6 | Assignment in R. 15 flashcards. Do oral exam #1. |
Feb 19 | 7 | Assignment in R. 15 flashcards. Do oral exam #1 (if not already complete). |
Feb 26 | 8 | Assignment in R. 15 flashcards. Schedule oral exam #2. |
Mar 4 | 9 | Assignment in R. 15 flashcards. Do oral exam #2. |
Mar 11 | 10 | Assignment in R. 15 flashcards. Do oral exam #2 (if not already complete). |
Mar 18 | 11 | Assignment in R. 15 flashcards. |
Mar 25 | 12 | Assignment in R. 15 flashcards. Schedule oral exam #3. |
Apr 1 | 13 | Assignment in R. 15 flashcards. Do oral exam #3. |
Apr 8 | 14 | Assignment in R. 15 flashcards. Do oral exam #3 (if not already complete). |
Apr 15 | None | Final project is due April 17 2024. |
Optional clarifications:
Note that the week of April 8 2024 (Week 14) will be the final week in which there is new material given.
How do I schedule oral exams? Students are responsible for scheduling their three oral exams on or near the scheduled weeks in the calendar, by e-mailing all course instructors.
When are assignments due? All assignments are due on the Sunday at the end of each week. For example, Week 2 of the class starts on Monday January 15 2024; the assignment for that week is due on Sunday January 21 2024. Assignments can be submitted at any time on the day they are due.
Why is the schedule like this? You might have noticed that our class starts earlier than the rest of the classes in the spring semester at MGHIHP. This early start is optional but highly recommended. Based on prior student feedback and our experience teaching the course, we start the course early to help students get acquainted with R and RStudio and review some basic prerequisite topics. It is fine to start the course later if your schedule does not permit an early start. Even if you only start in Week 2, you will still be responsible for making up the work from all chapters in the curriculum. Please discuss with the instructors if you would like to work on an alternate timeline.
0.2 Useful links
- Link to this HE-902 course textbook: https://bookdown.org/anshul302/HE902-MGHIHP-Spring2020/ or http://tinyurl.com/he902stats.
- D2L site for HE-902 in spring 2024 at MGH Institute of Health Professions. You can find assignment submission drop boxes by clicking on: Assessments -> Assignments.
- Adaptive Learner App: https://educ-app-2.vercel.app/
- Chat GPT: http://chat.openai.com
0.3 Grading, assignments, and curriculum
0.3.1 Grade calculation
Your grade will be calculated as shown in the table below.
Type of Work | % of Grade |
---|---|
Weekly homework assignments | 35 |
Weekly flashcard practice completion | 15 |
Oral exams | 30 |
Final project | 20 |
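
As a rough illustration of how these weights combine, the sketch below computes a weighted course grade in R. The component scores are hypothetical values invented for this example; only the weights come from the table above.

```r
# Weights from the grade table above (percent of final grade)
weights <- c(homework = 35, flashcards = 15, oral_exams = 30, final_project = 20)

# Hypothetical component scores (each out of 100), made up for illustration
scores <- c(homework = 90, flashcards = 100, oral_exams = 85, final_project = 80)

# Weighted average: multiply each score by its weight, then divide by the total weight
final_grade <- sum(scores * weights) / sum(weights)
final_grade
## [1] 88
```

Because the weights sum to 100, dividing by sum(weights) simply converts the weighted total back to a 0 to 100 scale.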
0.3.2 Description of curriculum
Here are descriptions of the activities you will do in this course:
- Weekly homework assignments: These assignments will involve applying/practicing the statistical techniques taught each week using a provided dataset. Because of the cumulative nature of the course, these assignments may also involve applying skills and knowledge from previous weeks. Some assignments will help you prepare for your final project. Assignments will be posted within the online textbook and students will submit completed assignments in D2L. Note that if you have data of your own that you are interested in analyzing, you can often use your own data instead of the provided data for the weekly assignments. Please discuss this with the instructors as desired. As long as you are adequately practicing the new skills each week, it doesn’t matter which data you use. If you find yourself spending too much time on an assignment or spending more than 30 minutes on a single task in an assignment, please do not hesitate to contact us so that we can assist you in finishing each task. The purpose of weekly assignments is to discuss or practice skills. Coding assignments are mostly graded based on your completion of the work. An instructor will look at your submitted work for each assignment, but might not give you any feedback on it if you have fulfilled the requirements. If you ask a question or we notice any omissions or errors, then we will give you feedback accordingly.
- Weekly flashcard practice: During the orientation process for this course, you will sign up for the online Adaptive Learner App.3 This App has a section specifically for students in HE-902. The App contains flashcards with practice quiz questions for you to answer each week. You are required to answer at least 15 flashcard questions each week (with the option of answering more if you wish). This requirement might change during the course. The goals of this formative assessment activity are to a) build and retain knowledge about quantitative analysis topics and b) identify content and skill areas that require additional practice or attention. Your actual score on the flashcard quiz questions will NOT impact your final grade; you are only required to complete the flashcards each week. Please provide any feedback on this learning activity and associated technology to Anshul at akumar@mghihp.edu.
- Oral exams: You must take three oral exams, each occurring approximately during the weeks specified in the calendar. Each exam will be a one-on-one Zoom (online video) meeting between a student and an instructor. During the exam, you will be asked to show your understanding of the concepts we cover and you will be asked to demonstrate data analysis tasks on your own computer while sharing your screen on Zoom. The exams are “open-book,” meaning that you can refer to any notes or course materials during the exam. You are allowed to re-take an exam as many times as you would like, if you are not satisfied with your initial performance. If you want to get your exam dates on the calendar early, just e-mail the instructors and we can definitely do that.
- Final project: You will develop and execute an abbreviated version of a full quantitative research study and submit it as a final project at the end of the term. This project will include research questions, study hypotheses, sample size/power, gathering or finding the necessary data (this can be instructor-provided if you prefer), and identifying analyses necessary to test hypotheses. Students will then conduct and interpret a full data analysis and write up results, including tables or figures, as appropriate. Some of the homework assignments will help you complete this project. Students who turn in their final project early can receive feedback and then resubmit the project after incorporating the feedback (with the grade of the resubmission being recorded as the final project grade). Details about this project will be posted within the online textbook and students will submit completed projects in D2L.
0.4 Oral exams information
Note that information for each exam might be modified or updated in the few weeks prior to when it occurs in our course schedule.
0.4.1 Skills covered in the exams
Here are some notes about what to focus on as you prepare for each Oral Exam in this course:
Oral Exam #1: Oral Exam #1 relates to content up to and including Chapter 5. Please come to Oral Exam #1 prepared to run (in R) and interpret the results of OLS linear regression models, as well as run (in R) and interpret the results of diagnostic tests of OLS assumptions. This is the primary focus of Oral Exam #1. A few other questions will also be asked, but they are secondary to the primary focus. You should also be ready to create and interpret a two-way table and import data into R from an Excel file. If you do not feel ready to demonstrate these skills by the scheduled day of your exam, you should postpone your exam to a later date, which is completely acceptable (and use the additional time to ask any questions and/or review content). I recommend that you practice these skills on your own prior to the exam.
Oral Exam #2: Oral Exam #2 relates to content up to and including Chapter 8, with a primary focus on skills that were not covered in Oral Exam #1. Please come to Oral Exam #2 prepared to run (in R) and interpret the results of logistic regression models, regression models that model nonlinear data, and regression models that model interactions. This is the primary focus of Oral Exam #2. Questions about causal inference and model specification will also be asked, but they are secondary to the primary focus. You should also review and be prepared to interpret a two-way table. If you do not feel ready to demonstrate these skills by the scheduled day of your exam, you should postpone your exam to a later date, which is completely acceptable (and use the additional time to ask any questions and/or review content). I recommend that you practice these skills on your own prior to the exam.
Oral Exam #3: Oral Exam #3 relates to content up to and including Chapter 12, with a primary focus on skills that were not covered in Oral Exams #1 and #2. Please come to Oral Exam #3 prepared to run (in R) and interpret analyses related to multilevel and/or clustered data (this includes intra-class correlation, fixed effects, and random effects), and interpret the results of ordinal and multinomial logistic regression models. This is the primary focus of Oral Exam #3. A few other questions will also be asked, but they are secondary to the primary focus. If you do not feel ready to demonstrate these skills by the scheduled day of your exam, you should postpone your exam to a later date, which is completely acceptable (and use the additional time to ask any questions and/or review content). I recommend that you practice these skills on your own prior to the exam.
0.4.2 Details and tips about the exams
How should I prepare for an oral exam? What will it test?
- The exam will test your knowledge of and skills related to the quantitative methods we have learned in the chapters corresponding to each exam.
- The exam will test your ability to conduct some but not all of these methods in R.
- You can refer to any materials you would like—course book, your notes and assignments, online sources, anything else—during the exam. Therefore, I suggest that you refresh your memory about where you can find R code that you need, so that you can copy and paste it easily. You can also copy and paste from your homework assignments during the exam. Beyond that, I recommend reviewing the statistical concepts and their interpretations more than R code.
- More weight will be given to your knowledge and interpretation of statistical concepts than your ability to write R code.
What will happen during the exam?
- There is no time limit. The exam will likely take 1–2 hours,4 depending on your level of familiarity with the required skills and content.
- We will ask you a few theoretical questions.
- We will ask you to open a dataset in R and run a few data preparation commands in it.
- We will ask you to conduct a series of statistical tests in R and interpret the results.
- We will tell you after each task/question if you were correct or not. You can re-try some tasks on the spot if you wish. Other items can be flagged for re-testing later.
- If you need to have anything re-tested, we can schedule a separate time for that or you can re-test some topics during a subsequent exam (for example, if you miss any questions on Oral Exam #1, we can re-test you on those during your Oral Exam #2).
- We will try to be efficient during the exam so that we don’t run out of time.
- During the exams, there might occasionally be trick questions. An example of a trick question is: Is water blue or red? This is a trick question because none of the provided answer choices are correct, so the student answering the question must say: Water doesn’t have either of those colors. Water is actually transparent. Trick questions can help us determine if a student understands the broader context of a quantitative analysis scenario or a particular data analysis technique.
Why do we do oral exams in this course?
- Since ours is an asynchronous course in which we do not have regular class meetings, we instructors do not have an opportunity to interact with each of you to make sure that you are comfortable with all of the content and skills. If this were an in-person course with regular class meetings, we would use a flipped classroom model in which you would read this textbook and watch videos before coming to class and then we would do lab-style activities during the class meeting in which you would apply skills in R and interpret results, together as a group. The three Oral Exams in this course are used to replace these in-person interactions and make sure that each student has at least 3–6 hours of interaction with an instructor while doing quantitative analysis during the semester.
How and when do I schedule my oral exams?
- Please e-mail ALL course instructors with times that work for your schedule, a week or two before you want to do each exam.
- The recommended timings of the oral exams are given in the course calendar.
0.5 Final project details and requirements
This section contains specific expectations and requirements for the final project in this class.
These requirements have not yet been updated and might change slightly for spring 2024.
0.5.1 Description
The final project is not meant to be even close to a full quantitative research study. Another way to think about the project is that you will be writing an extended methods section and a condensed results section of an empirical research article, without writing a literature review or anything else.
Note that you can ask for help from instructors as you do the project. It is not like an exam in which you have to do each skill alone. You can send us drafts of your project at intermediate stages and we can give you feedback. We can also meet individually to discuss and advise about your project.
The requirements for the project can be changed for individual students on a case-by-case basis and only with permission from Anshul. If you think that a modification to the goals and/or requirements below may be more productive for your own professional goals and/or your goals in the PhD program, please discuss this with Anshul.
The due date for the final project can be found in the course calendar.
0.5.2 Project goals
The goals of this final project are to…
Present and interpret the results of one regression model that answers a clear and specific research question.
Run, interpret, and appropriately respond to all required diagnostic tests for this regression model and present the results of all tests.
0.5.3 Project requirements
Here are the items you must present and tasks you must complete:
Write a clear research question (RQ) that can be answered using regression analysis techniques. This research question should be a single sentence with a question mark at the end.5
To answer your RQ, relevant concepts will have to be first measured, recorded in a dataset as variables, and then related to each other quantitatively. Present a DAG (directed acyclic graph) that shows the relationship between all of the variables that will be involved in answering your RQ.6 The DAG you present does not need to be aesthetically pleasing.7
Identify a dataset that you will use to answer your research question. Clearly describe the dataset, including: a) population from which the data sample was drawn, b) unit of observation, c) all variables that you will use in your analysis and the unit of measurement of each variable, d) background information about the data.
Given the structure of the data and the RQ of interest, explain which type of regression model is most appropriate to answer your RQ and why. Also identify at least one other type of regression model that could also be used and explain why you instead chose the type of model that you did.
Present basic descriptive statistics that are relevant to your RQ. You should include at least one table and at least one figure/chart. All variables from your DAG must be included in these descriptive statistics.
Show the R code and results of one regression model that answers your RQ.8
Run and present the results of all diagnostic tests that pertain to the type of regression model you ran. Your regression model must pass all of the tests. If your diagnostics show that your model specification violates any of the assumptions of your regression model, you should fix the problem and try again until your model no longer violates the assumptions.9
Interpret the results of your regression model that are relevant to your RQ.
Briefly explain any limitations in your analysis.
Include all R code and results in your final submission.
Present all writing in well-written English.
Present everything in an aesthetically pleasing manner (with the exception of your DAG). It is recommended that you use an RMarkdown document, but this is not required.
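
To make the regression and diagnostic requirements above more concrete, here is a minimal sketch in R. It assumes the built-in mtcars data and an arbitrary model (mpg predicted by wt and hp) chosen only for illustration; it is not a template for your project. The VIF is computed by hand in base R as a stand-in for whichever function you prefer, and the diagnostics you actually need depend on the type of regression model you choose.

```r
# Built-in example data; a real project would use its own dataset
d <- mtcars

# One OLS regression model: mpg is the dependent variable
fit <- lm(mpg ~ wt + hp, data = d)
summary(fit)

# VIF for each independent variable, computed by hand:
# regress each predictor on the other predictor(s), then VIF = 1 / (1 - R^2)
r2_wt <- summary(lm(wt ~ hp, data = d))$r.squared
r2_hp <- summary(lm(hp ~ wt, data = d))$r.squared
vif_by_hand <- c(wt = 1 / (1 - r2_wt), hp = 1 / (1 - r2_hp))
vif_by_hand

# Residuals vs. fitted values, one common visual check for OLS models
plot(fitted(fit), resid(fit), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
```

With only two predictors the two VIF values are identical, because each is based on the same pairwise correlation. With three or more predictors, each auxiliary regression would include all of the other predictors.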
0.5.4 Grading rubric
The final project will be graded according to the rubric below. Each criterion is worth a maximum score of two points unless otherwise noted.
Criterion | Score = 0 | Score = 1 | Score = 2 |
---|---|---|---|
Clear RQ | Unclear, more than one sentence, not a question. | Confusingly presented but understandable. | Clear, simply written, single sentence ending with a question mark. |
DAG | Not all relevant variables are shown, items that are not variables in the data are shown, arrows do not make sense, unblocked backdoor paths are present, graph is cyclical. | Minor errors. | DAG clearly shows hypothesized relationship among all variables, including any confounding, mediating, separately-acting, or unmeasurable variables. |
Population and sample | Relationship between sample and population is unclear, details about population are omitted. | Minor omissions, but overall description of the population is understandable. | It is very clear what the population is and how many observations from this population were sampled and then included in the dataset used in the project. |
Unit of observation | The meaning of each row in the data is not understandable from what is written. | Reader can figure out based on context, but a clear explanation is missing. | It is very clear what each row of the data means/represents. This is explicitly stated with no ambiguity or confusion. |
Variables used | The variables used in the analysis are not addressed. | Some variables are mentioned but not all. How each variable is measured is not clear. | Dependent variable and all independent variables are described in one sentence each. Unit of measure (and any relevant explanation of how a variable is coded in the data) is given for each variable. |
Background on data | It is not understandable where the data came from and from what context. | Few details are given about the data. | Clear explanation of where the data came from, when it was collected, who collected it, etc. |
Choose model | No explanation of why the presented regression model was chosen. No comparison to another regression model. Incorrect selection of model type. | An explanation may be there but it might be incorrect, or a comparison to another model is missing. | Logical explanation of the way the data is structured and how the selected regression model is best suited to that data structure. Clear explanation of why at least one alternative regression model was not used. |
Descriptive statistics | No or very few statistics presented. Statistics for irrelevant variables or information are presented. | Descriptive statistics do not cover all variables and observations relevant to the RQ. Only one of the two required items (one table and one figure) is included. | Descriptive statistics are presented for all variables listed in the DAG and used in the regression model. One well-made figure is presented. One well-made table is presented. |
Regression model | Code and/or summary is not shown for regression. Code does not accomplish the type of regression that was supposed to be used. | Only partial work or result is shown. Type of regression model is unclear. | Correct regression model result is shown along with appropriate R code to execute it. |
Multicollinearity test | Test not presented. | Results do not satisfy assumption. Results presented but interpreted incorrectly. | Regression model passes the VIF test of multicollinearity. A correlation matrix may also be presented where helpful, but this is not required. |
Independence assumption | Assumption not considered. | Assumption is mentioned but incorrectly interpreted. | The issue of independence of sampled observations is carefully discussed. Non-independent data should be accounted for appropriately in the regression model. |
Other assumption 1 | Assumption not considered. | Assumption is mentioned but incorrectly interpreted. | Assumption is tested correctly and interpreted correctly. |
Other assumption 2 | Assumption not considered. | Assumption is mentioned but incorrectly interpreted. | Assumption is tested correctly and interpreted correctly. |
Other assumption 3 | Assumption not considered. | Assumption is mentioned but incorrectly interpreted. | Assumption is tested correctly and interpreted correctly. |
Interpret results | Many irrelevant details are given. Research question is not clearly answered. | Research question is answered but interpretation of results is not exactly correct. | Succinct interpretation of the portion of the regression output that pertains to the RQ. |
Limitations | Limitations are not addressed or are completely incorrect given the regression model used. | Limitations are partially addressed. | Multiple plausible limitations to the analysis and the conclusions we can draw from it are addressed. |
R code included | No R code is included. | Only partial R code for the results presented is given. | R code is included (displayed in final document) for all results that were generated using R. |
Writing quality (+) | Sentences and paragraphs are not formatted according to convention. Full sentences are not used much or well. | Minor grammar and/or spelling errors occur throughout, but the main points are understandable. | Writing is clear and succinct. It is easy to read quickly and understand the analysis and the results. No grammar or spelling errors. |
Aesthetics | Project is presented in a confusing manner. Order and flow of requested items is not logical. Unnecessary fonts, symbols, and formatting appear. | Minor blemishes and errors are visible in the submitted project. | Order of all content is clear and logical. Sections and sub-sections are logically and clearly marked. The write-up is easy to read. |
Items marked with a (+) in the table above will carry more weight than just two points. All other items have a maximum score of two points.
Your grade on the project will be the number of points achieved divided by the total number of points possible.
If you are not satisfied with your grade on this project, you do have the option of taking an INCOMPLETE grade for the course. Then, you will improve and re-submit your project in the weeks that follow the end of the course. We will re-grade the project and then put your improved final grade for the course into the grading system. Please be sure to communicate with course instructors about this option right away, if you think you might choose it.
0.6 Materials, software, and accounts
No purchases are necessary (as long as you already have a computer that can run the necessary software). All necessary materials are either a) free and available online or b) provided to you by the Institute for free.
Texts/Readings: There is no requirement for students to purchase any texts or readings for this class. All materials are available freely online.
Statistical software: All students are required to install R and RStudio on their own computer. R is a free and open-source statistical computing platform and RStudio is free software that makes R easier to use. Instructions for installing R and RStudio will be provided by the instructor. R and RStudio should run well on computers running Mac, Windows, and Linux operating systems.
Flashcard account: All students are required to make an account in the Adaptive Learner App (or make an alternate arrangement with the instructors) to answer questions presented on at least 15 flashcards per week.
Videoconferencing software: All students will have access to Zoom Pro based on enrollment in the PhD-HPEd program at MGHIHP; it should be installed on the student’s computer for group and/or individual meetings with instructors and classmates.
Generative AI account: When used responsibly, generative AI tools such as Chat GPT are very useful for conducting statistical analysis. Throughout this course, we will be recommending that students use Chat GPT for selected portions of the quantitative analysis process. We strongly recommend that all students make an account to use the free version10 of Chat GPT.
Spreadsheet software: Some exercises or data manipulation in this course may need to be done in Excel or a similar spreadsheet software. Students can complete this work in any spreadsheet software such as Microsoft Excel, Google Sheets within Google Drive (free), or Open Office (free).
0.7 Always remember
Below is a list of items that it is critical to always remember, once you have finished learning many of the topics in this textbook.
Every p-value is the result of a hypothesis test. Every time you see a p-value, figure out the null and alternate hypotheses. \(1-p\) is the level of support for the alternate hypothesis. Click here to read more about it.
Magic words to interpret linear regression results: For every one unit increase in \(x\) (an independent variable), \(y\) (dependent variable) is predicted to change on average by \(b_x\) (slope) units, controlling for all other independent variables. DO NOT reverse the positions of X and Y in these words. Read more.
Magic words to interpret odds ratios: For every one unit increase in \(x\) (an independent variable), the odds of \(y\) (event or outcome) are predicted to change on average by \(OR\) (odds ratio) times, controlling for all other independent variables. DO NOT reverse the positions of X and Y in these words. Read more.
All quantitative results are only telling us associations between variables, NOT causal relationships. Read more.
No regression model conclusions can be trusted for inference until the model passes ALL required diagnostic tests of model assumptions.
Use estimated effect sizes and confidence intervals—NOT p-values—to determine if quantitative results are meaningful and relevant. Read about effect size, read about confidence intervals.
Unmeasured confounding,11 selection bias, measurement error,12 and model misspecification are four possible problems that can make quantitative results un-trustworthy. Read more.
You must know the structure and data generating mechanism of your data very well. Otherwise, clustering, non-independence, Simpson’s paradox, and other problems could cause you to draw false conclusions. Read more.
All quantitative analyses have limitations. Be sure to report what you did transparently and responsibly.
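
The "magic words" above can be tried out with a short sketch in R. It assumes the built-in mtcars data and two arbitrary models chosen only for illustration (they are not taken from any later chapter): a linear regression of mpg on wt and hp, and a logistic regression of am on wt.

```r
d <- mtcars

# Linear regression: mpg is y, wt is x, and hp is a control variable
fit_lm <- lm(mpg ~ wt + hp, data = d)
coef(fit_lm)["wt"]        # slope b_x for wt
confint(fit_lm)["wt", ]   # 95% confidence interval for that slope

# Logistic regression: am (0 or 1) is the outcome, wt is the only predictor
fit_logit <- glm(am ~ wt, data = d, family = binomial)
exp(coef(fit_logit))      # exponentiated coefficients are odds ratios
```

In words, the first result would be read as: for every one unit increase in wt (which is measured in 1000 lbs), mpg is predicted to change on average by the printed slope, controlling for hp. The odds ratio for wt would be read the same way, with "the odds of am being 1" in place of mpg and "times" in place of "units". As the list above says, these are associations, not causal effects, and neither model should be trusted until it passes its diagnostic tests.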
0.8 Orientation procedures
Before starting week 1 of this course (meaning chapter 1 in this textbook), we ask that all students complete an orientation procedure, detailed below. You can complete this synchronously during a meeting with an instructor or on your own. If you do this procedure on your own, the videos embedded below might be helpful (although you are not required to watch them).
Make sure you do the following orientation procedure on the same computer that you plan to use during the course.
0.8.1 Basic use
We’ll start with some basics to get everyone set up and navigate within RStudio.
This video summarizes how you can install R and RStudio on your computer:
The video above can be viewed externally at https://youtu.be/6Xc2wuUVDWA.
Task 1: Download and install the appropriate version13 of R from https://cran.rstudio.com/.
You do NOT need to open R, which you just installed above, on your computer. R just needs to be there. We won’t actually use it directly. I know this might be confusing or frustrating. We will use RStudio instead, which you will install next.
Task 2: Download and install the appropriate version of RStudio from https://posit.co/download/rstudio-desktop/.
Now that you have RStudio downloaded and installed on your computer, please open RStudio (not R, which you installed previously) and proceed to the next task. We will use RStudio (not R) for all of our work in this course.
The following video will walk you through the first steps of using RStudio (once you have already installed it).
The video above can be viewed externally at https://youtu.be/_FVq-Vyyyfs.
Task 3: Create a new R file.
To create the new R file, click on File -> New File -> R Script.
You can think of an R file as a Microsoft Word file but for data analysis.
Task 4: Save the new R file you just created. You could also create a new folder for all your work in our class, if you want, and save your R file in there.
Task 5: Find your new file within the Files tab within RStudio.
Task 6: Type 2+3 into a line of your new R file and then click the Run button. Locate the console and see what happened there. It should look approximately like this:
2+3
## [1] 5
Task 7: In a new line of your R file, type 2*3 and click on Run. You should see something like this in the console:
2*3
## [1] 6
As you can see, R can help us do basic arithmetic. You could also type 2-3 or 2/3 for subtraction and division. And you can of course change the numbers 2 and 3 to other numbers.
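As a small extension of the arithmetic above, the sketch below tries a few more operations. The exponent operator ^ and the parentheses are not covered in the text above and are included only as an illustration. The object names are made up for this example, and the arrow <- simply stores a result under a name (it appears again in the next subsection).

```r
# Basic arithmetic in R; each result is stored under a (hypothetical) name
difference <- 2 - 3        # subtraction
quotient   <- 2 / 3        # division
power      <- 2^3          # exponent: 2 to the power of 3
grouped    <- (2 + 3) * 4  # parentheses control the order of operations

# Running a line that contains only an object's name prints its value
difference
quotient
power
grouped
```

You can change the numbers and run the lines again to see how the results change.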
0.8.2 Data loading and viewing
Now it’s time to load some data and look at it together!
This video demonstrates the process of loading and viewing data in R and RStudio:
The video above can be viewed externally at https://youtu.be/A9mQSqbx2hE.
Task 8: Let’s make a new dataset called d, using the code below.
Type this code into a new line of your R file and run it:
d <- mtcars
Now look in your Environment tab in RStudio. You should see a new data item called d.
With the command above, we created a new data object called d and assigned d to contain the data within mtcars. mtcars is a dataset that comes along with R. It is built in for us to practice with, like we are doing now!
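If you would like to confirm from the console that the assignment worked, a minimal sketch along the following lines could be used. The functions class() and identical() are not part of the orientation tasks and are shown here only as an optional check.

```r
# Copy the built-in mtcars data into a new object called d
d <- mtcars

# Ask R what kind of object d is; for mtcars this is a data frame
class(d)

# Check whether d is an exact copy of mtcars (TRUE means yes)
identical(d, mtcars)
```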
Task 9: You can run the command ?mtcars to pull up some information about this data, if you would like.
Task 10: Type View(d) (with a capital V) into your R code file (in a new line, of course!) and run it.
A data viewer should pop up in a new tab, showing you your dataset in spreadsheet view.
Close the data viewer.
Task 11: Double click on d in the Environment tab. This should also bring up the data viewer.
Task 12: Now let’s inspect our data in the data viewer. What are some of your observations?
Task 13: What does each row in the data represent?
Task 14: What does each column in the data represent?
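The data viewer is one way to inspect the data. As a complement, the sketch below shows a few console commands that may help you think about the questions above. These functions are not required for the orientation and are an optional addition, not part of the original tasks.

```r
# Start from the same dataset used in the tasks above
d <- mtcars

nrow(d)      # number of rows
ncol(d)      # number of columns
names(d)     # names of the columns (variables)
rownames(d)  # labels attached to the rows
head(d)      # first six rows of the data
```

Running ?mtcars, as in Task 9, describes what each column measures.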
0.8.3 Descriptive statistics
Let’s go back to the R code file tab and do some simple analysis on our data.
In the video below, you can see how to calculate some basic descriptive statistics in R and RStudio.
The video above can be viewed externally at https://youtu.be/SNF01zoJ42I.
Task 15: Type d$ into your code file (in a new line, of course). A list of columns (variables) should pop up. Select mpg.
Now your line of code should say d$mpg.
Go ahead and run that line. You should get a result like this:
d$mpg
## [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
## [16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
## [31] 15.0 21.4
Above, the computer gave you all of the values from just the column labeled mpg in our data.
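The $ symbol is one way to pull a single column out of a dataset. The sketch below shows two other standard ways to get the same column, only so that you recognize them if you see them elsewhere. The object names are made up for this illustration.

```r
# Start from the same dataset used in the tasks above
d <- mtcars

mpg_dollar  <- d$mpg        # the approach used in Task 15
mpg_bracket <- d[["mpg"]]   # the same column, selected by its name
mpg_index   <- d[, "mpg"]   # the same column, using [rows, columns] indexing

# Both comparisons return TRUE because all three approaches give the same values
identical(mpg_dollar, mpg_bracket)
identical(mpg_dollar, mpg_index)

# There is one mpg value for each row of d
length(mpg_dollar)
```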
Task 16: Now let’s calculate the mean of the column (variable) mpg.
mean(d$mpg)
## [1] 20.09062
Remember to put the command above into a new line of your code file and then run it. You should then see the result above in your console.
Task 17: How about the standard deviation of mpg? See below.
sd(d$mpg)
## [1] 6.026948
We have now calculated the mean and standard deviation of the mpg variable.
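To see what mean() and sd() are doing behind the scenes, the sketch below recomputes both by hand. It uses the usual sample formulas, with R's sd() dividing by n - 1 rather than n. The object names are made up for this illustration, and you do not need to do this in practice.

```r
# Start from the same dataset used in the tasks above
d <- mtcars
x <- d$mpg
n <- length(x)   # number of observations

# Mean: add up all of the values and divide by how many there are
mean_by_hand <- sum(x) / n

# Sample standard deviation: square root of the sum of squared
# deviations from the mean, divided by (n - 1)
sd_by_hand <- sqrt(sum((x - mean_by_hand)^2) / (n - 1))

# Compare with the built-in functions; both comparisons should be TRUE
all.equal(mean_by_hand, mean(x))
all.equal(sd_by_hand, sd(x))
```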
Task 18: We can also get a summary of the mpg variable as follows. Please add it to your code file and try it out!
summary(d$mpg)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.40 15.43 19.20 20.09 22.80 33.90
The output above gives us a number of metrics, although it does not include standard deviation.
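Since summary() leaves out the standard deviation, one possible workaround is to collect the statistics you want into a single named vector. This is only a sketch, and the name mpg_stats is made up for this example.

```r
# Start from the same dataset used in the tasks above
d <- mtcars

# Collect several descriptive statistics for mpg into one named vector
mpg_stats <- c(
  min    = min(d$mpg),
  median = median(d$mpg),
  mean   = mean(d$mpg),
  sd     = sd(d$mpg),
  max    = max(d$mpg)
)

# Round to two decimal places to make the output easier to read
round(mpg_stats, 2)
```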
Task 19: Try running the mean, sd, and summary commands for a different variable that you choose.
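If you want to look at several variables at once, a sketch like the one below might save some typing. The function sapply() and the choice of the columns mpg, wt, and hp are assumptions made for this illustration and are not part of the task above. Running the commands one variable at a time works just as well.

```r
# Start from the same dataset used in the tasks above
d <- mtcars

# Keep only a few columns (an arbitrary choice for this example)
chosen <- d[, c("mpg", "wt", "hp")]

# Apply mean() and sd() to each of the chosen columns
column_means <- sapply(chosen, mean)
column_sds   <- sapply(chosen, sd)

column_means
column_sds
```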
0.8.4 Simple visualizations
We will continue with our brief exploration of our data by making some visualizations.
The video below demonstrates how to make a few simple visualizations—histograms and scatterplots—in R and RStudio. It also shows one possible way to close RStudio on your computer.
The video above can be viewed externally at https://youtu.be/qS3Fh2S7Kd4.
Task 20: Run the command hist(d$mpg) (again, make sure you’re adding the command in a new line of your code file). You should then see a histogram in RStudio.
hist(d$mpg)
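The default histogram uses the code itself as its title and axis label. As an optional sketch, the main and xlab arguments of hist() can be used to add more readable text. The wording of the labels below is made up for this example.

```r
# Start from the same dataset used in the tasks above
d <- mtcars

# Draw the histogram with a custom title and x-axis label;
# storing the result in h also keeps the numbers behind the plot
h <- hist(d$mpg,
          main = "Fuel efficiency in mtcars",
          xlab = "Miles per gallon (mpg)")

h$breaks   # the boundaries of the bars
h$counts   # how many cars fall into each bar
```

The counts add up to the number of rows in the data, since every car falls into exactly one bar.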
Task 21: And now let’s make a scatterplot using plot(d$wt,d$mpg).
plot(d$wt,d$mpg)
Above, we see that heavier cars (wt) tend to have worse fuel efficiency (mpg).
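To put a number on the pattern in the scatterplot, one option is the correlation between the two variables. The cor() function is not part of the orientation tasks and is shown here only as a sketch. Keep in mind that a correlation describes an association, not a causal relationship.

```r
# Start from the same dataset used in the tasks above
d <- mtcars

# Correlation between weight and fuel efficiency
wt_mpg_cor <- cor(d$wt, d$mpg)

# A negative value means that higher wt tends to go along with lower mpg
wt_mpg_cor
```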
Task 22: Try making some other histograms and scatterplots!
0.8.5 Installing and loading packages
Finally, we will practice installing and loading packages to make sure this functionality is working correctly on your computer.
As always, make sure that you put all new lines of code into your R code file.
The following video shows how to install a package and then load that package in R and RStudio:
The video above can be viewed externally at https://youtu.be/66HUNn4-92k.
Task 23: Run the command install.packages("car"). You should then see a bunch of output in the console. The process might take a while.
Once the process is over, the car package should be installed on your computer. You can see a list of installed packages in the Packages tab, if you want. There will be many packages there.
Task 24: Finally, run the command library(car) to load the car package into your current R session.
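If you are not sure whether the installation worked, a sketch like the one below can check before loading. It uses requireNamespace(), which is a standard R function but is not part of the tasks above. The object name car_installed and the messages are made up for this example.

```r
# Check whether the car package is available on this computer
# (this does not install anything)
car_installed <- requireNamespace("car", quietly = TRUE)

if (car_installed) {
  # The package is available, so load it into the current session
  library(car)
  message("The car package is installed and loaded.")
} else {
  # The package is missing; it needs to be installed first
  message("The car package is not installed yet. Run install.packages(\"car\") first.")
}
```

Installing a package only needs to happen once per computer, while library(car) needs to be run in every new R session in which you want to use the package.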
You have reached the end of the R and RStudio setup process! Please continue the orientation procedure below.
0.8.6 Set up Adaptive Learner App
As part of this course, you are required to answer questions on at least 15 flashcards each week (you do not need to get all of the questions correct). The best way to do this is in the Adaptive Learner App,14 which you can see how to set up below.
This video shows the setup process and basic use for the Adaptive Learner App:
This video can also be viewed externally at https://youtu.be/hADmBJ0aD-g.
Task 25: Go to https://educ-app-2.vercel.app/ and click on sign up.
Task 26: Enter your registration information (any email address is fine), review the terms of service and privacy policy, and click sign up.
Task 27: You should receive an email asking you to verify your email address. The email should come from noreply@educ-app-2.firebaseapp.com. You might need to search in your junk folder for it. Click the link within the email to verify your account.
Task 28: Click on the tile that says HE902: Advanced Regression Analysis.
Note that each week when you do your flashcards, you do not need to submit a record of completion. The system will remember how many flashcards you did and we (the instructors) will be able to see it on our end.
Now you are ready to use the Adaptive Learner App! Continue the orientation process below.
0.8.7 Set up Chat GPT (optional)
Generative AI tools such as Chat GPT can be useful when doing data analysis, especially to write R code. During this course, we encourage you to experiment with using Chat GPT or a similar tool to write or debug R code. We recommend that you set up a free Chat GPT account at this time if you don’t already have one. It is not necessary to have a paid Chat GPT account or other paid account.
The following video demonstrates how to sign up for the free version of Chat GPT as well as some other example uses.
The video above can be viewed externally at https://youtu.be/0NUkAoO_yi4.
Task 29: Go to http://chat.openai.com
Task 30: Sign into your existing account or make a new free account.
Task 31: Put the following prompt into the Chat GPT chatbot and see what happens:
I have a dataframe in R called d. Please write the code to make a scatterplot with the variable wt on the x-axis and the variable mpg on the y-axis.
If you want, you can copy-paste the code from Chat GPT into the R file within RStudio that you were using earlier in this orientation process. Run it and see what happens!
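The chatbot's answer will vary from one attempt to the next, and it may use a different approach (for example, a plotting package) than the one shown here. For comparison, the hand-written sketch below is one way to satisfy the prompt using only base R. The axis labels and title are made up for this example.

```r
# Recreate the dataframe d in case it is not already in your Environment
d <- mtcars

# Scatterplot with wt on the x-axis and mpg on the y-axis
plot(d$wt, d$mpg,
     xlab = "wt",
     ylab = "mpg",
     main = "mpg versus wt")
```

Whatever code you receive, it is worth checking that wt really ends up on the x-axis and mpg on the y-axis before trusting the plot.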
0.9 Acknowledgments
The building blocks for this textbook are taken from A Minimal Book Example by Yihui Xie. This work would not be possible without this excellent guide.
Much of the content for this book is influenced by the teaching and research conducted by my colleagues and students.
The various efforts of Roger Edwards, Nicole Danaher-Garcia, Grace Ming, Valay Maskey, Tony Sindelar, Rupali Khadye-Hadshi and all students who have taken the course HE902 at MGHIHP in the past have been particularly instrumental in the development of this course and online textbook.
This is a footnote. You can go back to where you were reading by clicking on the little squiggle arrow right here: ↩︎
This paragraph includes input from experts at the JEDI office and instructional design team at MGHIHP.↩︎
If you prefer not to use the Adaptive Learner App, ask the instructors for an alternative way to complete this course requirement.↩︎
But it could take less than an hour and it could take more than two hours. Both have happened before and are perfectly fine. The goal is not to be fast. The goal is to do everything at a high quality level.↩︎
There are no exceptions to this requirement.↩︎
You SHOULD include variables that you theoretically believe are related to your RQ, even if you don’t have those variables in your dataset.↩︎
You can even submit a photo of a hand-drawn DAG if you would like. All that matters is the content of your DAG, not the presentation.↩︎
In reality, you will likely run many regression models on your own to arrive at the one that fits your data the best. But you do not need to show all of this work in your final submission. If you do wish to show all of this additional work, you can include it in an appendix to your assignment, but this is not required.↩︎
It is common to take advice from an instructor during this process. Feel free to contact us during this or any other part of your final project work.↩︎
It is also fine to use the paid version, but we do not believe this will give you any advantages over the free version, for the purposes of this class.↩︎
Unmeasured confounding can cause omitted variable bias in quantitative results.↩︎
Measurement error is also called information bias.↩︎
This is usually just the latest version.↩︎
Discuss with the instructors if you prefer to do your weekly flashcards without using the app.↩︎