Information and Reference

This textbook accompanies the course HE-930—Statistics/Predictive Analytics for Health Professions Education—in the PhD in HPEd program at MGH Institute of Health Professions. HE-930 is a data analytics course that introduces students to basic predictive analytics (PA) and machine learning (ML), with examples and applications related to education and healthcare. This online textbook is the main resource to guide you through the course HE-930.
My name is Anshul Kumar and I am the author/preparer of this textbook. You can reach me at akumar@mghihp.edu with any questions, comments, and/or suggestions for modifications to this textbook. I co-instruct the course HE-930 with Nicole Danaher-Garcia.
All of the materials here are available online for anybody to use. Those who are not part of the course HE-930 are welcome to use this textbook if it is useful. Please e-mail me any feedback you have.
I use a lot of footnotes like the one after this sentence.¹ You can read the footnote by clicking on the small, raised number between this sentence and the previous one. Footnotes contain comments from me or extra information that might be helpful, but you are never required to read them. It is fine to skip them.
Please note that many of the examples of data and research that you will encounter in this textbook use a binary, inappropriately narrow, and/or potentially problematic conceptualization of sex, gender, and other individual-level characteristics. My personal view is that this is often not the best way to organize data or present examples. Furthermore, our collective understanding of sex, gender, race, ethnicity, and other individual characteristics is constantly changing, and it is critical for all of us as data analysts and researchers to engage in related discussions and initiatives. I am always looking for materials that represent a more inclusive framework and I do update materials when possible. I welcome your input and suggested alternatives.²
At IHP, we recognize that students who observe religious and cultural holidays may not be able to complete their assignments or study for exams during these holidays. We will do our best to be flexible and accommodate your schedule related to your observation of holidays. Please just let us know about your expected schedule.

0.1 Course calendar, summer 2024

Everything you need to do in each week of this course can be found a) in the calendar below and b) in the chapter of this textbook corresponding to each week. Keep in mind that this calendar might change during the semester.

Each week starts on a Monday and ends on a Sunday.

Week of	Week/Chapter #	Assignment/Activities
May 13	1	Weekly assignment. 15 Flashcards.
May 20	2	Weekly assignment. 15 Flashcards. Journal club: Thu 5/23 4p ET.
May 27	3	Attend seminar sessions. 15 Flashcards.
Jun 3	4	Weekly assignment. 15 Flashcards.
Jun 10	5	Weekly assignment. 15 Flashcards. Journal club: Thu 6/13 3p ET.
Jun 17	6	Weekly assignment. 15 Flashcards. Journal club: Thu 6/20 1p ET.
Jun 24	7	Weekly assignment. 15 Flashcards.
Jul 1	8	Weekly assignment. 15 Flashcards.
Jul 8	9	Weekly assignment. 15 Flashcards. Schedule oral exam.
Jul 15	10	Weekly assignment. 15 Flashcards. Schedule oral exam, if not already scheduled.
Jul 22	11	Weekly assignment. 15 Flashcards.
Jul 29	12	Weekly assignment. 15 Flashcards. Do oral exam.
Aug 5	13	Weekly assignment. 15 Flashcards. Do oral exam, if not already complete.
Aug 12	None	Final project is due Aug 15

Optional clarifications:

All assignments are due on the Sunday at the end of each week. For example, Week 4 of the class starts on Monday, June 3, 2024; the assignment for that week is due on Sunday, June 9, 2024.
Students are responsible for scheduling their oral exam on or near the scheduled week in the calendar, by e-mailing all course instructors.

0.2 Useful links

Link to this HE-930 course textbook: https://bookdown.org/anshul302/paml/
D2L course website for HE-930 in summer 2024: https://mghinstitute.desire2learn.com/d2l/home/227377
Resources related to R and RStudio
- Responsible applied statistics in R for behavioral and health data (textbook used in HE-902 at MGHIHP) – https://bookdown.org/anshul302/HE902-MGHIHP-Spring2020/
- Appendix in Responsible Statistics textbook, containing many tips and tricks: https://bookdown.org/anshul302/HE902-MGHIHP-Spring2020/appendix-1-selected-additional-r-code-and-resources.html

MGHIHP Academic Calendar: https://www.mghihp.edu/academics/academic-resources-services/academic-calendars

0.3 Grading, Assignments, and Curriculum

0.3.1 All course requirements and grade calculation

Below is a list of all course requirements and how your final grade will be calculated.

Course requirement	Proportion of final grade
Discussion assignments (at least 7 posts/messages, worth 2.5% each)	17.5%
Oral exam	15%
Weekly flashcard practice (15 flashcards per week)	15%
Data analysis assignments (6 assignments, worth 3.5% each)	21%
Attend journal club (2 times, worth 3% each)	6%
Lead journal club (1 time)	5.5%
Final project	20%
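
To make the weighting concrete, here is a minimal sketch of how a final grade would follow from the table above. It is written in Python purely to show the arithmetic (coursework itself is done in R). The function name and the example scores are hypothetical, and any rounding or letter-grade policy is not specified here.

```python
# Weights copied from the course requirements table (percent of final grade).
WEIGHTS = {
    "discussion": 17.5,           # 7 posts x 2.5% each
    "oral_exam": 15.0,
    "flashcards": 15.0,
    "data_analysis": 21.0,        # 6 assignments x 3.5% each
    "attend_journal_club": 6.0,   # 2 sessions x 3% each
    "lead_journal_club": 5.5,
    "final_project": 20.0,
}


def final_grade(scores):
    """Weighted sum; each score is the fraction of credit earned (0.0 to 1.0)."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# The weights in the table account for the whole grade.
assert abs(sum(WEIGHTS.values()) - 100.0) < 1e-9

# Hypothetical student: full credit everywhere except
# 5 of the 6 data analysis assignments completed.
example_scores = {name: 1.0 for name in WEIGHTS}
example_scores["data_analysis"] = 5 / 6
print(round(final_grade(example_scores), 1))  # 96.5
```

Missing one data analysis assignment costs 3.5 points in this sketch, which matches the "worth 3.5% each" entry in the table.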

0.3.2 Interactions and communications with instructors

Office hours: We (the instructors of HE-930) will hold one or more office hours during most weeks of the course, on Zoom. These are optional (not required), informal, group sessions during which students in the class can ask an instructor questions, discuss class materials or assignments with each other, and work collaboratively on data analysis in R. You should receive calendar invitations to office hour sessions by email (to the same email addresses at which you receive classwide communications from us).
One-on-one meetings: You are welcome to request a one-on-one meeting with an instructor at any time for any reason. A one-on-one meeting is separate from office hours and involves only you and an instructor. Please email both Anshul and Nicole to request a meeting and one of us will meet with you as soon as our schedules permit.
Emails: Please feel free to email us at any time with any questions, concerns, feedback, or anything else. When you email us, please send your message to both Nicole and Anshul.

0.3.3 Description of curriculum

Here are descriptions of the activities you will do in this course (this is a list of everything you will need to do to complete the course successfully):

Weekly assignments: Homework assignments will involve learning about and reflecting upon how machine learning is used currently in education and healthcare, applying/practicing machine learning techniques in R, and exploring the ethical considerations related to using machine learning effectively and responsibly. Some assignments will help you prepare for your final project in this course. Assignments will be posted within this online textbook and students will submit completed assignments in D2L or wherever else it is specified. Note that if you have data of your own that you are interested in analyzing, you can often use your own data instead of the provided data for the weekly assignments. Please discuss this with the instructors as desired. As long as you are adequately practicing the new skills each week, you can use any data of your choosing. The purpose of weekly assignments is to discuss or practice skills. Coding assignments are mostly graded based on your completion of the work. An instructor will look at your submitted work for each assignment, but might not give you any feedback on it if you have fulfilled the requirements. If you ask a question or we notice any omissions or errors, then we will give you feedback accordingly.
Journal clubs (as participant): In many weeks of the course, we will hold one-hour journal club meetings. You are required to attend at least two³ journal club meetings during the course, as a participant. During journal clubs, we will discuss published articles, literature, and/or videos related to predictive analytics and machine learning.
Journal clubs (as leader): You are required to lead at least one⁴ journal club meeting during the course. Typically, you will lead a journal club along with one or more of your classmates. Journal club leaders must submit a short journal club leading plan to the instructors at least one week prior to the date of the journal club. This plan should be less than one page and should a) articulate goals of what to review, discuss, or reinforce during the session and b) list specific discussion questions or prompts to be used to guide the journal club meeting.
Oral exam: You are required to complete one⁵ oral exam in HE-930. This exam will be a one-on-one Zoom meeting between you and an instructor. During the exam, you will be asked to show your ability to interpret and apply the concepts we have learned. See section 0.4 (Oral exam) for more details.
Weekly flashcard practice: If you took the class HE-902 in spring 2024, you should already have an account for the Adaptive Learner App. If you do not already have an account, please refer to the “Set up Adaptive Learner App” section at https://bookdown.org/anshul302/HE902-MGHIHP-Spring2020/#set-up-adaptive-learner-app to make an account.⁶ This App has a section specifically for students in HE-930. The App contains flashcards with practice quiz questions for you to answer each week. You are required to answer at least 15 flashcard questions each week (with the option of answering more if you wish). This requirement might change during the course. The goals of this formative assessment activity are to a) build and retain knowledge about quantitative analysis topics and b) identify content and skill areas that require additional practice or attention. Please provide any feedback on this learning activity and associated technology to Anshul at akumar@mghihp.edu.

Final project: You are required to plan and complete a final project. This is an original data analysis project that uses machine learning methods. See section 0.5 (Final project) for more details.

0.3.4 Motivation for class format (optional)

Reading this section is optional (not required). It explains some of the reasoning behind the way that this course is organized.

Why do we do journal clubs and an oral exam? Since this is an asynchronous course in which we do not have regular mandatory class meetings, we instructors do not have an opportunity to interact with each of you to make sure that you are comfortable with all of the content and skills. If this were an in-person course with regular class meetings, we would use a flipped classroom model in which you would read this textbook and watch videos before coming to class and then we would do lab-style activities or discussions during the class meeting, during which we could all interact. The journal clubs (especially the week in which you are a leader) and oral exam in this course are used to replace these in-person interactions and make sure that each student has at least some synchronous interaction with an instructor.
Why do we do a combination of a) discussions and journal clubs, and b) data analysis in R in this course? In our PhD program, there are two courses (this one, which is HE-930; and HE-902) related to quantitative analysis. In HE-902, we focus exclusively on implementing traditional statistical analysis methods in R, without spending much time exploring scholarship produced by others or new and creative uses of quantitative methods. To balance out the application-heavy focus of HE-902, in HE-930, we take a more exploratory approach to data analytics and we do have the flexibility to discuss scholarship produced by others, consider the ethics of data analytics, and brainstorm about new applications. Alongside this exploration, we also spend a few weeks running introductory machine learning procedures in R to gain exposure to the analytic methods that we are reading about and discussing.

0.4 Oral exam

In this section, you can find additional information about the oral exam that you are required to take in HE-930.

0.4.1 Key details

The oral exam will include content and skills from the first 10 weeks/chapters in this course. You can do your oral exam in the 11th week or any time after that.
It is your responsibility to schedule your exam by emailing Nicole and Anshul with some times that work for you, at least a week or two before you want to do your exam.⁷
The focus of the exam is the overall analytics process and evaluating results. The best place to start reviewing is the video made by Anshul from Week 3. You also practiced this process on your own in R in Weeks 8 and 10. Make sure you are comfortable with Weeks 3, 8, and 10 before you do your exam.
You will not need to do anything in R on the exam. Instead, we will show you data and results, after you tell us what you would do to generate them.
The exam is “open-book,” meaning that you can refer to any notes or course materials during the exam.
You are allowed to re-take the exam multiple times, if you are not satisfied with your initial performance.
Since we are giving most of the questions to you before the exam (below), please be sure that you are ready to answer these questions before the start of your scheduled exam. If you do not feel ready to answer these questions, we recommend that you email both Nicole and Anshul to postpone your exam until you are ready, which is perfectly fine. You can also meet with us to ask any questions as you prepare.

0.4.2 Example exam questions

The questions below are all very likely to be asked on the oral exam. Some additional questions might also be asked that do not appear below.

Part 1: Scenario-based questions

Imagine this scenario: [during the exam, a scenario will be given to you]
Write a research question that relates to this scenario and could be answered with predictive analytic methods.
What is the dependent variable?
How many levels does the dependent variable have, and what are they?
Is this a classification or regression problem? Why?
What first step(s) would you take to use predictive analytics to answer this question?
Once the data is ready, what do you do next?
What are some examples of predictive models that you could use?
Let’s say we use [a predictive model from previous answer] to train a model. What would you do next?
What are the columns in the testing data spreadsheet before and after using the trained model?
Look at the testing data spreadsheet (this will be shown during the exam). Identify observations for which the model made correct and incorrect predictions.
How would you assess how well the model performed overall, without having to look at the testing data spreadsheet?
Once you answer the question above, you will be shown an example of this model assessment strategy. Interpret each number that you see.
Calculate the accuracy, sensitivity, and specificity of the model.
Are these results good (useful) or bad (not useful)? How do you know? (In what ways are they good or bad?)
What could we do next to try to improve our predictions?
You will be shown results of the improvement and be asked to analyze whether the improvement was effective or not.
Once you find a model that is satisfactory, how would you go about using it in practice, within the context of the scenario?
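
As an illustration of the tallying and calculation steps in the questions above, here is a minimal sketch that counts correct and incorrect predictions in a small testing dataset and then computes accuracy, sensitivity, and specificity. It is written in Python only to show the arithmetic (you will not need to run any code on the exam). The ten observations and the function names are hypothetical, and the sketch assumes a binary outcome in which 1 is the positive class.

```python
# Hypothetical testing data after prediction: one (actual, predicted) pair per
# observation, where 1 is the positive class and 0 is the negative class.
rows = [(1, 1), (1, 0), (0, 0), (0, 1), (1, 1),
        (0, 0), (0, 0), (1, 1), (0, 0), (1, 0)]


def confusion_counts(rows):
    """Tally the four cells of a 2x2 confusion matrix."""
    tp = sum(1 for actual, pred in rows if actual == 1 and pred == 1)
    fn = sum(1 for actual, pred in rows if actual == 1 and pred == 0)
    tn = sum(1 for actual, pred in rows if actual == 0 and pred == 0)
    fp = sum(1 for actual, pred in rows if actual == 0 and pred == 1)
    return tp, fn, tn, fp


def summary_metrics(tp, fn, tn, fp):
    """Accuracy, sensitivity, and specificity from confusion matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fn + tn + fp),  # share of all predictions that are correct
        "sensitivity": tp / (tp + fn),                # share of actual positives predicted positive
        "specificity": tn / (tn + fp),                # share of actual negatives predicted negative
    }


tp, fn, tn, fp = confusion_counts(rows)
metrics = summary_metrics(tp, fn, tn, fp)
print(tp, fn, tn, fp)  # 3 2 4 1
print(metrics)         # {'accuracy': 0.7, 'sensitivity': 0.6, 'specificity': 0.8}
```

In this made-up example the model finds 3 of the 5 actual positives (sensitivity 0.6) and 4 of the 5 actual negatives (specificity 0.8). Whether those numbers are useful depends on the scenario.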

Part 2: Additional questions

What is the difference between supervised and unsupervised machine learning?
Which one did you use in the scenario above?
What could we (for example) learn about the data in the scenario by using the other type of machine learning (that we did not use in the scenario)?

0.5 Final project

0.5.1 Basic information

All students in HE-930 must complete a final project.
The due date for the final project is in the course calendar.
You are welcome to submit all or some parts of your final project before the due date, receive feedback, and make changes before your final submission.
You might also have chances to make changes to your project after the due date, on a case by case basis.

0.5.2 Final project details

This project features an original data analysis that uses predictive methods. The goal of this project is to practice running predictive models on already-collected data and interpreting the results. Most of your time will be spent preprocessing and analyzing raw data⁸ and running predictive models on it. This project is about data analysis, not about writing a large amount.

Approximate steps for completing the project

Identify a research question (RQ). This question must be approved by instructors before you start working on the project.
Explain why this project and RQ are useful (5-10 sentences).
Identify a dataset with which you can answer your RQ.
Present basic descriptive statistics that are relevant to your RQ.
Create and evaluate the results of at least two machine learning models to address your RQ.
Use a reasonable form of cross-validation to further evaluate your results. Ask instructors for tips about this if you want.

Rubric and exact requirements

Introduction

Write RQ(s) (single sentence with a question mark at the end, for each RQ)
Explain need for this analysis (2-3 sentences)
Any other contextual details you want to add (optional)

Methods section (medium size; not as elaborate as for publication)

Brief description of dataset (2-4 sentences)
Detailed description of DV (2-4 sentences)
Summary of IVs. You don’t need to explain every single one. Just explain what the IVs are capturing/measuring overall. This can be in table or bullet-list form. (2-4 sentences or bullets)
Summary of predictive methods attempted. Two models must be attempted. You do not need to explain how exactly each model works. Instead, focus on why the models you chose might be appropriate for your RQ. (2-4 sentences per model)

Data analysis

Present at least two predictive models and their results. All R code must be included.
Additional predictive models can be run but full details about them are not necessary to present.
Unsupervised machine learning methods can be used, but only with prior approval from an instructor.
For each of the two models presented, run the models as follows:
- single random training/testing split (no cross validation)
- 10-fold cross-validation⁹ (these results are the main focus of the project, not the single random training/testing split)
For each of the two models presented, include and comment on the following metrics for the cross-validated results only:¹⁰
- Confusion matrix
- Specificity
- Sensitivity
- Accuracy
- F1 (also called F1 score or F-measure)
- ROC curve
- Other situation-specific relevant metrics to help you decide which model is best (optional)
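
To show what 10-fold cross-validation does mechanically, here is a minimal sketch of how observations can be dealt into folds and how F1 follows from confusion matrix counts. It is written in Python purely as an illustration. For the project itself you will work in R and let the computer calculate these metrics, as in the data analysis assignments. The function names, the sample size of 95, and the counts passed to `f1_score` are all hypothetical.

```python
import random


def k_fold_indices(n, k=10, seed=0):
    """Shuffle row numbers 0..n-1 and deal them into k folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def f1_score(tp, fp, fn):
    """F1 is the harmonic mean of precision and sensitivity (recall)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


folds = k_fold_indices(n=95, k=10)

for test_idx in folds:
    # Every fold serves exactly once as testing data; the other nine folds
    # together form the training data for that round.
    train_idx = [i for fold in folds if fold is not test_idx for i in fold]
    assert not set(train_idx) & set(test_idx)  # no observation is in both
    # Here you would train the model on train_idx and predict for test_idx.

# Hypothetical counts pooled across all ten testing folds.
print(round(f1_score(tp=40, fp=15, fn=10), 3))  # 0.762
```

Because each observation lands in a testing fold exactly once, the pooled predictions cover the whole dataset, and the cross-validated metrics are computed from them. Calling `k_fold_indices(n, k=n)` puts one observation in each fold, which corresponds to the leave-one-out approach mentioned in footnote 9.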

Results – based entirely on cross-validated results only

Interpret results, briefly
Compare the two (or more) models to each other using the cross-validated results only and select which one is best, using the most relevant/meaningful metrics (not necessarily just accuracy) (2 sentences)
Provide commentary about the predictions made by the one best model
Overall usefulness of the model (2 sentences)
Present and comment on predictor variables (IVs) that are most important for making the predictions (variable importance) (1 sentence)
Include at least four example observations from your raw data, their actual outcomes, their predicted outcomes, and your commentary on those predictions (1 sentence per example; include at least one example each of a false positive, false negative, true positive, and true negative prediction).
Policy or program change implications of the predictions or findings (3-5 sentences)

Organization and presentation

Even though writing a lot of text is not a focus of this project, all of the text, code, and results that you do present must be presented in an organized and aesthetically appropriate manner.

0.6 Materials, software, and accounts

No purchases are necessary for this course (as long as you already have a computer that can run the necessary software). All necessary materials are either a) free and available online or b) provided to you by the Institute for free.

Texts/Readings: There is no requirement for students to purchase any texts or readings for this class. All materials are available freely online.
Statistical software: All students are required to install R and RStudio on their own computer. R is a free and open-source statistical computing platform and RStudio is free software that makes R easier to use. Instructions for installing R and RStudio are available at https://bookdown.org/anshul302/HE902-MGHIHP-Spring2020/#orientation. You can also ask the instructors for assistance with this. R and RStudio should run well on computers running Mac, Windows, and Linux operating systems.
Flashcard account: All students are required to make an account in the Adaptive Learner App (or make an alternate arrangement with the instructors) to answer questions presented on at least 15 flashcards per week.
Videoconferencing software: All students will have access to Zoom Pro based on enrollment in the PhD-HPEd program at MGHIHP; it should be installed on the student’s computer for group and/or individual meetings with the instructors and classmates.
Spreadsheet software: Some exercises or data manipulation in this course may need to be done in Excel or similar spreadsheet software. Students can complete this work in any spreadsheet software such as Microsoft Excel, Google Sheets within Google Drive (free), or OpenOffice (free).

0.7 Acknowledgments

This textbook is made using the Bookdown platform and is only possible as a result of the incredible work of Yihui Xie. For more information, see https://bookdown.org/.
Much of the content for this book is influenced by the teaching and research conducted by my coworkers and students.
The various efforts of Roger Edwards, Nicole Danaher-Garcia, Grace Ming, Valay Maskey, Tony Sindelar, Alper Bayazit, Rupali Khadye-Hadshi, and all students who have taken the course HE-930 at MGHIHP in the past have been instrumental in the development of this course and online textbook.

¹ This is a footnote. You can click on the arrow to go back to where you were, if you are viewing this in a web browser.↩︎
² This paragraph includes input from experts at the JEDI office and instructional design team at MGHIHP, who were open to engaging in discussions about this topic.↩︎
³ You must attend two journal club meetings not including the meeting that you lead. So, you will end up attending at least three in total.↩︎
⁴ This is in addition to the two times (at least) that you will attend journal club as a non-leader participant. You are required to attend a total of three journal clubs: one time as a leader and two times as a participant.↩︎
⁵ The course HE-902 had three oral exams in 2024. HE-930 just has one.↩︎
⁶ If you prefer not to use the Adaptive Learner App, ask the instructors for an alternative way to complete this course requirement.↩︎
⁷ If you want to get your exam dates on the calendar even earlier, just e-mail the instructors and we can definitely do that.↩︎
⁸ Raw data just means data in the spreadsheet format that we use in our class data analysis assignments.↩︎
⁹ This can be modified after discussion with an instructor, especially in cases with very small or very large sample sizes. Very small sample sizes can use leave-one-out cross-validation (LOOCV), where the number of folds is the same as the number of observations.↩︎
¹⁰ None of these should be manually calculated. They can all be calculated by the computer for you, as demonstrated in the data analysis assignments.↩︎

Introductory predictive analytics and machine learning in education and healthcare