Introductory predictive analytics and machine learning in education and healthcare
Last updated – 05 August 2022
Information and Reference
This textbook accompanies the course HE-930—Statistics/Predictive Analytics for Health Professions Education—in the PhD in HPEd program at MGH Institute of Health Professions. HE-930 is a data analytics course that introduces students to basic predictive analytics (PA) and machine learning (ML), with examples and applications related to education and healthcare. This online textbook is the main resource to guide you through the course HE-930.
Each chapter in this textbook contains readings or videos for you to consume as well as an assignment that you should complete and submit by the deadline in the course calendar.
My name is Anshul Kumar and I am the author/preparer of this textbook. You can reach me at email@example.com with any questions, comments, and/or suggestions for modifications to this textbook. I co-instruct the course HE-930 with Nicole Danaher-Garcia.
All of the materials here are available online for anybody to use. Those who are not part of the course HE-930 are welcome to use this textbook if it is useful. Please e-mail me any feedback you have.
I use a lot of footnotes like the one after this sentence.[1] You can read the footnote by clicking on the small number between this sentence and the previous one. Footnotes contain comments from me or extra information that might be helpful, but they are never required reading. It is fine for you to skip them.
Please note that many of the examples of data and research that you will encounter in this textbook use a binary, inappropriately narrow, and/or potentially problematic conceptualization of sex, gender, and other individual-level characteristics. My personal view is that this is often not the best way to organize data or present examples. Furthermore, our collective understanding of sex, gender, race, ethnicity, and other individual characteristics is constantly changing, and it is critical for all of us as data analysts and researchers to engage in related discussions and initiatives. I am always looking for materials that represent a more inclusive framework and I do update materials when possible. I welcome your input and suggested alternatives.[2]
0.1 Course calendar
The calendar below shows when assignments are due and when other tasks must be completed. Keep in mind that this calendar might change during the semester.
Each week, please read and follow all instructions in the corresponding chapter. Then complete the assignment at the end of the chapter and submit it in the appropriate location within D2L.
- DP = Discussion post assignment
- R = Data analysis assignment in R
- JC = Journal club will be held this week (approximately). Some JC meeting dates have already been decided; in those cases the day and time appear in the calendar, and you should already have received a calendar invitation by email for that meeting.
| Week of | Week/Chapter | Assignment and Activities |
|---|---|---|
| May 16 | 1 | DP: Introduction, examples of PA research, basic definitions |
| May 23 | 2 | DP: Data types and structures, pre-processing commands/algorithms |
| May 31 | 3 | Attend synchronous seminar sessions (approximately 2 hours per day for 3 days) |
| Jun 6 | 4 | R: Pre-processing and visualization |
| Jun 13 | 5 | DP: Unsupervised machine learning. JC Fri Jun 17, noon ET. |
| Jun 20 | 6 | R: Unsupervised machine learning |
| Jun 27 | 7 | DP: Linear and logistic regression for PA. |
| Jul 5 | 8 | R: Linear and logistic regression for PA. JC Tue Jul 5, 2p ET. |
| Jul 11 | 9 | Short presentation: Classification methods; ethics of ML. JC, Mon Jul 11 2022, 7p ET. Schedule oral exam. |
| Jul 18 | 10 | R: Regression and classification methods. Schedule oral exam, if not already scheduled. |
| Jul 25 | 11 | Peer review. Abstract draft. DP: Predictive model evaluation. JC Wed Jul 27, 2p ET. Do oral exam. |
| Aug 1 | 12 | R: Model validation and evaluation techniques. Do oral exam, if not already complete. |
| Aug 8 | 13 | DP: Future of predictive analytics in HPEd. JC Aug 8, 2p ET. Do oral exam, if not already complete. |
| Aug 15 | - | Final project must be submitted by Aug 20 for on-time completion. |
All assignments are due on the Sunday at the end of each week. For example, Week 4 of the class starts on Monday June 6 2022; the assignment for that week is due on Sunday June 12 2022.
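The due-date rule can be sketched as a small date calculation. This is only an illustration, written in Python rather than the R used in the course assignments. The function name `assignment_due_date` is hypothetical, and the sketch assumes the due date is the first Sunday on or after the date shown in the "Week of" column.

```python
from datetime import date, timedelta


def assignment_due_date(week_start: date) -> date:
    """Return the first Sunday on or after week_start (assumed rule)."""
    # date.weekday() returns 0 for Monday through 6 for Sunday.
    days_until_sunday = 6 - week_start.weekday()
    return week_start + timedelta(days=days_until_sunday)


# Week 4 starts on Monday, June 6, 2022.
print(assignment_due_date(date(2022, 6, 6)))  # 2022-06-12
```

The same calculation also handles weeks whose calendar entry is not a Monday, such as the week of May 31.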
Students are responsible for scheduling their oral exam on or near the scheduled week in the calendar, by e-mailing all course instructors.
0.3 Learning activities and grading
0.3.1 Interactions and communications with instructors
Office hours: We (the instructors of HE-930) will hold one or more office hours during most weeks of the course, on Zoom. These are optional (not required), informal, group sessions during which students in the class can ask an instructor questions, discuss class materials or assignments with each other, and work collaboratively on data analysis in R. You should receive calendar invitations to office hour sessions by email (to the same email addresses at which you receive classwide communications from us).
One-on-one meetings: You are welcome to request a one-on-one meeting with an instructor at any time for any reason. A one-on-one meeting is separate from office hours and is only between you and an instructor. Please email both Anshul and Nicole to request a meeting and one of us will meet with you as soon as our schedules permit.
Emails: Please feel free to email us at any time with any questions, concerns, feedback, or anything else. When you email us, please send your message to both Nicole and Anshul.
0.3.2 Description of curriculum
Here are descriptions of the activities you will do in this course (this is a list of everything you will need to do to complete the course successfully):
Weekly assignments: Homework assignments will involve learning about and reflecting upon how machine learning is used currently in education and healthcare, applying/practicing machine learning techniques in R, and exploring the ethical considerations related to using machine learning effectively and responsibly. Some assignments will help you prepare for your final project in this course. Assignments will be posted within this online textbook and students will submit completed assignments in D2L. Note that if you have data of your own that you are interested in analyzing, you can often use your own data instead of the provided data for the weekly assignments. Please discuss this with the instructors as desired. As long as you are adequately practicing the new skills each week, you can use any data of your choosing.
Journal clubs (as participant): In many weeks of the course, we will hold one-hour journal club meetings. You are required to attend at least two[3] journal club meetings during the course, as a participant. During journal clubs, we will discuss published articles, literature, and/or videos related to predictive analytics and machine learning.
Journal clubs (as leader): You are required to lead at least one[4] journal club meeting during the course. Typically, you will lead a journal club along with one or more of your classmates. Journal club leaders must submit a short journal club leading plan to the instructors at least one week prior to the date of the journal club. This plan should be less than one page and should a) articulate goals of what to review, discuss, or reinforce during the session and b) list specific discussion questions or prompts to be used to guide the journal club meeting.
Oral exam: You are required to complete one[5] oral exam in HE-930. This exam will be a one-on-one Zoom meeting between a student and an instructor. During the exam, you will be asked to show your ability to interpret and apply the concepts we have learned. See the Oral exam section (0.4) for more details.
Final project: You are required to plan and complete a final project. This is typically (although it doesn’t have to be) a written article containing either a proposal for a machine learning project or an original data analysis that uses machine learning methods. See the Final project section (0.5) for more details.
0.3.3 Motivation for class format (optional)
Reading this section is optional (not required). It explains some of the reasoning behind the way that this course is organized.
Why do we do journal clubs and an oral exam? Since this is an asynchronous course in which we do not have regular mandatory class meetings, we instructors do not have an opportunity to interact with each of you to make sure that you are comfortable with all of the content and skills. If this were an in-person course with regular class meetings, we would use a flipped classroom model in which you would read this textbook and watch videos before coming to class and then we would do lab-style activities or discussions during the class meeting. The journal clubs (especially the week in which you are a leader) and oral exam in this course are used to replace these in-person interactions and make sure that each student has at least some closer interaction with an instructor.
Why do we do a combination of a) discussion posts and journal clubs, and b) data analysis in R in this course? In our PhD program, there are two courses (this one, which is HE-930; and HE-902) related to quantitative analysis. In HE-902, we focus exclusively on implementing traditional statistical analysis methods in R, without spending much time exploring scholarship produced by others or new and creative uses of quantitative methods. To balance out the application-heavy focus of HE-902, in HE-930, we take a more exploratory approach to data analytics and we do have the flexibility to discuss scholarship produced by others, consider the ethics of data analytics, and brainstorm about new applications. Alongside this exploration, we also spend a few weeks running introductory machine learning procedures in R to gain exposure to the analytic methods that we are reading about and discussing.
0.3.4 Grade calculation
We will calculate your grade like this:
| Course requirement | Proportion of final grade |
|---|---|
| Discussion post assignments (7 posts, worth 4% each) | 28% |
| Data analysis assignments (5 assignments, worth 4% each) | 20% |
| Attend journal club (2 times, worth 3% each) | 6% |
| Lead journal club (1 time) | 6% |
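As a quick check of the arithmetic in the rows listed above, here is a minimal sketch (in Python for illustration; the course itself uses R). It tallies only the four requirements shown in the table and makes no assumption about any other course requirement. The variable and function names are hypothetical.

```python
# (number of items, percent each) for the rows listed in the table above.
listed_requirements = {
    "Discussion post assignments": (7, 4),
    "Data analysis assignments": (5, 4),
    "Attend journal club": (2, 3),
    "Lead journal club": (1, 6),
}


def row_weight(count: int, percent_each: int) -> int:
    """Percent of the final grade contributed by one table row."""
    return count * percent_each


weights = {name: row_weight(*spec) for name, spec in listed_requirements.items()}
listed_total = sum(weights.values())

for name, weight in weights.items():
    print(f"{name}: {weight}%")
print(f"Total of listed rows: {listed_total}%")  # Total of listed rows: 60%
```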
0.4 Oral exam
In this section, you can find additional information about the oral exam that you are required to take in HE-930.
0.4.1 Key details
The oral exam will include content and skills from the first 10 weeks/chapters in this course. You can do your oral exam in the 11th week or any time after that during the semester.
It is your responsibility to schedule your exam by emailing Nicole and Anshul with some times that work for you, at least a week or two before you want to do your exam.[6]
The focus of the exam is the overall analytics process and evaluating results. The best place to start reviewing is the video made by Anshul from Week 3. You also practiced this process on your own in R in Weeks 8 and 10. Make sure you understand Weeks 3, 8, and 10 before you do your exam.
You will not need to do anything in R on the exam. Instead, we will show you data and results once you tell us what you would do to generate them.
The exam is “open-book,” meaning that you can refer to any notes or course materials during the exam.
You are allowed to re-take the exam as many times as you would like, if you are not satisfied with your initial performance.
Since we are giving most of the questions to you before the exam (below), please be sure that you are ready to answer these questions before the start of your scheduled exam. If you do not feel ready to answer these questions, we recommend that you email both Nicole and Anshul to postpone your exam until you are ready, which is perfectly fine.
0.4.2 Example exam questions
The questions below are all very likely to be asked on the oral exam. Some additional questions might also be asked that do not appear below.
Part 1: Scenario-based questions
- Imagine this scenario: [during the exam, a scenario will be given to you]
- Write a research question that relates to this scenario and could be answered with predictive analytic methods.
- What is the dependent variable?
- How many levels does the dependent variable have, and what are they?
- Is this a classification or regression problem? Why?
- What first step(s) would you take to use predictive analytics to answer this question?
- Once the data is ready, what do you do next?
- What are some examples of predictive models that you could use?
- Let’s say we use [a predictive model from previous answer] to train a model. What would you do next?
- What are the columns in the testing data spreadsheet before and after using the trained model?
- Look at the testing data spreadsheet (this will be shown during the exam). Identify observations for which the model made correct and incorrect predictions.
- How would you assess how well the model performed overall, without having to look at the testing data spreadsheet?
- You will be shown an example of this model assessment strategy during the exam, once you answer the question above. Interpret each number that you see.
- Calculate the accuracy, sensitivity, and specificity of the model.
- Are these results good (useful) or bad (not useful)? How do you know? (In what ways are they good or bad?)
- What could we do next to try to improve our predictions?
- You will be shown results of the improvement and be asked to analyze whether the improvement was effective or not.
- Once you find a model that is satisfactory, how would you go about using it in practice, within the context of the scenario? Describe the logistical and analytic steps in detail.
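For the questions above about identifying correct and incorrect predictions and calculating accuracy, sensitivity, and specificity, the following minimal sketch may help with review. It is written in Python for illustration only (the course assignments use R, and you will not need to write code on the exam). The data are made up, all function names are hypothetical, and the value 1 is assumed to be the positive class.

```python
from collections import Counter


def label_prediction(actual, predicted, positive=1):
    """Label one observation as TP, FP, FN, or TN."""
    if predicted == positive:
        return "TP" if actual == positive else "FP"
    return "FN" if actual == positive else "TN"


def confusion_counts(actuals, predictions, positive=1):
    """Count each type of prediction across the testing data."""
    counts = Counter(
        label_prediction(a, p, positive) for a, p in zip(actuals, predictions)
    )
    return {key: counts.get(key, 0) for key in ("TP", "FP", "FN", "TN")}


# The functions below assume at least one actual positive and one actual
# negative observation, so that no denominator is zero.
def accuracy(c):
    return (c["TP"] + c["TN"]) / sum(c.values())


def sensitivity(c):
    return c["TP"] / (c["TP"] + c["FN"])


def specificity(c):
    return c["TN"] / (c["TN"] + c["FP"])


# Made-up testing data: actual outcomes and the model's predictions.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

counts = confusion_counts(actual, predicted)
print(counts)  # {'TP': 3, 'FP': 2, 'FN': 1, 'TN': 4}
print(f"accuracy = {accuracy(counts):.2f}")        # accuracy = 0.70
print(f"sensitivity = {sensitivity(counts):.2f}")  # sensitivity = 0.75
print(f"specificity = {specificity(counts):.2f}")  # specificity = 0.67
```

In this made-up example the model finds 3 of the 4 actual positives (sensitivity) and correctly rules out 4 of the 6 actual negatives (specificity). Whether those numbers are good enough depends on the scenario.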
Part 2: Additional questions
- What is the difference between supervised and unsupervised machine learning?
- Which one did you use in the scenario above?
- What could we (for example) learn about the data in the scenario by using the other type of machine learning (that we did not use in the scenario)?
0.5 Final project
0.5.1 Basic information
- All students in HE930 must complete a final project.
- The due date for the final project is in the course calendar.
- You are welcome to submit all or some parts of your final project before the due date, receive feedback, and make changes before your final submission.
- You might also have chances to make changes to your project after the due date, on a case-by-case basis.
- There are three options for the final project: data analysis project, proposal project, customized project. Each option is described below.
- It might be possible to use your project to write a portion of your prospectus or dissertation for the PhD program. This would have to first be discussed among you (the PhD student), the HE-930 instructors, and your academic advisor and/or dissertation committee chair.[7]
0.5.2 Data analysis project option
This project features an original data analysis that uses predictive methods. The goal of this project is to practice running predictive models on already-collected data and interpreting the results. Most of your time will be spent preprocessing and analyzing raw data and running predictive models on it.[8] This project option is about data analysis, not about writing a large amount.
Approximate steps for data analysis project option:
- Identify a research question (RQ). This question must be approved by instructors before you start working on the project.
- Explain why this project and RQ are useful (5-10 sentences).
- Identify a dataset with which you can answer your RQ.
- Present basic descriptive statistics that are relevant to your RQ.
- Create and evaluate the results of at least two machine learning models to address your RQ.
- Use a reasonable form of cross-validation to further evaluate your results. Ask instructors for tips about this if you want.
Rubric and requirements for data analysis project option:
- Write RQ(s) (single sentence with a question mark at the end, for each RQ)
- Explain need for this analysis (2-3 sentences)
- Any other contextual details you want to add (optional)
Methods section (medium size; not as elaborate as for publication)
- Brief description of dataset (2-4 sentences)
- Detailed description of DV (2-4 sentences)
- Summary of IVs. You don’t need to explain every single one. Just explain what the IVs are capturing/measuring overall. This can be in table or bullet-list form. (2-4 sentences or bullets)
- Summary of predictive methods attempted. Two models must be attempted. You do not need to explain how exactly each model works. Instead, focus on why the models you chose might be appropriate for your RQ. (2-4 sentences per model)
- Present at least two predictive models and their results. All R code must be included.
- Additional predictive models can be run but full details about them are not necessary to present.
- Unsupervised machine learning methods can be used, but only with prior approval from an instructor.
- For each of the two models presented, run the models as follows:
- single random training/testing split (no cross validation)
- 10-fold cross-validation[9] (these results are the main focus of the project, not the single random training/testing split)
- For each of the two models presented, include and comment on the following metrics for the cross-validated results only:[10]
- Confusion matrix
- F1 (also called F1 score or F-measure)
- ROC curve
- Other situation-specific relevant metrics to help you decide which model is best (optional)
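To make the 10-fold cross-validation and F1 requirements more concrete, here is a minimal sketch of what the computer does behind the scenes. It is in Python for illustration only. For the project itself, use the R approach demonstrated in the data analysis assignments rather than calculating anything manually. The function names are hypothetical, and the random fold assignment shown is one common way to form folds, not necessarily the one your R functions use.

```python
import random


def k_fold_indices(n_observations, k=10, seed=0):
    """Randomly assign row indices 0..n-1 to k folds of near-equal size."""
    indices = list(range(n_observations))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def train_test_splits(n_observations, k=10, seed=0):
    """Yield (train_indices, test_indices), holding out one fold at a time."""
    folds = k_fold_indices(n_observations, k, seed)
    for held_out in range(k):
        test = folds[held_out]
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, test


def f1_score(tp, fp, fn):
    """F1 is the harmonic mean of precision and recall (sensitivity).

    Assumes at least one true positive, so no denominator is zero.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# 25 made-up observations split into 10 folds: each observation is used
# for testing exactly once and for training in the other nine rounds.
splits = list(train_test_splits(25, k=10))
print(len(splits))                         # 10
print([len(test) for _, test in splits])   # [3, 3, 3, 3, 3, 2, 2, 2, 2, 2]

# F1 from made-up confusion matrix counts.
print(round(f1_score(tp=3, fp=2, fn=1), 3))  # 0.667
```

Setting `k` equal to the number of observations makes every fold a single observation, which is leave-one-out cross-validation.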
Results – based entirely on cross-validated results only
- Interpret results, briefly
- Compare the two (or more) models to each other using the cross-validated results only and select which one is best, using the most relevant/meaningful metrics (not necessarily just accuracy) (2 sentences)
- Provide commentary about the predictions made by the one best model
- Overall usefulness of the model (2 sentences)
- Present and comment on predictor variables (IVs) that are most important for making the predictions (variable importance) (1 sentence)
- Include at least four example observations from your raw data, their actual outcomes, their predicted outcomes, and your commentary on those predictions (1 sentence per example; include at least one example each of a false positive, false negative, true positive, and true negative prediction).
- Policy or program change implications of the predictions or findings (3-5 sentences)
Organization and presentation
- Even though writing a lot of text is not a focus of this project, all of the text, code, and results that you do present must be presented in an organized and aesthetically appropriate manner.
End of rubric for data analysis project option.
0.5.3 Project proposal project option
This project proposes but does not conduct a plan for using analytic methods to solve a problem in HPEd or answer a RQ relevant to HPEd and/or your work. The goal of this project is to propose a question that can be answered with predictive modeling, review relevant literature on both the RQ and the applications of predictive models to similar questions, and write a detailed methods section. This project option focuses on research design and writing publication-quality (or grant proposal quality) introduction, literature review, and methods sections. You must fully understand the analytic methods that you propose to use (including the ethical and policy implications of doing so) but you do not need to execute them.
Approximate steps for project proposal project option:
- Identify a research question (RQ). This question must be approved by instructors before you start working on the project.
- Write a detailed introduction/background section which will include: why this project and RQ are useful, policy implications, ethical considerations (both good and bad), what new contribution your study makes to existing peer-reviewed scholarship, and other contextual or relevant details.
- Conduct a thorough literature review about a) what is known about your RQ and b) the application of your proposed methods to other (but ideally similar) RQs.
- Thoroughly present (but do not actually conduct) all research methods that will allow you to answer your RQ.
Rubric and requirements for project proposal project option:
Introduction (approximately 500 words)
- Write RQ(s) (single sentence with a question mark at the end, for each RQ)
- Justify utility of RQ
- Provide relevant background details
- Address policy implications. Make a case for the practical applied use or another useful application of the results of your proposed study.
- Address ethical considerations, both good and bad.
- Highlight the new and unique contribution to scholarship/knowledge that your proposed work would make.
- Any other contextual or relevant details.
- This section must demonstrate that you know what you’re doing, meaning that you truly understand how and why predictive analytic methods are relevant to your research question.
- If you are writing this as a grant proposal, you may also need to address additional requirements specified by the funding entity.
Background and literature review (approximately 1000 words)
- Describe the current state of knowledge about your RQ, using a combination of:
- Citing, summarizing, and synthesizing existing peer-reviewed scholarship. Cite at least 12 relevant publications.
- Commenting on existing scholarship or lack thereof
- Provide examples and precedents for the application of your proposed methods to research questions that are similar to yours.
- Cite at least 5 relevant publications or sources (ideally from peer-reviewed journals or textbooks that have been adequately vetted)
- Explain how you will use methods similar to those you cited but apply them to a different RQ than the ones in the cited work.
- Explain differences between cited uses of your methods and what you plan to do. This can go in the methods section if you prefer.
Methods section (approximately 1000 words)
- Provide all necessary details that a reader would need to know in order to replicate the project you are proposing.
- The methods you propose must involve the analysis of raw data using predictive analytic methods we use in our class (or others, with instructor approval)
- These details should include (but not be limited to) the following:
- Data collection/acquisition/generating procedure
- Characteristics of dataset
- Detailed description of dependent variable(s) involved
- Detailed description of independent variables involved (or of each group or type of independent variable, if there are many)
- Data preparation and preprocessing steps
- Model selection, building, and creation process
- Results evaluation process, including commentary on which methods/metrics are likely to be most useful in choosing a good model, and why.
- Commentary on possible result outcomes and how you would then apply them.
- References to specific analytic tools (software, packages, anything else) you plan to use, and details on how those tools might be used.
Organization and presentation
- All writing and formatting must be of publication quality. Alternatively, if you wish, you could use this project to prepare a grant proposal submission. Please show the grant submission template/materials to an instructor (well in advance of writing the project, of course) and we will adapt the rubric above such that it is appropriate for the grant submission.
- Any figures or charts must be embedded within the text, be well-polished, and be easy to understand.
- Your submission needs to look polished and professional.
End of rubric for project proposal project option.
0.5.4 Customized project option
You are welcome to propose a customized option for your final project to the instructors. Ideally, the project will be useful to you in your work beyond this course alone, which is why this customized option is also available. If you want to explore doing a customized project, please contact all instructors to discuss this. If your project is approved, we will work together to develop a unique rubric for your customized project and you will then be evaluated according to that rubric’s co-created and agreed-upon requirements. Note that if you do a customized project, to maintain equity across all students, you will be required to do an amount of work equivalent to that of your classmates who do the other project options. This requirement will be reflected in the co-created rubric.
0.6 Required materials
No purchases are necessary for this course (as long as you already have a computer that can run the necessary software). All necessary materials are either a) free and available online or b) provided to you by the Institute for free.
Texts/Readings: There is no requirement for students to purchase any texts or readings for this class. All materials are available freely online.
Statistical software: All students are required to install R and RStudio on their own computer. R is a free and open-source statistical computing platform and RStudio is a free application that makes R easier to use. Instructions for installing R and RStudio are available at https://bookdown.org/anshul302/HE902-MGHIHP-Spring2020/#orientation. You can also ask the instructors for assistance with this. R and RStudio should run well on computers running Mac, Windows, and Linux operating systems.
Videoconferencing software: All students will have access to Zoom Pro based on enrollment in the PhD-HPEd program at MGHIHP; it should be installed on the student’s computer for group and/or individual meetings with the instructors and classmates.
Spreadsheet software: Some exercises or data manipulation in this course may need to be done in Excel or similar spreadsheet software. Students can complete this work in any spreadsheet software such as Microsoft Excel, Google Sheets within Google Drive (free), or OpenOffice (free).
This textbook is made using the Bookdown platform and is only possible as a result of the incredible work of Yihui Xie. For more information, see https://bookdown.org/.
Much of the content for this book is influenced by the teaching and research conducted by my coworkers and students.
The various efforts of Roger Edwards, Nicole Danaher-Garcia, Grace Ming, Valay Maskey, Tony Sindelar, Alper Bayazit, and all students who have taken the course HE930 at MGHIHP in the past have been instrumental in the development of this course and online textbook.
[1] This is a footnote. You can click on the arrow to go back to where you were, if you are viewing this in a web browser.↩︎
[2] This paragraph includes input from experts at the JEDI office and instructional design team at MGHIHP, who were open to engaging in discussions about this topic.↩︎
[3] You must attend two journal club meetings not including the meeting that you lead. So, you will end up attending at least three in total.↩︎
[4] This is in addition to the two times (at least) that you will attend journal club as a non-leader participant. You are required to attend a total of three journal clubs: one time as a leader and two times as a participant.↩︎
[5] The course HE902 has three oral exams. HE930 just has one.↩︎
[6] If you want to get your exam dates on the calendar even earlier, just e-mail the instructors and we can definitely do that.↩︎
[7] You must write a new portion of your prospectus or dissertation that is not already written. We can potentially be flexible with this, as long as you start by transparently sharing what you have already written.↩︎
[8] Raw data just means data in the spreadsheet format that we use in our class data analysis assignments.↩︎
[9] This can be modified after discussion with an instructor, especially in cases with very small or very large sample sizes. Very small sample sizes can use leave-one-out cross validation (LOOCV), where the number of folds is the same as the number of observations.↩︎
[10] None of these should be manually calculated. They can all be calculated by the computer for you, as demonstrated in the data analysis assignments.↩︎