Chapter 7 June 26–July 2: Linear and logistic regression for predictive analytics

This week, our goals are to…

  1. Grasp the similarities between traditional and PA (predictive analytics) uses of regression, especially the separation of training and testing data, training a model with training data, and then plugging testing data into the regression equation.

  2. Recognize how linear and logistic regression can be used for supervised machine learning.

  3. Brainstorm about final project research question and methods.

  4. Evaluate policy implications of PA research.

  5. Create a basic dashboard in R with multiple charts, tables, and pages using the flexdashboard package.

7.1 Regression and classification in supervised machine learning

This chapter (like most of the rest of this textbook) focuses on supervised machine learning. Within supervised machine learning, we typically conduct either regression or classification, depending on what we are trying to predict.

What is the difference between regression and classification?

  • Regression refers to a predictive model that predicts a numeric variable. For example, if you want to predict a student’s test score or how many grams of ice cream a person eats each day, you will need to use regression.

  • Classification refers to a predictive model that predicts a categorical variable. For example, if you want to predict whether a student will be admitted or not to an educational program or a person’s favorite ice cream flavor, you will need to use classification.

  • Both regression and classification are supervised machine learning methods.

This reference table further points out differences and similarities:

| Type of model | Dependent variable | Dependent variable examples | Method examples |
|---|---|---|---|
| regression | numeric | test scores, stock prices, how fast somebody can run a mile, how many potatoes somebody will purchase at the store | linear regression, k-nearest neighbors regression, random forest regression, neural network regression, many others |
| classification | categorical | favorite ice cream flavor, admissions decision (yes or no), whether or not someone buys a product (yes or no), political party | logistic regression, k-nearest neighbors classification, random forest classification, neural network classification, many others |

Notes:

  • Many types of machine learning work for both regression and classification. That’s why many methods appear in both the regression and classification sections of the table above.
  • Confusingly, even though logistic regression has the word “regression” in its name, in this context it is used for classification, not regression.
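
To make the distinction between regression and classification concrete, here is a minimal sketch in R. It assumes the built-in mtcars dataset, which is not otherwise part of this chapter, and the variables are chosen only for illustration: mpg stands in for a numeric dependent variable and am (0 = automatic, 1 = manual) stands in for a categorical one. The 0.5 cutoff used to turn probabilities into classes is also an assumption.

```r
# Built-in example data (an assumption for illustration, not course data).
data(mtcars)

# Regression: predict a numeric variable (miles per gallon) with linear regression.
linear_model <- lm(mpg ~ wt + hp, data = mtcars)
predicted_mpg <- predict(linear_model, newdata = mtcars)
head(predicted_mpg)  # predictions are numbers

# Classification: predict a categorical variable (transmission type)
# with logistic regression.
logistic_model <- glm(am ~ wt, data = mtcars, family = binomial)

# type = "response" returns predicted probabilities between 0 and 1.
predicted_prob <- predict(logistic_model, newdata = mtcars, type = "response")

# Convert each probability into a predicted category using a 0.5 cutoff.
predicted_class <- ifelse(predicted_prob > 0.5, "manual", "automatic")
table(predicted_class)  # predictions are categories
```

Notice that the linear model returns a number for each car, while the logistic model returns a probability that we then convert into a category.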

7.2 Using linear and logistic regression for predictive analytics

Please watch the following videos which demonstrate how regression methods can be used to conduct supervised machine learning:

Linear Regression in R. Linear Regression in R With Example. Data Science Algorithms. Simplilearn. https://www.youtube.com/watch?v=2Sb1Gvo5si8. Watch at the YouTube link.

Logistic Regression in R. Logistic Regression in R Example. Data Science Algorithms. Simplilearn. https://www.youtube.com/watch?v=XycruVLySDg. Watch at the YouTube link.
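
The workflow named in this week's goals (separate training and testing data, train a model with the training data, then plug the testing data into the regression equation) might look roughly like the sketch below. This is not the code from the videos. It assumes the built-in mtcars dataset, a 70/30 split, an arbitrary random seed, and a 0.5 classification cutoff, all chosen only for illustration.

```r
data(mtcars)
set.seed(123)  # arbitrary seed so that the random split is reproducible

# 1. Separate the data into training and testing rows (70/30 is an assumption).
n <- nrow(mtcars)
train_rows <- sample(seq_len(n), size = floor(0.7 * n))
train <- mtcars[train_rows, ]
test  <- mtcars[-train_rows, ]

# 2. Train a linear regression model using only the training data.
linear_model <- lm(mpg ~ wt + hp, data = train)

# 3. Plug the testing data into the fitted regression equation.
test$predicted_mpg <- predict(linear_model, newdata = test)

# 4. Compare predictions with the actual values (root mean squared error).
rmse <- sqrt(mean((test$mpg - test$predicted_mpg)^2))
rmse

# The same steps with logistic regression, for a categorical (0/1) outcome.
logistic_model <- glm(am ~ wt, data = train, family = binomial)
test$predicted_prob <- predict(logistic_model, newdata = test, type = "response")
test$predicted_am <- as.integer(test$predicted_prob > 0.5)

# Confusion matrix: actual versus predicted class in the testing data.
confusion <- table(actual = test$am, predicted = test$predicted_am)
confusion

# Share of testing rows that were classified correctly.
accuracy <- mean(test$predicted_am == test$am)
accuracy
```

The key design choice is that the models never see the testing rows while they are being trained, so the error and accuracy figures describe how well the models predict data that is new to them.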

We will now turn our attention to some examples in which linear and logistic regressions have been applied.

7.3 Scholarship involving linear and logistic regression

Below, we will see an example in which linear and logistic regression have been used for supervised machine learning.

Please read the Stimpson and Cummings (2014) article, which this week’s assignment asks you to respond to.

7.4 Applying machine learning results in policy and the real world

Please read the following regarding how machine learning can be practically used to influence policy actions or decisions:

  • Read pp. 12–19, Section 4, “CASE STUDY – PREDICTING HYGIENE VIOLATIONS” as well as pp. 21–23, Section 5.3, “Examples from Public Policy Making” in: Steuer F. 2018. Machine Learning for Public Policy Making. How to Use Data-Driven Predictive Modeling for the Social Good. Student Paper Series 46. https://www.ibei.org/ibei_studentpaper46_162056.pdf

In the reading above, please pay extra attention to the two health policy rows within Table 5.1 on p. 23 (the examples about lead poisoning and heart failure).

These articles about how artificial intelligence can be used in education are optional (not required) to read:

7.5 Dashboards in R with flexdashboard

In this part of the chapter, we will learn how to make a very basic data dashboard in R using the flexdashboard package. Please have a look at the following dashboard examples that other people have made, if you have not done so already:

You already have experience with R Markdown files (RMD files) and knitting them into Word, PDF, or HTML formats. To make a dashboard, all we need to do is make a single RMD file and tell the computer that we want to knit it into a dashboard instead of a standard report. Within this RMD file, we will also be able to tell the computer the following details:

  1. How many pages the dashboard should have.
  2. The layout of each page, meaning how many boxes or elements we want on each page and where they are located.
  3. The content of each element on each page such as data, text, or anything else that can be produced in R.
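
As a rough illustration of the third point, the content of an element is simply whatever an R code chunk produces. The sketch below shows code that could sit inside two such chunks, one producing a chart and one producing a table. It assumes the built-in mtcars dataset, and the labels are hypothetical.

```r
data(mtcars)

# A chart element: a scatterplot of car weight against fuel economy.
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)",
     ylab = "Miles per gallon",
     main = "Fuel economy by weight")

# A table element: average miles per gallon for each number of cylinders.
mpg_by_cyl <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
names(mpg_by_cyl) <- c("Cylinders", "Average mpg")

# Printing the data frame displays it; knitr::kable(mpg_by_cyl) is one
# common way to format it more neatly in knitted output.
print(mpg_by_cyl)
```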

To continue learning about how to make a dashboard using the flexdashboard package, please download one of the following files:[30]

Open the file you downloaded in RStudio on your computer.

Within this file, notice the following features:

  1. This Dashboard Template file contains many comments within R code chunks. A comment is any text that comes after the # symbol. The computer will ignore any text or code that comes after the # symbol, so we can write notes and instructions into code chunks by simply writing the # symbol in front of any notes and instructions.

  2. There is a header at the very top, in the first few rows. This is where we tell the computer that we want to make a dashboard instead of a regular report. The header is the portion between the two lines of three hyphens (---).

  3. There is an “r setup” chunk at the top, like in other RMD files. Within this chunk, we can set the working directory, load any packages we need, load our data, and do any pre-processing steps. There are comments marking where each of these steps happens in the Dashboard Template file.

  4. We mark the start of new pages, columns, and rows using either 1, 2, or 3 hashtag/pound symbols (#). In flexdashboard, one symbol (#) starts a new page, two symbols (##) start a new column or row, and three symbols (###) start a new box within a column or row.

  5. R code chunks DO NOT show up in the final knitted output.
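
To show how these features fit together, the sketch below uses R to write a very small dashboard file of its own. This is not the Dashboard Template file from this chapter, which is longer and may be organized differently. The title, headings, file name, and use of the built-in mtcars data are all hypothetical choices made for illustration.

```r
# Build the chunk delimiter (three backticks) with strrep() so that this
# script contains no literal chunk delimiters of its own.
fence <- strrep("`", 3)

skeleton <- c(
  "---",                                    # start of the header
  'title: "My Example Dashboard"',          # hypothetical title
  "output: flexdashboard::flex_dashboard",  # knit to a dashboard, not a report
  "---",                                    # end of the header
  "",
  paste0(fence, "{r setup, include=FALSE}"),
  "# Load packages and data here; text after the # symbol is a comment.",
  "library(flexdashboard)",
  "data(mtcars)",
  fence,
  "",
  "# Page One",          # one # symbol: a new page
  "",
  "## Column",           # two # symbols: a new column
  "",
  "### Scatterplot",     # three # symbols: a new box
  "",
  paste0(fence, "{r}"),
  "plot(mtcars$wt, mtcars$mpg)",
  fence,
  "",
  "### Summary table",
  "",
  paste0(fence, "{r}"),
  "knitr::kable(head(mtcars))",
  fence,
  "",
  "# Page Two",
  "",
  "## Column",
  "",
  "### Histogram",
  "",
  paste0(fence, "{r}"),
  "hist(mtcars$mpg)",
  fence
)

# Save the skeleton as an RMD file in a temporary folder.
dashboard_file <- file.path(tempdir(), "ExampleDashboard.Rmd")
writeLines(skeleton, dashboard_file)

# To knit from the console instead of the Knit button (this requires the
# rmarkdown and flexdashboard packages), uncomment the next line:
# rmarkdown::render(dashboard_file)
```

Opening the saved file in RStudio and clicking Knit should produce a two-page dashboard, with two boxes on the first page and one box on the second.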

Click the Knit button to knit the Dashboard Template on your own computer and see what it looks like. In your assignment, you will be asked to modify this dashboard template. If knitting does not immediately work on your computer, try the following two steps:

  1. Run the following command one time in the console (NOT in your RMarkdown file): install.packages("flexdashboard")

  2. Click on Run and then Run All in RStudio, to run all of the chunks in your RMarkdown file in order. Let the computer run all of your code chunks this way, resolve any errors, and then try knitting again.
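
If you are not sure whether the needed packages are already installed, a small check like the one below can be run in the console first. This is only a sketch: the helper name missing_packages is hypothetical, and the list of needed packages assumes that knitting a dashboard requires both rmarkdown and flexdashboard.

```r
# Return the names of any packages in `pkgs` that cannot be loaded,
# which usually means they have not been installed yet.
missing_packages <- function(pkgs) {
  available <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!available]
}

needed <- c("rmarkdown", "flexdashboard")
to_install <- missing_packages(needed)
to_install  # an empty result means nothing is missing

# Run once in the console (NOT in your RMarkdown file) if anything is missing:
# install.packages(to_install)
```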

The following videos are optional (not required) to watch and might be useful as you make your own dashboard:

7.6 Assignment

In this week’s assignment, you will respond to the examples you saw related to supervised machine learning using linear and logistic regression. Note that policy just refers to how you will actually use a predictive model in your program, institution, profession, and/or context. You will also make a data dashboard in R.

Please submit discussion posts by email to your discussion groups.

7.6.1 Discussion post, part A – Response to reading and videos

Below, you will prepare a new discussion post, based on the reading and videos from this week’s chapter.

Task 1: What were the main conclusions in the Stimpson and Cummings (2014) article you read?

Task 2: What are possible policy implications or opportunities for application of Stimpson and Cummings’s (2014) conclusions within the context in which they (Stimpson and Cummings) work? With this question, we are asking you to think narrowly about the uses of predictive analytics models within the context of a single institution or program of learning.

Task 3: What are possible policy implications or opportunities for application of Stimpson and Cummings’s (2014) conclusions in the field of education or health professions education as a whole? With this question, we are asking you to address the generalizability of Stimpson and Cummings’s (2014) work. How could the predictive models they made be used at other institutions or in other contexts? Would it be useful in a setting where you yourself work?

Task 4: Look at Table 4.3 within the hygiene violations case study in the Steuer (2018) reading. According to your own judgment, are these results good enough to warrant using this machine learning model to decide which restaurants to inspect or not in the future? Setting policy is all about comparing alternative approaches to each other and then choosing one. In your answer, consider the standard approach of how inspectors decide which restaurants to inspect if this machine learning model is not used. Is the standard approach or the machine learning approach better?

Task 5: Write any questions you have.

7.6.2 Discussion post, part B – Final project brainstorming

Now you will continue writing your discussion post, turning your attention to your plans for your final project in this class. Note that you do not need to do any data collection or analysis in this week’s work. Everything related to final project planning is just hypothetical/brainstorming.

Task 6: Which of the final project options for this class do you plan to choose? Refer to the final project information section in the first chapter of this textbook.

Task 7: Write your finalized research question for your final project in the course. Previously, you were asked to start brainstorming about a final project research question. Now it is time to finalize your plan so that you have enough time during the rest of the course to work on the project.

Task 8: How would this research question impact policy at your institution and/or within HPEd, if at all?

Task 9: Carefully break down the ethical benefits and detriments of answering this question using PA methods.[31]

Task 10: What data would be needed for this project to be executed? Do you have access to this data (it is fine if you do not)?

Task 11: Look again at Table 5.1 in the Steuer (2018) reading. Which practical or policy application in this table is most similar to the possible applications of your final project in this course (if you were to do it)?

Please be sure to submit your responses to all tasks to the appropriate places (you can submit parts A and B above in a single discussion post) and then continue to the next part of the assignment.[32]

7.6.3 Make a data dashboard

In this part of the assignment, you will practice making your own data dashboard in R.

Task 12: Please download the file DashboardTemplate2.RMD from https://drive.google.com/file/d/1v96v8mZkvYfEpT7wlqsV9q9TUBqXSS-R/view?usp=sharing to your own computer.[33]

Task 13: Rename the file that you downloaded.

Task 14: Choose a dataset of your own that you want to make a dashboard about. It is ideal if it is something related to your own work, but you are also welcome to use data that has been provided during this class. Feel free to discuss this with the instructors.

Task 15: Make changes to the Dashboard Template file so that the dashboard is now about your chosen dataset. Each element in the Dashboard Template file does NOT need to match each element in your own dashboard. For example, in the Dashboard Template, you might find a scatterplot in one part of the dashboard. You are not required to put a scatterplot in this place in your own dashboard; you can change it to something else. You are also welcome to change the layout of one or more pages or even add more pages to your dashboard.

Task 16: You are required to change at least one element from the Dashboard Template to something else. Your dashboard cannot be identical to the template in every element. Furthermore, all text in your version, such as heading names and description text, should be changed from what appears in the Dashboard Template.

Task 17: Please submit both your RMD file and your final knitted dashboard file into the appropriate dropbox in D2L.

You have reached the end of this week’s assignment.


  30. At the start of this week of the course in summer 2023, only one file was made available here, called DashboardTemplate1.RMD. In the middle of the week, the file DashboardTemplate2.RMD was added here. It’s fine to use either of these files. DashboardTemplate2.RMD is slightly more user-friendly, but there aren’t big differences between the two.

  31. In this task, you should address the broad consequences of using PA methods to answer your research question. For example, in the context of educational analytics: If you were to make a machine learning model that gave you predictions (guesses) of the final exam score of each of your students (without you knowing the actual value of that person’s final exam score), what would be the benefits and risks of using that prediction to make a decision about that student’s education? What if the analytics help you make the right decision? What if the analytics lead you to make the wrong decision?

  32. On June 26, 2023, it was incorrectly written here that the assignment ends at this point.

  33. Note on June 28, 2023: If you instead downloaded the file DashboardTemplate1.RMD already, that is also fine.