Chapter 7 June 27–July 3: Linear and logistic regression for predictive analytics

This week, our goals are to…

Grasp how traditional and predictive analytics (PA) uses of regression compare, especially the PA practice of separating training and testing data, training a model on the training data, and then plugging the testing data into the resulting regression equation.
Recognize how linear and logistic regression can be used for supervised machine learning.
Brainstorm about final project research question and methods.
Evaluate policy implications of PA research.

7.1 Regression and classification in supervised machine learning

This chapter, like most of the rest of this textbook, deals with supervised machine learning. Within supervised machine learning, we typically need to conduct either regression or classification, depending on what exactly we are trying to predict.

What is the difference between regression and classification?

Regression refers to a predictive model that predicts a numeric variable. For example, if you want to predict a student’s test score or how many grams of ice cream a person eats each day, you will need to use regression.
Classification refers to a predictive model that predicts a categorical variable. For example, if you want to predict whether or not a student will be admitted to an educational program, or a person’s favorite ice cream flavor, you will need to use classification.
Both regression and classification are supervised machine learning methods.

This reference table further points out differences and similarities:

Type of model	Dependent variable	Dependent variable examples	Methods examples
regression	numeric	test scores, stock prices, how fast somebody can run a mile, how many potatoes somebody will purchase at the store	linear regression, k-nearest neighbors regression, random forest regression, neural network regression, many others
classification	categorical	favorite ice cream flavor, admissions decision (yes or no), whether or not someone buys a product (yes or no), political party	logistic regression, k-nearest neighbors classification, random forest classification, neural network classification, many others

Notes:

Many types of machine learning work for both regression and classification. That’s why many methods appear in both the regression and classification sections of the table above.
Confusingly, even though logistic regression has the word “regression” in its name, in this context it is used for classification, not regression.
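
To see why logistic regression counts as classification, it can help to look at what the model actually outputs. The sketch below is a minimal illustration in Python, not the method used in any of this week's readings or videos. It assumes made-up data (a hypothetical standardized test score and a yes/no admission outcome), a single predictor, and a simple gradient descent fitting routine written for this example. The fitted model produces a probability between 0 and 1, and we turn that probability into a category by applying a cutoff (0.5 here, which is also an assumption).

```python
import math
import random


def sigmoid(z):
    """Map any real number to a probability between 0 and 1 (numerically stable)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)
    return exp_z / (1.0 + exp_z)


def fit_logistic_regression(xs, ys, learning_rate=0.1, n_iterations=2000):
    """Fit intercept b0 and slope b1 by gradient descent on the average log loss."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(n_iterations):
        grad_b0 = 0.0
        grad_b1 = 0.0
        for x, y in zip(xs, ys):
            error = sigmoid(b0 + b1 * x) - y  # predicted probability minus actual 0/1
            grad_b0 += error
            grad_b1 += error * x
        b0 -= learning_rate * grad_b0 / n
        b1 -= learning_rate * grad_b1 / n
    return b0, b1


# Hypothetical, randomly generated data (an assumption for illustration only):
# x is a standardized test score, y is 1 (admitted) or 0 (not admitted).
rng = random.Random(7)
scores = [rng.uniform(-3, 3) for _ in range(300)]
admitted = [1 if rng.random() < sigmoid(2 * x) else 0 for x in scores]

# Separate training data (first 240 rows) from testing data (last 60 rows).
train_x, test_x = scores[:240], scores[240:]
train_y, test_y = admitted[:240], admitted[240:]

# Train on the training data only.
b0, b1 = fit_logistic_regression(train_x, train_y)

# The model outputs probabilities; a 0.5 cutoff converts them into categories.
test_probabilities = [sigmoid(b0 + b1 * x) for x in test_x]
test_labels = [1 if p >= 0.5 else 0 for p in test_probabilities]

# Share of testing rows where the predicted category matches the actual one.
accuracy = sum(
    1 for pred, actual in zip(test_labels, test_y) if pred == actual
) / len(test_y)
print(f"intercept={b0:.2f}, slope={b1:.2f}, test accuracy={accuracy:.2f}")
```

The final prediction for each person is a category (1 or 0), not a number on a continuous scale, which is why this use of logistic regression belongs in the classification row of the table. In practice you would normally rely on a statistics package rather than hand-written gradient descent; the loop is spelled out here only to keep the example self-contained.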

7.2 Using linear and logistic regression for predictive analytics

Please watch the following videos, which demonstrate how regression methods can be used to conduct supervised machine learning:

Linear Regression in R. Linear Regression in R With Example. Data Science Algorithms. Simplilearn. https://www.youtube.com/watch?v=2Sb1Gvo5si8. Watch at the YouTube link or embedded below.

Logistic Regression in R. Logistic Regression in R Example. Data Science Algorithms. Simplilearn. https://www.youtube.com/watch?v=XycruVLySDg. Watch at the YouTube link or embedded below.
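
The videos work in R. As a complement, here is a minimal sketch in Python of the general workflow named in this week's goals: separate training and testing data, train a model on the training data, and then plug the testing data into the resulting regression equation. Everything specific in it is an assumption made for illustration: the data are randomly generated (hypothetical hours studied and test scores), there is only one predictor, the split is 80/20, and the helper functions are written just for this example.

```python
import random


def fit_simple_linear_regression(xs, ys):
    """Return (intercept, slope) from ordinary least squares with one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(intercept, slope, xs):
    """Plug each x into the regression equation: y_hat = intercept + slope * x."""
    return [intercept + slope * x for x in xs]


def rmse(actual, predicted):
    """Root mean squared error: the typical size of a prediction error."""
    n = len(actual)
    return (sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n) ** 0.5


# Hypothetical, randomly generated data (an assumption for illustration only).
rng = random.Random(42)
hours = [rng.uniform(0, 10) for _ in range(100)]
scores = [50 + 4 * h + rng.gauss(0, 5) for h in hours]

# Step 1: randomly separate the rows into training (80%) and testing (20%) data.
indices = list(range(len(hours)))
rng.shuffle(indices)
train_idx, test_idx = indices[:80], indices[80:]
train_x = [hours[i] for i in train_idx]
train_y = [scores[i] for i in train_idx]
test_x = [hours[i] for i in test_idx]
test_y = [scores[i] for i in test_idx]

# Step 2: train the model using only the training data.
intercept, slope = fit_simple_linear_regression(train_x, train_y)

# Step 3: plug the testing data into the fitted equation and measure the error.
test_predictions = predict(intercept, slope, test_x)
test_rmse = rmse(test_y, test_predictions)

print(f"score_hat = {intercept:.2f} + {slope:.2f} * hours")
print(f"RMSE on testing data: {test_rmse:.2f}")
```

The key point is that the testing rows play no part in estimating the intercept and slope. The error measured on them therefore gives a rough sense of how well the equation might predict scores for new students whose outcomes we do not yet know.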

We will now turn our attention to some examples in which linear and logistic regressions have been applied.

7.3 Scholarship involving linear and logistic regression

Below, we will see an example in which linear and logistic regression have been used for supervised machine learning.

Please read the following article:

Stimpson AJ, Cummings ML. Assessing intervention timing in computer-based education using machine learning algorithms. IEEE Access. 2014 Jan 31;2:78-87. DOI: https://doi.org/10.1109/ACCESS.2014.2303071. PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6730683.

7.4 Applying machine learning results in policy and the real world

Please read the following regarding how machine learning can be practically used to influence policy actions or decisions:

Read pp. 12–19, Section 4, “CASE STUDY – PREDICTING HYGIENE VIOLATIONS” as well as pp. 21–23, Section 5.3, “Examples from Public Policy Making” in: Steuer F. 2018. Machine Learning for Public Policy Making. How to Use Data-Driven Predictive Modeling for the Social Good. Student Paper Series 46. https://www.ibei.org/ibei_studentpaper46_162056.pdf

In the reading above, please pay extra attention to the two health policy rows within Table 5.1 on p. 23 (the examples about lead poisoning and heart failure).

Additionally, please skim these articles about how artificial intelligence can be used in education:

Lynch M. March 6 2019. “6 WAYS MACHINE LEARNING WILL REVOLUTIONIZE THE EDUCATION SECTOR”. The Tech Edvocate. https://www.thetechedvocate.org/6-ways-machine-learning-will-revolutionize-the-education-sector/
Lynch M. June 12 2018. “8 WAYS MACHINE LEARNING WILL IMPROVE EDUCATION”. The Tech Edvocate. https://www.thetechedvocate.org/8-ways-machine-learning-will-improve-education/.

7.5 Assignment

In this week’s assignment, you will respond to the examples you saw of supervised machine learning using linear and logistic regression. Note that “policy” here simply refers to how you would actually use a predictive model in your program, institution, profession, and/or context.

7.5.1 Discussion post, part A – Response to reading and videos

Below, you will prepare a new discussion post for the discussion board in D2L, based on the reading and videos from this week’s chapter.

Task 1: What were the main conclusions in the Stimpson and Cummings (2014) article you read?

Task 2: What do you think Stimpson and Cummings (2014) did well in their research?

Task 3: What do you think Stimpson and Cummings (2014) could have done better in their research?

Task 4: What are possible policy implications or opportunities for application of Stimpson and Cummings’s (2014) conclusions within the context in which they (Stimpson and Cummings) work? With this question, we are asking you to think narrowly about the uses of predictive analytics models within the context of a single institution or program of learning.

Task 5: What are possible policy implications or opportunities for application of Stimpson and Cummings’s (2014) conclusions in the field of education or health professions education as a whole? With this question, we are asking you to address the generalizability of Stimpson and Cummings’s (2014) work. How could the predictive models they made be used at other institutions or in other contexts? Would they be useful in a setting where you yourself work?

Task 6: Look at Table 4.3 within the hygiene violations case study in the Steuer (2018) reading. According to your own judgment, are these results good enough to warrant using this machine learning model to decide which restaurants to inspect or not in the future? Setting policy is all about comparing alternative approaches to each other and then choosing one. In your answer, consider the standard approach of how inspectors decide which restaurants to inspect if this machine learning model is not used. Is the standard approach or the machine learning approach better?

Task 7: Write any questions you have.

7.5.2 Discussion post, part B – Final project brainstorming

Now you will continue writing your discussion post, turning your attention to your plans for your final project in this class. Note that you do not need to do any data collection or analysis in this week’s work. Everything related to final project planning is hypothetical brainstorming at this stage.

Task 8: Which of the final project options for this class do you plan to choose? Refer to the final project information section in the first chapter of this textbook.

Task 9: Write your finalized research question for your final project in the course. Previously, you were asked to start brainstorming about a final project research question. Now it is time to finalize your plan so that you have enough time during the rest of the course to work on the project.

Task 10: How would answering this research question impact policy at your institution and/or within health professions education (HPEd), if at all?

Task 11: Carefully break down the ethical benefits and detriments of answering this question using PA methods.³²

Task 12: Would linear regression help answer your question? Would logistic regression help answer it?

Task 13: How would the traditional statistics use of regression analysis differ from the PA approach, in the context of your drafted research question?

Task 14: How does your research question compare to those addressed by Stimpson and Cummings (2014) and some of the research questions given as examples in the videos you watched?

Task 15: Which PA/ML methods do you think will be most useful for answering the research question in your final project?

Task 16: What data would be needed for this project to be executed? Do you have access to this data (it is fine if you do not)?

Task 17: Look again at Table 5.1 in the Steuer (2018) reading. Which practical or policy application in this table is most similar to the possible applications of your final project in this course (if you were to do it)?

You have reached the end of this week’s assignment. Please be sure to submit your responses to all tasks to the appropriate places (you can submit parts A and B above in a single discussion post).

³² In Task 11, you should address the broad consequences of using PA methods to answer your research question. For example, in the context of educational analytics: If you were to make a machine learning model that gave you predictions (guesses) of the final exam score of each of your students (without you knowing the actual value of that person’s final exam score), what would be the benefits and risks of using that prediction to make a decision about that student’s education? What if the analytics help you make the right decision? What if the analytics lead you to make the wrong decision?