Chapter 2 May 23–29: Data preparation and structures

Like the previous chapter in this textbook, this chapter contains everything that you need to do for Week 2 of HE-930.

IMPORTANT NOTES:

Next week is Week 3, when we have our synchronous seminar sessions. You are required to complete the preparation for next week by the time of our first synchronous session. You will not have the entire week to do next week’s preparation, which includes watching a movie on Netflix. Please plan accordingly.
You can turn in the assignment for Week 2 (this chapter) as late as Sunday June 5 2022, if you would like. That way, you can prioritize preparation for Week 3’s synchronous seminars.¹¹

This week, our goals are to…

Distinguish between data types.
Determine which preprocessing techniques are appropriate for which situations.
Become familiar with how learning analytics data are structured.

2.1 Data structures and pre-processing

Last week, we learned about the types of problems that predictive analytics (PA) and machine learning (ML) can help us solve. We also read examples of how ML is being used in practice within education and healthcare. Now that we have started to explore what PA and ML can do, we have to start learning how they actually work. One of the very first steps is to have the correct data available and to make sure that the data are organized and prepared in a way that ML algorithms can use. This week, we will explore how data should be organized and prepared for use in PA.

Here are a few notes to get started:

Data are often placed into a spreadsheet format, with each row representing an observation. An observation could be a person, a school, a piece of equipment, or anything else that you want to collect information about.
Keep in mind that “features” in a dataset refer to characteristics of each observation (row) in the data. Features are also called variables or columns.
There is no definitive list of pre-processing steps/actions. It depends on the specific PA/ML algorithms you use. We are learning about some common pre-processing steps this week. Later in the class, you’ll practice implementing some of them.
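To make the notes above concrete, here is a minimal sketch in Python, using only the standard library. It is an illustration under stated assumptions, not a required method for this course: the three student records are invented, and the helper names `one_hot` and `standardize` are hypothetical. It shows a small table in which each row is an observation and each key is a feature, followed by two common pre-processing steps: one-hot encoding a categorical feature and standardizing a numeric one.

```python
from statistics import mean, pstdev

# Hypothetical toy dataset: each dict is one observation (row) and
# each key is one feature (column). All values are invented.
students = [
    {"school": "GP", "age": 15, "absences": 2},
    {"school": "MS", "age": 17, "absences": 6},
    {"school": "GP", "age": 16, "absences": 4},
]

def one_hot(rows, feature):
    """Replace a categorical feature with one 0/1 column per category."""
    categories = sorted({row[feature] for row in rows})
    encoded = []
    for row in rows:
        # Copy every feature except the one being encoded.
        new_row = {k: v for k, v in row.items() if k != feature}
        for cat in categories:
            new_row[f"{feature}_{cat}"] = 1 if row[feature] == cat else 0
        encoded.append(new_row)
    return encoded

def standardize(rows, feature):
    """Rescale a numeric feature to mean 0 and standard deviation 1.

    Assumes the feature is not constant (a constant column would give
    a standard deviation of 0 and a division error).
    """
    values = [row[feature] for row in rows]
    mu, sigma = mean(values), pstdev(values)
    return [{**row, feature: (row[feature] - mu) / sigma} for row in rows]

prepared = standardize(one_hot(students, "school"), "absences")
for row in prepared:
    print(row)
```

Whether either step is needed depends on the algorithm you plan to use, which is why there is no single definitive list of pre-processing steps.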

Please read the following to learn more about data structures and pre-processing:

Re-read the “Data pre-processing and feature extraction” section on pp. 6 & 7 as well as Table 14 on p. 18 in the Akçapınar et al 2019 article, which you already read last week.
Boehmke & Greenwell. 2020. Chapter 3 “Feature & Target Engineering” in Hands-On Machine Learning with R. https://bradleyboehmke.github.io/HOML/engineering.html – Do your best to understand this. Feel free to mark anything that is confusing and then keep moving (and then ask later about any confusing parts).
Hale, J. 2018. “7 Data Types: A Better Way to Think about Data Types for Machine Learning.” Towards Data Science. https://towardsdatascience.com/7-data-types-a-better-way-to-think-about-data-types-for-machine-learning-939fae99a689.

Like last week, these items can also be accessed in the Week 2 section of the Content section within the D2L website for this course.

2.2 Example dataset

Now we will explore a dataset that is organized in a way that is well suited to PA and ML in the education context. It is called the Student Performance Dataset.

Please read the description of the dataset and the list of variables it contains at the following page:

Student Performance Data Set. UCI Machine Learning Repository. https://archive.ics.uci.edu/ml/datasets/student+performance.

Next, download the dataset to your own computer and open it. You can download it from D2L or from the page above by clicking on Data Folder and then student.zip.¹²

In the past, the version of the data from D2L has proven more user-friendly and better formatted. You should be able to open the file in Excel or similar spreadsheet software.

The source for the Student Performance Dataset is:

P. Cortez and A. Silva. Using Data Mining to Predict Secondary School Student Performance. In A. Brito and J. Teixeira Eds., Proceedings of 5th FUture BUsiness TEChnology Conference (FUBUTEC 2008) pp. 5-12, Porto, Portugal, April, 2008, EUROSIS, ISBN 978-9077381-39-7. http://www3.dsi.uminho.pt/pcortez/student.pdf. – You are not required to read this article.

You will refer to the student-por.csv dataset in your assignment this week.
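If you prefer to inspect the file with code rather than a spreadsheet, the following is a minimal sketch in Python using the standard `csv` module. It makes several assumptions: the helper names `read_table` and `summarize` are hypothetical, the sample rows are invented stand-ins so that the sketch runs without the real file, and the semicolon delimiter matches the copy distributed by the UCI repository (the D2L copy may use a different separator, in which case you would pass a different `delimiter`).

```python
import csv
import io

def read_table(handle, delimiter=";"):
    """Read delimited text into a list of dicts, one dict per row."""
    return list(csv.DictReader(handle, delimiter=delimiter))

def summarize(rows):
    """Return the number of observations and the list of feature names."""
    features = list(rows[0].keys()) if rows else []
    return len(rows), features

# Stand-in text so the sketch runs without the real file; values are invented.
sample = io.StringIO("school;sex;age;G3\nGP;F;18;11\nMS;M;17;12\n")
rows = read_table(sample)
n_rows, feature_names = summarize(rows)
print(n_rows, feature_names)

# With the real file (assumed to be in your working directory):
# with open("student-por.csv", newline="") as f:
#     rows = read_table(f)
```

Note that the `csv` module reads every value as text, so a numeric feature such as `age` arrives as the string `"18"`. Deciding which columns are numeric and which are categorical is itself an early pre-processing decision.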

2.3 Assignment

In this week’s assignment, you will brainstorm about research questions that could be answered using predictive analytics (PA) and machine learning (ML) and consider which data would be needed to answer those questions.

Unlike in most weeks, you are allowed to take an additional week to submit this assignment. If this were a normal week, the deadline would be May 29 2022, but instead the deadline has been extended to June 5 2022. This is to give you more time to prepare for next week’s synchronous seminar sessions, if needed.¹³

2.3.1 Response to Week 1 discussion

Task 1: In D2L, please go back to the Week 1 discussion board and write a response of at least 100 words to at least one discussion post written by a classmate.

2.3.2 Discussion post

Now you will prepare a new discussion post for the Week 2 discussion board in D2L.

Task 2: After looking at the Student Performance Dataset in Excel and reading its description, identify a research question about these data that could be addressed using ML. Think about the various research questions you encountered last week as you brainstorm.

Task 3: Which pre-processing steps would be required to prepare this dataset for analysis? Refer to this week’s readings for guidance.

Task 4: Which types of visualizations or descriptive tables of these data would help you determine whether the data are well balanced and well suited for predictive analysis?

Task 5: Which pre-processing actions did Akçapınar et al (2019) take?

Task 6: What are some pre-processing actions that Akçapınar et al did not take that they could have, in your judgment?

Task 7: How do the pre-processing actions taken by Akçapınar et al compare to the proposed pre-processing actions that you identified above for the Student Performance Dataset?

Task 8: Do you have any questions or uncertainties regarding how data should be structured and pre-processed before it is used for PA and ML?

2.3.3 Prepare for synchronous seminar next week

Next week in this course (Week 3), we will hold multiple synchronous seminar sessions that you are required to attend. Prior to attending these sessions, we would like everyone to prepare by doing the following tasks.

Task 9: Watch the 25-minute video at https://youtu.be/wGE7C5w6hb4.

Task 10: Watch The Social Dilemma on Netflix. This link might take you there: https://www.netflix.com/title/81254224. If you do not have access to it on your own, inform the course instructors and we will arrange for you to watch it. The film takes about 1.5 hours to watch.

Task 11: Review all Week 1 readings.

You have reached the end of this week’s assignment. Please make sure to submit your responses to all tasks to the appropriate places (response in Week 1 discussion board, new discussion post in Week 2 discussion board).

¹¹ This note was added on May 24 2022. Everyone in the class was also emailed about this on May 24 2022.
¹² The file student.zip will then download to your computer. Within that file, find the file called student-por.csv, which is the only file that you need to open and use.
¹³ This note was changed on May 24 2022. Everyone in the course was also emailed about this change.