Chapter 2 May 22–28: Data preparation and structures

Like the previous chapter in this textbook, this chapter contains everything that you need to do for Week 2 of HE-930.

IMPORTANT NOTES:

  • Next week is Week 3, when we have our synchronous seminar sessions. You are required to complete the preparation for next week by the time of our first synchronous session. You will not have the entire week to do next week’s preparation, which includes watching a movie on Netflix. Please plan accordingly.

This week, our goals are to…

  1. Distinguish between data types.

  2. Determine which preprocessing techniques are appropriate for which situations.

  3. Become familiar with how learning analytics data are structured.

  4. See examples of and brainstorm about how analytics data and results can be visualized and displayed.

2.1 Data structures and pre-processing

Last week, we learned about the types of problems that predictive analytics (PA) and machine learning (ML) can help us solve. We also read examples of how ML is being used in practice within education and healthcare. Now that we have started to explore what PA and ML can do, we have to start learning about how they actually work. One of the very first steps is to have the correct data available and to make sure that the data are organized and prepared in a way that ML algorithms can use. This week, we will explore how data should be organized and prepared so that they can be used for PA. Then, we will learn about how the data and results related to PA can be visualized and displayed.

Here are a few notes to get started:

  1. Data are often placed into a spreadsheet format, with each row representing an observation. An observation could be a person, a school, a piece of equipment, or anything else that you want to collect information about.

  2. Keep in mind that “features” in a dataset refer to characteristics of each observation (row) in the data. Features are also called variables or columns.

  3. There is no definitive list of pre-processing steps/actions. It depends on the specific PA/ML algorithms you use. We are learning about some common pre-processing steps this week. Later in the class, you’ll practice implementing some of them.
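To make the first two notes concrete, here is a minimal sketch in base R of a small dataset in which each row is an observation and each column is a feature. The data frame name (students), the column names, and all of the values are hypothetical and were made up for illustration only; they do not come from any of the readings.

```r
# Hypothetical example data: every name and value below is made up.
students <- data.frame(
  student_id = c("S01", "S02", "S03", "S04"),        # identifier (text)
  age        = c(19L, 22L, 20L, 35L),                # whole numbers
  gpa        = c(3.4, 2.9, 3.8, 3.1),                # continuous numbers
  program    = factor(c("nursing", "education",
                        "nursing", "education")),    # categories
  first_gen  = c(TRUE, FALSE, FALSE, TRUE),          # yes/no
  stringsAsFactors = FALSE
)

# Each row is one observation (here, one student).
n_observations <- nrow(students)

# Each column is one feature (also called a variable).
n_features <- ncol(students)

# The data type that R has assigned to each column.
feature_types <- sapply(students, class)

print(students)
print(feature_types)
```

Checking the type of every column is a useful habit, because many pre-processing steps apply only to certain data types.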

Please read the following to learn more about data structures and pre-processing:

  1. Re-read the “Data pre-processing and feature extraction” section on pp. 6 & 7 as well as Table 14 on p. 18 in the Akçapınar et al. (2019) article, which you already read last week. Of last week’s content, this is the most detailed description of data preparation, which is why we are revisiting it this week.

  2. Boehmke & Greenwell. 2020. Chapter 3 “Feature & Target Engineering” in Hands-On Machine Learning with R. https://bradleyboehmke.github.io/HOML/engineering.html – Do your best to understand this. Feel free to mark anything that is confusing and then keep moving (and then ask later about any confusing parts).

  3. Hale, J. 2018. “7 Data Types: A Better Way to Think about Data Types for Machine Learning.” Towards Data Science. https://towardsdatascience.com/7-data-types-a-better-way-to-think-about-data-types-for-machine-learning-939fae99a689.

Like last week, these items can also be accessed in the Week 2 section of the Content section within the D2L website for this course.
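The readings above describe many pre-processing steps. As one illustration, the sketch below applies three common ones in base R: filling in a missing value, standardizing a numeric feature, and dummy-coding a categorical feature. The dataset (raw), its column names, and its values are hypothetical, and the choice of these three steps is an assumption made for illustration. The steps that are appropriate for a real project depend on the data and on the algorithm being used.

```r
# Hypothetical raw data with one missing value; all values are made up.
raw <- data.frame(
  hours_online = c(2.0, 5.5, NA, 8.0, 4.5),
  program      = factor(c("nursing", "education", "nursing",
                          "pharmacy", "education")),
  final_grade  = c(61, 74, 58, 90, 70)
)

prepared <- raw

# Step 1: impute the missing numeric value with the median of the
# values that were observed.
median_hours <- median(prepared$hours_online, na.rm = TRUE)
prepared$hours_online[is.na(prepared$hours_online)] <- median_hours

# Step 2: standardize the numeric feature so that it has mean 0 and
# standard deviation 1.
prepared$hours_online_z <- as.numeric(scale(prepared$hours_online))

# Step 3: dummy-code the categorical feature, creating one 0/1 column
# per category (the "- 1" removes the intercept column).
dummies <- model.matrix(~ program - 1, data = prepared)
prepared <- cbind(prepared, dummies)

print(prepared)
```

In a real analysis, values such as the median, mean, and standard deviation would normally be calculated from the training data only and then reused on any new data, so that information from the test data does not leak into the model.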

2.2 Example dataset

Now we will explore a dataset that is organized in a way that is well suited to PA and ML in the education context. It is called the Student Performance Dataset.

Please read the description of the dataset and the list of variables it contains at the following page:

Next, download the dataset to your own computer and open it. You can download it from D2L or from the page above by clicking on Data Folder and then student.zip (see note 1 at the end of this chapter).

The version of the data from D2L has been found to be more user-friendly and better formatted in the past. You should be able to open the file in Excel or a similar spreadsheet software.

The source for the Student Performance Dataset is:

  • P. Cortez and A. Silva. Using Data Mining to Predict Secondary School Student Performance. In A. Brito and J. Teixeira Eds., Proceedings of 5th FUture BUsiness TEChnology Conference (FUBUTEC 2008) pp. 5-12, Porto, Portugal, April, 2008, EUROSIS, ISBN 978-9077381-39-7. http://www3.dsi.uminho.pt/pcortez/student.pdf. – You are not required to read this article.

We will call this the student-por or student-por.csv dataset in the future.
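Besides opening the file in a spreadsheet program, it can also be loaded into R. The sketch below is an illustration only, not a required step this week. So that it runs on any computer, it first writes a tiny stand-in file that uses a few column names from the dataset description (school, sex, age, G1, G2, G3) with made-up values. The semicolon separator is an assumption based on the uci.edu version of the file; the D2L version may be comma-separated, in which case sep = "," would be needed instead. The object names example_path and student_por are hypothetical.

```r
# Stand-in file so that this sketch runs anywhere. The column names come
# from the dataset description; the three rows of values are made up.
example_path <- tempfile(fileext = ".csv")
writeLines(
  c("school;sex;age;G1;G2;G3",
    "GP;F;18;10;11;11",
    "GP;M;17;12;13;12",
    "MS;F;16;14;14;15"),
  example_path
)

# To use the real data, replace example_path with the location of
# student-por.csv on your computer.
# Assumption: the file is semicolon-separated. Use sep = "," if your
# copy is comma-separated.
student_por <- read.csv(example_path, sep = ";", stringsAsFactors = TRUE)

# Inspect the structure: one row per student, one column per feature.
str(student_por)

# Summarize one numeric column (G3 is the final grade).
summary(student_por$G3)
```

If the data load as a single wide column, the separator is probably wrong for that copy of the file.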

2.3 Data dashboards

As part of this course, we will learn how to make a data dashboard using the flexdashboard package in R. Please see the following examples of dashboards that other people have made:

When you design your own data dashboard, you can decide for yourself how many pages it will have, which specific data displays will be on each page and how they will be arranged, text and captions, colors, and more! The resources below can help you make some of these decisions (you are not required to look at these):

2.4 Assignment

In this week’s assignment, you will brainstorm about research questions that could be answered using predictive analytics (PA) and machine learning (ML), consider which data would be needed to answer those questions, and brainstorm about ways to visualize and present analytics results.

As before, you will submit your work from this week over email to your small discussion group. You will submit the following items (in one or multiple emails, whichever you prefer):

  1. 100-word (or more) response to last week’s discussions.
  2. Answers to this week’s discussion questions.
  3. Excel (or similar) example data spreadsheet.
  4. PowerPoint (or similar) file with data dashboard template.

Once again, keep in mind that all of your work might be shared with all students in the course, not just your small discussion group.

2.4.1 Response to Week 1 discussion

Task 1: In your email, please read the responses written by your group-mate(s) last week. Then, reply-all to your small discussion group with a thoughtful response of 100 words or more on a topic of your choosing.

2.4.2 Discussion post

Now you will prepare a new discussion post for Week 2, which you will send over email to the same discussion group as last week. Below, you are assigned a case example from last week’s Ekowo and Palmer (2016) reading.

Which case you should focus on, for each student:

  • ED: Focus on the case “Helping Students Select Courses at Austin Peay State University”, on p. 9. Your groupmates are still MM and JM.
  • MM: Focus on the case “Predicting Enrollment at Wichita State University”, on p. 11. Your groupmates are still ED and JM.
  • JM: Focus on the case “Adaptive Learning at Colorado Technical University”, on p. 10. Your groupmates are still ED and MM.
  • FS: Focus on the case “Helping Students Select Courses at Austin Peay State University”, on p. 9. Your groupmates are still SA and RB.
  • SA: Focus on the case “Predicting Enrollment at Wichita State University”, on p. 11. Your groupmates are still FS and RB.
  • RB: Focus on the case “Adaptive Learning at Colorado Technical University”, on p. 10. Your groupmates are still FS and SA.
  • DR: Focus on the case “Helping Students Select Courses at Austin Peay State University”, on p. 9. Your groupmate is still DT.
  • DT: Focus on the case “Predicting Enrollment at Wichita State University”, on p. 11. Your groupmate is still DR.

Task 2: For the case example assigned to you from the Ekowo and Palmer (2016) reading, think about what data would be needed to actually execute the analyses and get the results that are described. Write a detailed, step-by-step procedure and description of what the people doing the data collection and analysis would need to do in order to get the right data and be able to put it into a spreadsheet that could be realistically used for analysis.

Task 3: In Excel or a similar spreadsheet software, make a data spreadsheet that looks how you imagine the data would look for the case that you are focusing on. Write the variable names in the first row of the spreadsheet, one per column. You should also include five rows of fake data underneath the header row. Submit the spreadsheet over email to your discussion group along with the rest of your answers. In other words: we are asking you to imagine what the data spreadsheet would need to look like for this analysis to be possible.

Task 4: Once the data are collected into the data spreadsheet, which pre-processing steps would need to happen before the data can be analyzed by the computer? You can refer to this week’s assigned readings for ideas and examples about this.

2.4.3 Design a data dashboard

Later in this course, we will learn how to make data dashboards using the flexdashboard package in R. You saw some examples of this in this week’s materials, which will be useful as you think about making your own dashboard. In this part of the assignment, we will complete one of the first steps towards making a data dashboard ourselves.

Task 5: This week, your job is to create a mock-up in PowerPoint—or a similar platform—of a dashboard that you might make for your own use in your own professional work and/or at your organization. This week, you will just design the look and layout of the dashboard that you will eventually make. Please locate the template file called “dashboard template 1” in this week’s D2L folder and download a copy to your computer (be sure to change the file name so that it includes your own name). Fill in this template with a description of the dashboard that you wish to make and then design the content and layout of at least three pages of the dashboard, giving as much detail as possible. Please see further instructions within the template file itself. You should also submit your finished dashboard mock-up to your small discussion group. Soon, in a future week of the course, we will peer-review each other’s mock-ups.

2.4.4 Prepare for synchronous seminar next week

Next week in this course (Week 3), we will hold multiple synchronous seminar sessions that you are required to attend. Prior to attending these sessions, we would like everyone to prepare by doing the following tasks.

Task 6: Watch the 25-minute video at https://youtu.be/wGE7C5w6hb4.

Task 7: Watch The Social Dilemma on Netflix. This link might take you there: https://www.netflix.com/title/81254224. If you do not have access to this on your own, inform the course instructors and we will arrange for you to watch it. The movie is about 1.5 hours long.

Task 8: Review all Week 1 readings.

You have reached the end of this week’s assignment. Please make sure to submit your responses to all tasks over email to your small discussion group, including both your response to the Week 1 discussion and your new Week 2 discussion post.


  1. If you download the dataset from D2L, just open the file you downloaded, which should be called student-por.csv. If you download it from uci.edu, the file student.zip will probably be downloaded to your computer. Within that file, find the file called student-por.csv, which is the only file that you need to open and use.