Chapter 2 Data preparation, structures, and visualization
IMPORTANT NOTES:
- Next week is Week 3, when we have our synchronous seminar sessions. You are required to complete the preparation for next week by the time of our first synchronous session. You will not have the entire week to do next week’s preparation, which includes watching a movie on Netflix. Please plan accordingly.
This week, our goals are to…
- Distinguish between data types.
- Determine which preprocessing techniques are appropriate for which situations.
- Become familiar with how learning analytics data are structured.
- See examples of and brainstorm about how analytics data and results can be visualized and displayed.
2.1 Data structures and pre-processing
Last week, we learned about the types of problems that predictive analytics (PA) and machine learning (ML) can help us solve. We also read examples of how ML is being used in practice within education and healthcare. Now that we have started to explore what PA and ML can do, we have to start learning about how they actually work. One of the very first steps is to have the correct data available and to make sure that they are organized and prepared in a way that ML algorithms can use. This week, we will explore how data should be organized and prepared so that they can be used for PA. Then, we will learn about how the data and results related to PA can be visualized and displayed.
Here are a few notes to get started:
- Data are often placed into a spreadsheet format, with each row representing an observation (a person, a school, a piece of equipment, or anything else that you want to collect information about).
- Keep in mind that “features” in a dataset refer to characteristics of each observation (row) in the data. Features are also called variables or columns.
- There is no definitive list of pre-processing steps. Which steps are needed depends on the specific PA/ML algorithms you use. We are learning about some common pre-processing steps this week. Later in the class, you’ll practice implementing some of them.
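To make the notes above concrete, here is a minimal sketch in Python (chosen only for illustration; it is not the tool used in this course). It represents a tiny table in which each row is an observation and each column is a feature, then applies two pre-processing steps that are often needed: one-hot encoding a categorical feature and standardizing a numeric one. The feature names, the values, and the helper functions `one_hot` and `standardize` are all hypothetical.

```python
from statistics import mean, pstdev

# Hypothetical toy dataset: each dict is one observation (row),
# each key is one feature (column). All names and values are made up.
rows = [
    {"student_id": 1, "school": "A", "absences": 2},
    {"student_id": 2, "school": "B", "absences": 6},
    {"student_id": 3, "school": "A", "absences": 4},
]


def one_hot(rows, feature):
    """Replace a categorical feature with one 0/1 column per category."""
    categories = sorted({row[feature] for row in rows})
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != feature}
        for category in categories:
            new_row[f"{feature}_{category}"] = int(row[feature] == category)
        encoded.append(new_row)
    return encoded


def standardize(rows, feature):
    """Rescale a numeric feature to mean 0 and standard deviation 1.

    Assumes the feature is not constant (otherwise the spread is 0).
    """
    values = [row[feature] for row in rows]
    center, spread = mean(values), pstdev(values)
    return [{**row, feature: (row[feature] - center) / spread} for row in rows]


prepared = standardize(one_hot(rows, "school"), "absences")
for row in prepared:
    print(row)
```

After these steps, every column holds numbers, which many (though not all) ML algorithms expect. Whether either step is appropriate still depends on the algorithm you plan to use.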
Please read the following to learn more about data structures and pre-processing:
- Re-read the “Data pre-processing and feature extraction” section on pp. 6 & 7 as well as Table 14 on p. 18 in the Akçapınar et al. (2019) article, which you already read last week. This is the most detailed description of data preparation out of last week’s content, which is why we are revisiting it this week.
- Boehmke & Greenwell. 2020. Chapter 3 “Feature & Target Engineering” in Hands-On Machine Learning with R. https://bradleyboehmke.github.io/HOML/engineering.html – Do your best to understand this. Feel free to mark anything that is confusing and keep moving, and then ask about the confusing parts later.
- Hale, J. 2018. “7 Data Types: A Better Way to Think about Data Types for Machine Learning.” Towards Data Science. https://towardsdatascience.com/7-data-types-a-better-way-to-think-about-data-types-for-machine-learning-939fae99a689.
Some of these items can also be accessed in the Week 2 section of the Content section within the D2L website for this course.
2.2 Example dataset
Now we will explore a dataset that is organized in a way that is well suited to PA and ML in the education context. This is called the Student Performance Dataset.
Please read the description of the dataset and the list of variables it contains at the following page:
- Student Performance Data Set. UCI Machine Learning Repository. https://archive.ics.uci.edu/ml/datasets/student+performance.
Next, download the dataset to your own computer and open it. You can download it from D2L or from the page above by clicking on Data Folder and then student.zip.[11] In the past, the D2L version of the data has proven more user-friendly and better formatted. You should be able to open the file in Excel or similar spreadsheet software.
The source for the Student Performance Dataset is:
- P. Cortez and A. Silva. Using Data Mining to Predict Secondary School Student Performance. In A. Brito and J. Teixeira Eds., Proceedings of 5th FUture BUsiness TEChnology Conference (FUBUTEC 2008) pp. 5-12, Porto, Portugal, April, 2008, EUROSIS, ISBN 978-9077381-39-7. http://www3.dsi.uminho.pt/pcortez/student.pdf. – You are not required to read this article.
From now on, we will call this the student-por or student-por.csv dataset.
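If you would like to inspect the file outside of a spreadsheet program, the following is one possible sketch in Python (an illustration only; it is not required and is not the tool used in this course). It counts the observations and lists the features. It assumes that student-por.csv sits in your working directory and that the file may be delimited by either commas or semicolons, so it guesses the delimiter from the header line. The helper `read_table` is hypothetical, and the fallback `SAMPLE` rows are invented values that only borrow a few column names from the dataset description.

```python
import csv
import io
from pathlib import Path


def read_table(text):
    """Parse delimited text into a list of row dictionaries.

    The delimiter is guessed from the header line, because copies of the
    file may use commas or semicolons (an assumption; check your copy).
    """
    header = text.splitlines()[0]
    delimiter = ";" if header.count(";") > header.count(",") else ","
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


# Made-up sample rows used only when the real file is not found.
SAMPLE = 'school;sex;age;G3\n"GP";"F";18;11\n"GP";"M";17;12\n"MS";"F";16;14\n'

path = Path("student-por.csv")  # assumed location: the working directory
text = path.read_text(encoding="utf-8-sig") if path.exists() else SAMPLE
students = read_table(text)

print(len(students), "observations (rows)")
print(list(students[0].keys()), "features (columns)")
```

Note that every value is read in as text, including numbers such as age. Converting columns to the right data type is itself a pre-processing step.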
2.3 Data dashboards
As part of this course, we will learn how to make a data dashboard using the flexdashboard package in R. Please see the following examples of dashboards that other people have made:
- MOSAIC Study Progress, by Jennifer Thompson: https://jenthompson.me/examples/progressdash.html. Be sure to click through every page on the dashboard using the navigation panel at the top. Also, notice that you can interact with many of the charts to get additional data or change the display. If you want, you can also read the author’s article about the making of this dashboard (not required): https://jenthompson.me/2018/02/09/flexdashboards-monitoring/.
- Locating neighborhood diversity in the American metropolis, by Kyle Walker: https://walkerke.shinyapps.io/neighborhood_diversity/.
- HTML Widgets Showcase: https://testing-apps.shinyapps.io/flexdashboard-storyboard/. This is an example of a storyboard, which is another capability of the flexdashboard package.
When you design your own data dashboard, you can decide for yourself how many pages it will have, which specific data displays will be on each page and how they will be arranged, text and captions, colors, and more! The resources below can help you make some of these decisions (you are not required to look at these):
- The R Graph Gallery, by Yan Holtz: https://r-graph-gallery.com/
- flexdashboard: Easy interactive dashboards for R, by RStudio Team: https://posit.co/blog/flexdashboard-easy-interactive-dashboards-for-r/
- Section “5.2 Components”, by Yihui Xie: https://bookdown.org/yihui/rmarkdown/dashboard-components.html
2.4 Assignment
In this week’s assignment, you will brainstorm about research questions that could be answered using predictive analytics (PA) and machine learning (ML), consider which data would be needed to answer those questions, and brainstorm about ways to visualize and present analytics results.
Here is a summary of work for this week:
- 100-word (or more) response to last week’s discussions (respond in Week 1 discussion board in D2L)
- Answers to this week’s discussion questions (respond in new Week 2 discussion board in D2L)
- Excel (or similar) example data spreadsheet (submit in D2L assignment dropbox)
- PowerPoint (or similar) file with data dashboard template (submit in D2L assignment dropbox)
- Answer 15 flashcards in the Adaptive Learner App.
Keep in mind that all of your work might be shared with all students in the course, for peer review or sharing ideas.
2.4.1 Response to week 1 discussion
Task 1: In last week’s D2L discussion board, please write a thoughtful response of at least 100 words to one of your classmates’ posts.
2.4.2 Discussion post and Excel assignment for week 2
Now you will prepare a new discussion post for Week 2, which you will post in the new D2L discussion board for this week. Below, you are assigned a case example from last week’s Ekowo and Palmer (2016) reading.
Which case you should focus on, for each student:
- KA, HE, AG, EG: Focus on the case “Helping Students Select Courses at Austin Peay State University”, on p. 9.
- HL, LL, SN, KP: Focus on the case “Predicting Enrollment at Wichita State University”, on p. 11.
- KR, DS, RS, AY: Focus on the case “Adaptive Learning at Colorado Technical University”, on p. 10.
Note that this is NOT a group assignment. You are working individually.
Task 2: For the case example assigned to you from the Ekowo and Palmer (2016) reading, think about what data would be needed to actually execute the analyses and get the results that are described. Write a detailed, step-by-step procedure and description of what the people doing the data collection and analysis would need to do in order to get the right data and be able to put it into a spreadsheet that could be realistically used for analysis.
Task 3: In Excel or similar spreadsheet software, make a data spreadsheet that looks how you imagine the data would look for the case that you are focusing on. Write each variable name at the top of its own column, in the first row of the spreadsheet. You should also include five rows of fake data underneath the header row. In other words, we are asking you to imagine what the data spreadsheet would need to look like for this analysis to be possible.
Task 4: Once the data are collected into the data spreadsheet, which pre-processing steps would need to happen before the data can be analyzed by the computer? You can refer to this week’s assigned readings for ideas and examples about this.
2.4.3 Design a data dashboard
Later in this course, we will learn how to make data dashboards using the flexdashboard package in R. You saw some examples of this in this week’s materials, which will be useful as you think about making your own dashboard. In this part of the assignment, we will complete one of the first steps towards making a data dashboard ourselves.
Task 5: This week, your job is to create a mock-up in PowerPoint—or a similar platform—of a dashboard that you might make for your own use in your own professional work and/or at your organization. This week, you will just design the look and layout of the dashboard that you will eventually make. Please locate the template file called “dashboard template 1” in this week’s D2L folder and download a copy to your computer (be sure to change the file name so that it includes your own name). Fill in this template with a description of the dashboard that you wish to make and then design the content and layout of at least three pages of the dashboard, giving as much detail as possible. Please see further instructions within the template file itself. Soon, in a future week of the course, we will peer-review each other’s mock-ups.
2.4.4 Prepare for synchronous seminar next week
Next week in this course (Week 3), we will hold multiple synchronous seminar sessions that you are required to participate in.[12] Prior to attending these sessions, we would like everyone to prepare by doing the following tasks.
Task 6: Watch the 25-minute video at https://youtu.be/wGE7C5w6hb4.
Task 7: Watch The Social Dilemma on Netflix. This link might take you there: https://www.netflix.com/title/81254224. If you do not have access to Netflix on your own, inform the course instructors and we will arrange for you to watch it. The film takes about 1.5 hours to watch.
Task 8: Review all Week 1 readings.
You have reached the end of this week’s assignment. Please make sure to submit your responses to all tasks to the appropriate places.
2.4.5 Answer flashcard quiz questions
Task 9: As in every week of this class, please answer 15 questions in the Adaptive Learner App, which can be found at https://educ-app-2.vercel.app/.
[11] If you download it from D2L, just open the file you downloaded, which should be called student-por.csv. If you download it from uci.edu, the file student.zip will probably be downloaded to your computer. Within that file, find the file called student-por.csv, which is the only file that you need to open and use.
[12] If you are not able to attend the scheduled sessions, please contact the instructors right away to arrange an alternative.