HUDK4050 at a Glance

Instructor:
Charles Lang
Course Assistants:
Hsiao Yang
Zhongyuan Zhang

Office Hours:
Please sign up for office hours here.

Updates: Communicated via TC email address and WeChat group

Credits: 3

Prerequisites: None

Time/Dates:
Synchronous session times will be determined based on enrolled students' timezones. For exact times and dates, please subscribe to the HUDK4050 Google Calendar (you will need to be logged in to your TC Google account). There are some issues accessing this link; if you cannot see the calendar, please download the calendar file here (right click -> “Save link as”) and open the file on your device.

Format:
HUDK4050 is conducted as a mix of both synchronous (code workouts, hackathons, office hours) and asynchronous (videos, code drills, assignments) content.

Content:
HUDK4050 is an introduction to educational data mining/learning analytics/educational data science/learning engineering with a focus on skills needed to work in the field. Students should expect to gain proficiency in developing an educational data solution from start to finish. We focus on counting stuff, data wrangling, making meaning from data and using it to improve educational systems and processes.

Assessment:

  • Six assignments
  • Twenty-four code exercises
  • Participation (hackathons, video comments)

Topics: The course is split into four key questions in education that we will tackle from a data science perspective:

  • Q1: How do we create community in online spaces?
  • Q2: How do we deal with so much data?
  • Q3: How do we know if learners are engaging with material?
  • Q4: How do we predict what learners will do next?

Topic and Assessment Schedule

| Date | Topic | Assessment |
|------|-------|------------|
| Sep 4 | Introduction | |
| Sep 17 | Introduction | Assignment 1 Due |
| Sep 22 | Q1 | Hackathon 1 |
| Oct 1 | Q1 | Assignment 2 Due |
| Oct 6 | Q1 | Hackathon 2 |
| Oct 15 | Q1 | Assignment 3 Due |
| Oct 20 | Q2 | Hackathon 3 |
| Oct 29 | Q2 | Assignment 4 Due |
| Nov 10 | Q3 | Hackathon 4 |
| Nov 19 | Q3 | Assignment 5 Due |
| Dec 1 | Q4 | Hackathon 5 |
| Dec 10 | Q4 | Assignment 6 Due |

Week 1 Introduction

1.1 Welcome to HUDK4050

Introduction Video

If you cannot access the video please email

If you would like to help translate the video into another language click here

Video Transcript

# Welcome to Core Methods in Educational Data Mining, a course that introduces students to the complicated, exciting and messy intersection of big data and education.
 
# In this introductory video I will explain the details of the course and why you might want to take it, and then, more importantly, why you might want to stick with it over the next two and a half months.
 
# We will cover the:
 
# Goals
# Format 
# Activities & assessment
# Technology 
# and some administrative details
 
# But first some introductions, I am Charles your instructor and together with two wonderful Course Assistants, Hsiao and Zhongyuan, we are here to help you learn as much as possible. Hsiao and Zhongyuan are both second year Learning Analytics students and I am a researcher who works on digital literacy and personalization algorithms. 
 
# But before we get to the goals of the course we need a little context. This course exists because of the incredible growth in data facilitated by the internet, mobile computing, new sensors and increases in computational power. The IDC estimates the amount of data being collected per year to reach a stupid big number like 175 ZB by 2025. 
 
# This trend has occurred across all industries but we are concerned here with education: how the data can be collected, processed, analyzed, visualized and put to use in the service of educational systems.
 
# Of course, it remains an open question how all this data will be utilized and whether or not it even should be. There are certainly benefits. We can automate processes so that they are faster and cheaper, we can find insights including injustices that we might otherwise miss, and crucially we can show students when and how they are learning in ways that are not otherwise possible. Clearly there is a substantial upside here, however, when we get it wrong we can cause a lot of problems - as the United Kingdom just found out when they tried to implement a grade estimation model.  
 
# So the goals of this course are to start you on a journey to become good at this work - the job of taking large amounts of educationally relevant data, making use of it, but doing so in a responsible way. In my experience there are a few important traits that make you good at this kind of work:
 
# The first is that you are good at counting things, which may sound trivial, but the ability to take a problem or a situation and say, “Here are some things we could count” is an important aspect of problem decomposition, a skill that is in high demand.
 
# The second trait is being good at processing those counts - by which I mean you are good at cleaning up data so it is reliable and good at combining data sets and variables to optimize the information that is available. This usually means you are fast and accurate at coding.
 
# The third trait is that you are good at making meaning from the processed counts to create informative metrics. You have statistical knowledge and can characterize uncertainty.
 
# And fourth, you need to be good at using the insights you gain from data to make systems better, not worse. You can think through the impact of using specific data within an educational context and have a good understanding of how quantitative information can impact different stakeholders.
 
# So the goals of this course are to help you improve in each of these buckets, and we will do that by trying to answer four big questions from a data science perspective:
 
# How do we create community in online educational spaces?
# How do we deal with so much data?
# How do we know if learners are engaging with material?
# How do we predict what learners might do next?
 
# And we endeavor to complete a set of activities and assessments to help you do just that.

# Activities & Assessments
 
# Let’s talk activities first. Because we are spread across the globe we need to deal with a varied number of timezones. However, if we do a completely asynchronous course (as in it is self-paced) then we lose some of the community building aspects that research tells us are important for learning. So the strategy will be to have a mix of synchronous and asynchronous content, trying to play to the strengths of each format. So synchronous sessions will be reserved for activities where we work together on some problem or project and asynchronous content will be predominantly information delivery like this video and self-directed practice.
 
# We will have three flavors of synchronous session:
 
# Hackathons
# Coding workouts
# Office Hours
 
# A hackathon session will last two hours and will involve group work on a specific educational data problem with a presentation of your work at the end. There will be five in total.
 
# A code workout will involve live instruction for 30 minutes in a coding exercise - it is short but high demand.
 
# Office hours are available each week for consultation and conversation about any questions or concerns you may have. They tend to be one or two students and you need to sign up for a time.
 
# The exact times of these sessions will be determined based on the timezones students are located in. You will have received a survey link to tell us your timezone so that we can schedule them. You should also subscribe to the HUDK4050 Google calendar so that you can know when those sessions will occur. 
 
# On the asynchronous side of things there are also three flavors of session.
 
# Videos like this one that you can watch and comment on
# Six assignments you will submit individually which are based on your hackathon work
# Basic coding drills you will also complete individually
 
# These synchronous and asynchronous activities form the assessment for the class, with the caveat that we are looking for growth over performance. Everyone is coming into this work with different prior experience, so there are no comparative measures and no grading on a curve. What we want to see are honest attempts at the work and improvement across the four buckets I mentioned:
 
# Counting, processing counts, making meaning from numbers and the application of metrics to systems.
 
# Staying on top of all these activities is a challenge so again, make sure you subscribe to the HUDK4050 Google Calendar.
 
# Which brings us to the technology part of the course. There are four core pieces of technology that we will be using:
 
# The R programming language
# RStudio an Integrated Development Environment for the R programming language
# The Git version control system and
# The GitHub online repository system
 
# Don’t worry if you have never heard of any of these, we will be going over all of them in detail. 
 
# So, in summary, you should take this course if you are looking to dip your toes into big data and education and you should stick with it if you want to develop your skills in applying quantitative analyses to real world educational problems.
 
# If this is you, and we hope it is, you have four tasks to complete by Thursday:
 
# 1. Watch this video - look at you, you're already ahead of the game
# 2. Go to github.com and make an account - it is best to use a professional username as this is for your professional life
# 3. Subscribe to the HUDK4050 Google calendar (link in the email) so that you will be up to date on what is happening in the course
# 4. Complete the survey that was sent to you with this video
 
# After that, you are good to go. Welcome to Core Methods in EDM, it is going to be a great semester, I look forward to meeting you all!

1.2 R/RStudio & Git/Github

R/RStudio & Git/Github

Video Transcript

# Good morning, good afternoon, good evening wherever you happen to be in the world. Welcome to Core Methods in EDM.
 
# Let's start off by taking a look at the course website. As you have already noticed, we don't use Canvas. This is because I want you to learn platforms that you will use in your work life. So the course website is built in the R Markdown language and hosted on GitHub - two technologies we will be jumping into today.
 
# By way of a small tour, you can find the sign up for office hours in the bottom left and you can change some viewing preferences at the top, like the size and the color scheme - if you want to save your eyes you can go with night mode. Transcripts for the videos are hidden by default but you can reveal them with the show button - very useful if you don't want to listen to my terrible accent.

# The site is organized as a feed with new content appearing every Tuesday and Thursday. If you are looking for videos, readings, assignments or other information you will find it here, organized by week.

# Here I have included today's videos and some links to some relevant events and opportunities. The data science and education student group has just revamped their LinkedIn website, they do coding bootcamps and other events. You should check it out and send them a message if you want to join. Also Columbia Data Science Day events are coming up, register at the link to hear a speech on data privacy and ethics from former Google CEO Eric Schmidt on September 14. 

# I have also posted a link to the first assignment on GitHub. That will be due September 17 and is all about setting up your software environment, using R Markdown and practicing version control.

# The next two videos posted today will cover an introduction to R, RStudio, Git and GitHub. Next Tuesday we will talk about R Markdown and try to figure out what exactly educational data mining is.

1.3 R & RStudio

R & RStudio

Video Transcript

# I'm not going to lie to you, R is a pain in the ####. Especially if you are a computer scientist or you have learned another programming language that has clearly defined standards and rules. 

# You see, unlike those languages, R is more like a real language, like Chinese or Spanish, where everyone can make up whatever they like, and a consensus is negotiated. R hasn't been regulated and standardized to the same degree as languages like C and Python - it is a collective effort. This approach has strengths and drawbacks. It is a very innovative language with new functions constantly being developed, but it also means that there are a thousand ways to do anything and it doesn't comply with computer science standards. So people who have studied computer science find it very illogical. If it has a unifying structure it is that it is designed for people to learn statistical methods while they code them. This is because it was built by statisticians, originally developed from the S language that was started at Bell Labs to tackle data problems in 1976. R was a free version of S developed at the University of Auckland, NZ in 1995 by Ross Ihaka and Robert Gentleman, although the original developer of S, John Chambers, has also been heavily involved in R.

# So let's say you want to learn about t-tests and you want to try them out, you can easily get inside the code to see what is going on and understand the program from a statistical point of view. And as far as statisticians are concerned R is the bee's knees. All I will say is that although it will cause you a lot of frustration you will learn to love it.

# R ranks 8th on the TIOBE (The Importance Of Being Earnest) popularity index of programming languages. A lot of students ask me whether they need to learn R or Python and the unfortunate truth is you probably need both to work in data science. They are becoming more similar but they do have strengths and weaknesses. Python is a lot more general, you can build a greater variety of things in Python more easily, but if you are doing heavily statistical work R is the preferred tool. I have placed the official R documentation on the course website in case you need some help going to sleep, but we will cover basic R syntax and programming in the coming weeks.

# You can start R from the command line if you like and do your work directly in a terminal, but a few years ago different Integrated Development Environments (IDEs) were created. These pull together a bunch of useful functions into a graphical user interface. We will be using one of these, developed by RStudio. As a brief introduction, here is the RStudio interface, which is made up of four main panels: the file panel, the console, the environment and the project file manager. The file panel is where you load and work on permanent code files; this could be an R script, a markdown document or a full-blown program. The console is where your code is run and acts like a history of everything you have executed. Try not to use the console as the place where you write your code; you're better off working in a specific file so you can keep track of your work. The environment shows you any objects that you have created and that R can manipulate, most commonly what is called a data frame. And then the project file manager (the file tree) shows you the files you have available.
 
# One important point is that RStudio organizes your work into projects, a project is a collection of files, code and output. You should always work in a project, don't just have one project that has everything you do in it. This will save you time in the long run and help keep you organized.
 
# Instructions for installing R and RStudio are located in Assignment 1 that is posted on the course website. You should try to install them and get them up and running now before you watch the next video.
 
# If you have any comments or questions feel free to ask them in the comments section on YouTube or sign up for office hours.

1.4 Git & Github

Git & Github

Video Transcript

# Now onto probably the most confusing software we will use this semester - if you can get your head around Git and GitHub - which you will - everything else will be super easy.
 
# I will try to give you the simplest breakdown possible to start with and then we can make it more complicated as the semester progresses. 
 
# Git is what is called version control software. It is like track changes but for every file on your computer. It was created by Linus Torvalds in 2005 to help manage the development of the Linux Kernel project.

# It is designed to do a couple of things, keep a record of changes to files and allow collaboration by preventing people from overwriting each other's work. So it imposes a series of steps to make an official change to a file. The steps can be annoying, but that is a feature not a bug. It wants you to think about the consequences of any change you make. 

# Git lives on your computer and you can access it directly from the command line. We are not going to use it this way as it is integrated into RStudio. The important idea you need to keep in mind is that git tracks changes to files and forces you to go through a series of steps to make and save a change. 

# GitHub, now owned by Microsoft, was started in 2007 and allows the Git system to work across the internet. GitHub keeps a copy of your Git records of your file changes in the cloud. This allows there to be a backup of your work and for you to collaborate with others remotely.

# GitHub also serves another purpose for us and that is to be a portfolio of your work. This allows you to show prospective employers that you can do things. It provides a record of your work that is better than a line on a resume saying, I know R. You can show people, here are some projects I have done in R. I would like you to fill in the profile page and add a picture. It is worth investing time into maintaining your portfolio as it can be the difference between landing an interview and not.

# Now I want to show you the steps involved in making a change to a file using Git and GitHub through RStudio. And remember the steps are there for a reason - to stop you making mistakes, they will slow you down but that is on purpose.

# Let's say you are happily working on a file in a working directory on your laptop and you decide you would like to use Git to keep a record of your changes. Git consists of a local repository on your computer of all the changes you are making to your files. It also has a staging area where changes live that have not been committed to the record in the repository. When a change is in the staging area it has not been saved to the official record of changes yet. The same is true when you just save the file as you would normally: the change is on your disk but not in the record. To make an official copy in Git you need to commit the change to Git. Remember the word commit, it is important.
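
For anyone curious about what RStudio is doing behind its buttons, the save, stage and commit steps can be sketched on the command line. This is only an illustration under made-up assumptions: the scratch folder, the file name notes.txt and the identity "Example Student" are hypothetical, and in this course the same steps are done through the RStudio Git pane.

```shell
set -eu

# Hypothetical scratch folder; any empty directory would do.
repo="$(mktemp -d)"
cd "$repo"

git init --quiet
# Identity used only inside this throwaway repository.
git config user.name "Example Student"
git config user.email "student@example.com"

# Saving a file only changes the working directory; Git has recorded nothing yet.
echo "first draft" > notes.txt

# Stage the change: it now sits in the staging area, still not in the record.
git add notes.txt

# Commit: the change becomes part of the official record in the local repository.
git commit --quiet -m "Add first draft of notes"

# One line per commit in the local repository.
git log --oneline
```

Checking the box next to a file in RStudio corresponds roughly to `git add`, and the Commit button to `git commit`.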

# Luckily all this lives within RStudio and is easy to manage. But let's say you also want to backup your changes to the cloud - that is where GitHub comes in. GitHub is a remote copy of your local repository. But to make changes to this remote version you have to push the changes from your local version. Less commonly you can also pull changes from the remote repository to your local version.
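
A rough command line sketch of push and pull follows. Since a real GitHub account cannot be assumed here, a bare repository in a temporary folder stands in for the GitHub copy; that stand-in, the file contents and the identity are all assumptions made for illustration. With a real GitHub repository the remote address would be its URL instead of a local path.

```shell
set -eu

work="$(mktemp -d)"

# Assumption: this bare repository plays the role of the remote copy on GitHub.
git init --quiet --bare "$work/remote.git"

# The local repository on "your computer".
git init --quiet "$work/local"
cd "$work/local"
git config user.name "Example Student"
git config user.email "student@example.com"

echo "analysis notes" > README.md
git add README.md
git commit --quiet -m "Add README"

# Tell Git where the remote copy lives, then push the local commits to it.
git remote add origin "$work/remote.git"
branch="$(git rev-parse --abbrev-ref HEAD)"
git push --quiet origin HEAD

# Pull goes the other way: remote changes come down to the local copy.
# Nothing new exists on the remote here, so this changes nothing.
git pull --quiet --ff-only origin "$branch"
```

After the push, the remote copy holds the same commit as the local repository, which is what gives you a backup.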

# Now let's say your friend is working on a project and they want to collaborate with you. They send you the address for their GitHub repository and you can fork it - which means make a copy from their GitHub account to your GitHub account. But it isn't on your local machine yet, so you will need to clone the repository from your GitHub account to your computer. Then you can make changes, commit them and push them back to your GitHub account.

# If you have changes that you think are important and you want to share with your friend you can issue a pull request to have those changes made to the original copy of the repository. Importantly your friend can reject these changes or accept them. This is how collaboration works in GitHub, someone owns the original and they control whether to allow changes to it.
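
The fork, clone, commit and push sequence can be approximated locally as a sketch. Forking and pull requests are GitHub features rather than Git commands, so here a second bare copy stands in for "your fork" and the pull request step is only described in a comment. All folder names, file contents and identities are hypothetical.

```shell
set -eu

work="$(mktemp -d)"

# Stand-in for the friend's original repository on GitHub, seeded with one commit.
git init --quiet "$work/seed"
git -C "$work/seed" config user.name "Example Friend"
git -C "$work/seed" config user.email "friend@example.com"
echo "project readme" > "$work/seed/README.md"
git -C "$work/seed" add README.md
git -C "$work/seed" commit --quiet -m "Start project"
git clone --quiet --bare "$work/seed" "$work/original.git"

# "Fork": a copy of the original under your own account (here, another bare copy).
git clone --quiet --bare "$work/original.git" "$work/fork.git"

# Clone your fork to your computer, then change, commit and push back to the fork.
git clone --quiet "$work/fork.git" "$work/mine"
cd "$work/mine"
git config user.name "Example Student"
git config user.email "student@example.com"
echo "my suggestion" >> README.md
git add README.md
git commit --quiet -m "Suggest a change to the README"
git push --quiet origin HEAD

# The original is untouched at this point. On GitHub, a pull request would now
# ask the owner of the original to accept or reject the commit sitting in the fork.
```

The point of the sketch is that your push only ever reaches your own copy; the original changes only if its owner accepts the request.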

# Those are the basic ideas, you should now go follow the directions in Assignment 1 to install Git and set up GitHub.

It is a good idea to invest time in your Github profile, please fill out the basic information and add a picture.

1.5 Git & Github Walk Through

Practice Exercise

Video Transcript

# To get some practice with this process I have made a repository on GitHub at core-methods-in-edm/github-practice. Once you have completed the steps in Assignment 1 to install everything you should practice the following steps.

# First you need to fork the repository to your account. Then click on Code and copy the URL. Now open RStudio and go to File, New Project, Version Control and choose Git. If this does not work for you, try some of the troubleshooting advice in Assignment 1.

# Paste the URL into the Repository URL box, make sure the project has a name and then locate the project somewhere where it can live permanently. Don't put it on the desktop or the downloads folder. Make a folder for the class or something like that.

# You have now cloned the Repository. Open the readme file in the files window. And make a change to it. Save your change and then click on the Git icon - either at the top left or the tab in the right panel.

# Click commit to open the Git window. Here you will see the changes that are being recorded. Check the box next to the files you want to record a change for and then write a message that describes that change. It will not let you proceed until you have written a message. This is so collaborators know what is being changed and why. Then hit commit. This records the changes in Git on your computer. To record the change in GitHub you need to press push. Now if we go back to GitHub we can see that the changes have been recorded.
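
As a hedged command line counterpart to the Git window, the sketch below reviews a change and then shows that Git itself refuses a commit with an empty message. The repository, file contents and identity are invented for the example; the practice repository from the video is not used.

```shell
set -eu

repo="$(mktemp -d)"
cd "$repo"
git init --quiet
git config user.name "Example Student"
git config user.email "student@example.com"

# A starting point: one committed readme file.
echo "# Practice" > README.md
git add README.md
git commit --quiet -m "Add README"

# Make a change, then review what would be recorded.
echo "A line I added." >> README.md
git status --short          # lists README.md as modified
git --no-pager diff         # shows the changed lines

# Stage the file (the checkbox in RStudio) and try to commit without a message.
git add README.md
if git commit --quiet -m "" 2>/dev/null; then
  echo "unexpected: empty message accepted"
else
  echo "commit refused: a message is required"
fi

# With a descriptive message the commit goes through.
git commit --quiet -m "Add a practice line to the README"
```

Pushing to GitHub would follow as a separate step, which is why the change only appears online after you press push.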

# Finally, to create a pull request to make a change to the original copy you click the pull requests tab and then the green new pull request button and follow the prompts.

# So now go complete the installation and try to make a change to the readme file in the github-practice repository. Don't worry if you get stuck, we will be spending plenty of time making this work.

1.6 Assignment 1

Part A of Assignment 1 involves installation and setup of software. Please try to complete Part A by Tuesday September 8. If you run into any problems don’t hesitate to sign up for office hours.

Week 2 RMarkdown & R

  • This week we will look at the markup language RMarkdown and get started learning R
  • I have posted Part B of Assignment 1 that is due on September 17 midnight EDT
  • Videos are now available from Tencent for local access in China
  • The first code workout is on Thursday at 8:30am EDT/5:30am PST/8:30pm CST, the Zoom link is in the calendar (You will need to be logged into your TC Zoom account to access)
  • The five Hackathons have been posted in the class Google calendar

2.1 Intro to RMarkdown

RMarkdown

Video Transcript

# Good morning, good afternoon, good evening wherever you are in the world. 

# This lesson is about R Markdown, a tool we will be using to communicate our ideas about data in this course. RMarkdown documents will be the primary way you submit your assignments for example.

# Before we talk about markdown language we need to talk about markup languages. Markup languages are designed to provide formatting guidelines to files, with the intent that they will be invisible when the document is viewed by a user. They are analogous to the way manuscripts would be marked up by editors to give feedback to authors in the olden days.

# The most famous markup language is HTML, Hypertext Markup Language, that renders pages viewed in web browsers. HTML has presentation semantics that allow for the standard presentation of elements on web pages. Other markup languages you may be familiar with are LaTeX or XML.

# This idea of a markup language was extended to a markdown language by John Gruber and Aaron Swartz in 2004 to provide a "humane markup language" that could convert plain text into HTML. It is much simpler than HTML, hence mark-down, but it has had widespread use, specifically by our old friends at GitHub. In fact, if you did the exercise last week and edited the Readme file then you have already written markdown. Markdown was primarily developed to encourage people to edit readme files and communicate on discussion boards in a way that was simple but enabled more visual variety, such as bold and italic text.

# R markdown is a flavor of markdown developed by RStudio to aid the communication of data science ideas. It builds on the Markdown syntax to allow rendering of several different outputs including static HTML pages, PDF documents, Microsoft Word documents, presentations, dashboards and others. As with regular markdown there is an emphasis on simplicity of execution, so that things can look good without the user having to know JavaScript or HTML.

# To install RMarkdown within RStudio go to the Packages tab, click Install and search for the rmarkdown package. The words library and package are pretty interchangeable in R but strictly speaking they are different things. A package is a collection of code that extends the functionality of R - in this case allowing R to work with a markup language. A library is the location where the package is stored.

# You can also type the code install.packages("rmarkdown") - note that the package name is all lowercase and R is case sensitive. However, RStudio seems to be more reliable if you go through their package manager rather than using the code.

# The RMarkdown package has a lot of what are called dependencies. Which means it relies on a lot of other packages, so you may need to wait a while as it installs all of them. Also, if it isn't working you may need to install each missing package one at a time.

# Once RMarkdown is installed, you should have access to a new file type under File, New File: RMarkdown.

# If we open one it will take us to a screen that asks what kind of output file we are creating and some information that it will prepopulate the file with - all of this can be changed later so it doesn't matter that much what you put in.

# When you open an RMarkdown file it comes preloaded with some information, the top is just the information you provided on the previous screen but below this there is a short explanation of how to use the RMarkdown file with some examples of the functionality.  

# The basic structure of an RMarkdown file is that it has code blocks in gray and text blocks in white. Anything written in the white area will be rendered as text, while anything in the gray box will be run as R code. The idea is to run your code with explanation surrounding it to better communicate your ideas about the data. You can create a code block by typing three grave accents (backquotes) followed by a lowercase r inside curly brackets, and you end a code block with three grave accents. You can also use the insert menu or the shortcut option + command + i on a Mac or ctrl + alt + i on a Windows machine. It is definitely worth learning the shortcut on that one.

# Everything within a code block must obey R syntax and rules and everything outside a code block can obey Markdown syntax and rules. A cheatsheet for the markdown syntax can be found on the course website. Some of the handy functions are headings, bullet points, inserting images or videos and tables. There are also styles available that allow you to change the look and feel of the output.

# When you want to render your masterpiece you can click "knit" and the document will be created. If it is a pdf it will be saved in your current working directory.

# Have a play around, in the next section we will use a Markdown document to look at the class survey data.

2.2 Visual Analysis of Entrance Survey Results

Entrance Survey

2.3 Code Workout 1

Code Workout

  • Passcode: kwjse72?

Week 3 Introduction to the Field(s) and Swirl

  • DataCamp: For students looking for more coding practice we have a free account with DataCamp. Access through this link - your sign up email address must be your TC email address.

3.1 Fields of Study

Fields of Study

Video Transcript

# Good morning, good afternoon, good evening wherever you are in the world. Today we finally get to move past the setup stage and into some content.

# We will start with an introduction to some of the various intersecting fields that have grown up around technology, data and education. In particular, Artificial Intelligence in Education, Educational Data Mining, Learning Analytics and the new kid on the block Learning Engineering. There are other fields such as cognitive science and learning science that we could also include but we have to draw the line somewhere. We could also talk about educational data science but this term is so broad and used in so many different ways that it kind of encompasses everything.

# We will start with "AI in Ed", the oldest of the set. Education was part of the original conception of Artificial Intelligence and essential to its early development as a field. At the Dartmouth Conference in 1956, where AI and its core mission were codified, the idea was promoted that sufficiently complex machines would have to learn in similarly complex ways to how children do. Therefore education and AI were inextricably linked. However, as this idea became less popular the educational and engineering missions diverged. In this period AI in Education solidified as a field and over the 1980s pursued two main research agendas:

# 1. Development of AI that could be applied to classroom practice and
# 2. Development of AI to measure, understand and improve learning.
# The original proponents of AI in Ed made many bold claims that the new field would revolutionize, challenge and improve education. Much of this did not come to pass, but this is not to say that nothing was achieved. The development of intelligent tutoring systems owes a lot to AI in Ed, as do many cognitive models that explain interactions between learners and machines. But as a political enterprise to change the way that education systems operate, AI in Ed was unable to deliver on its promises. However, it is still a dominant field in education technology, with a society, journal and conferences with a strong European presence. The revolutionary aspects of the field have currently moved more into the private sector, and geographically towards China and India, broadly under the label AI in EdTech.

# The next field that we will discuss is Educational Data Mining or EDM. EDM arose from the field of Knowledge Discovery in Data, an area that evolved with the growth of sophisticated database and data storage systems. At some point people realized that all the data that was being stored by companies could be queried or mined for insights and patterns. Data that had previously seemed innocuous and useless took on new meaning and people were able to throw relatively simple algorithms at large data sets to see what might come back. EDM was the result of applying these methods principally to intelligent tutoring system and educational game log files. EDM tackles education technology from a predominantly computer science perspective, with a focus on prediction and clustering algorithms, relationship mining and distillation of data for human judgement. The field is represented by the International Society of Educational Data Mining which holds a yearly conference and produces the International Journal of Educational Data Mining.

# Arguably the largest by current membership is the field of Learning Analytics. Learning Analytics arose predominantly with the growth of data collection within higher education. People quickly realized that business intelligence software could be used to analyze data from student information systems and learning management systems. This logic sprouted a lot of research and product development across a variety of disciplines, from predicting student dropout to the ethical concerns of data collection. The size of the Learning Analytics community is facilitated by the breadth of concerns and methodologies employed in it. It is a broad church with both quantitative and qualitative researchers working within it and a lot of industry, university and government partnerships. It tends to be more systems level though, with less work on individual level problems such as is found in educational data mining. Whereas EDM may be concerned with the efficacy of a particular algorithm, LA will be concerned with what happens to the classroom or school when the intelligent tutoring system (ITS) is implemented. As with AI in Ed and EDM, LA has a society, a journal, a handbook and multiple conferences and webinars.

# The new kid on the block is learning engineering, which remains somewhat mysterious but has important advocates such as the Schmidt Foundation, the Chan Zuckerberg Foundation and Carnegie Mellon University. Learning engineering appears to be concerned with the same problems as AI in Ed, EDM and LA but from an engineering perspective. I have included links to an email thread from the Learning Analytics Google Group that gets a bit heated. Don't take it too seriously, people are being very sarcastic in parts, but it is revealing to see the differences and similarities between people in each area. I have also included a much more diplomatic article discussing the differences and similarities between LA and EDM, written by George Siemens, the person who initiates the email thread, with Ryan Baker, one of the main voices within EDM and the founder of the Learning Analytics Masters program here at TC.

# I don't think you should be pledging your allegiance to one field or another; that is not a good use of your energy. But I do think you need to be able to talk to people who see themselves as firmly located in one field or another. Go take a look at these two artifacts and then let me know what you think in the comments: do you see yourself as belonging to one of these fields in particular, or can you not really see the difference between them?

3.2 Code Workout 2

Code Workout

  • Passcode: PX.5D5*r
  • Closed captioning is enabled for the Code Workout
  • Correction: code for table was incorrect, was missing horizontal dashes, correct code:

| One | Two | Three |
|-----|-----|-------|
|11111|22222|3333333|

Week 4 Creating Community in Online Spaces

4.1 Hackathon 1: Counting Community

  • The first hackathon for the semester will be held Tuesday 9/22 8pm EDT/5:00pm PST/Wednesday 9/23 8:00am CST/5:30am IST
  • The Zoom link has been changed, please refer to the most recent email for the Zoom link. You must be logged into MyTC to access the Hackathon - you will be allocated to breakout groups based on your TC email

Hackathon 1

Video Transcript

# 00:08:24.540 --> 00:08:33.720
# Charles Lang: Today what we're going to do is talk a bit about community building in online classes and this is what the next couple of hours is going to look like.
# 
# 13
# 00:08:35.220 --> 00:08:48.240
# Charles Lang: I'm not a huge fan of lecturing via Zoom. So we'll try and keep the intro pretty quick. I'm actually going to. So the first 20 minutes or so will be intro. We'll do a bit of background on
# 
# 14
# 00:08:49.770 --> 00:08:59.610
# Charles Lang: Community online communities in education and then I'm going to give you a bit of a break then because it looks like a lot of people have logged in, without
# 
# 15
# 00:09:01.020 --> 00:09:18.960
# Charles Lang: Without the TC email. So I want to put you into groups, but I'm gonna have to allocate those people manually. So then we'll take a five minute break and then we'll head into an activity where I'm going to get you to in groups, try to solve an educational problem.
# 
# 16
# 00:09:20.310 --> 00:09:31.590
# Charles Lang: Around building community in online classes and you'll do that in your group for about 40 minutes and the deliverable from that session will be a short presentation
# 
# 17
# 00:09:32.850 --> 00:09:34.620
# Charles Lang: Then we'll take another little five minute break.
# 
# 18
# 00:09:35.640 --> 00:09:52.620
# Charles Lang: You can get some tea or a cookie. Whatever you feel like, and then we'll come back and I'll merge the groups together, so two groups together, and you will present to each other what your solution is for this educational problem I'm going to pose to you.
# 
# 19
# 00:09:53.760 --> 00:10:05.070
# Charles Lang: And choose one of the groups to go forward and so they will all come back together for the end when we'll listen to a
# 
# 20
# 00:10:06.510 --> 00:10:14.610
# Charles Lang: A listen to the six groups that are left present and then we'll vote on a group whose solution
# 
# 21
# 00:10:15.630 --> 00:10:30.510
# Charles Lang: The teaching team will take and we will convert into your second assignment. That's what that voting is for, for us to choose the topic or the solution around online community building that you will be
# 
# 22
# 00:10:31.470 --> 00:10:35.040
# Charles Lang: Your second assignment. If you mute that will be handy.
# 
# 23
# 00:10:36.330 --> 00:10:43.500
# Charles Lang: quite hard for me to present and mute you. If you can stay on top of that makes my life easier. Thank you very much.
# 
# 24
# 00:10:44.970 --> 00:10:51.660
# Charles Lang: Okay, so that's the next couple of hours. Hopefully that makes sense. We'll have a Q&A in a bit.
# 
# 25
# 00:10:53.040 --> 00:11:00.090
# Charles Lang: So the next thing that I want to talk about is is today's about like how do you build community.
# 
# 26
# 00:11:00.870 --> 00:11:20.160
# Charles Lang: online spaces. So this idea of Community exists everywhere else. And you kind of hear about it a lot in terms of platforms to online communities as platforms probably all participate in some form. But there's a whole world, a whole kind of research endeavor and a whole software.
# 
# 27
# 00:11:23.190 --> 00:11:32.730
# Charles Lang: ecosystem around developing online groups of learners and developing communities of learners and that's kind of what we want to talk about today.
# 
# 28
# 00:11:34.110 --> 00:11:37.170
# Charles Lang: So I'm gonna throw out a poll
# 
# 29
# 00:11:46.530 --> 00:11:47.640
# Charles Lang: Some common questions.
# 
# 30
# 00:11:48.690 --> 00:11:50.610
# Charles Lang: That are often asked of students.
# 
# 31
# 00:11:51.720 --> 00:12:01.830
# Charles Lang: And first question is, is community important in online learning. And the second question is, who is most responsible for that community in an online class.
# 
# 32
# 00:12:03.240 --> 00:12:07.980
# Charles Lang: If you guys want to start answering. We'll see what the one answers we get
# 
# 33
# 00:12:45.630 --> 00:12:47.010
# Charles Lang: OK, so
# 
# 34
# 00:12:50.400 --> 00:13:01.770
# Charles Lang: The this. The first question tracks almost 100% with what other studies would say it's that students think that community is super important for online learning.
# 
# 35
# 00:13:07.470 --> 00:13:22.980
# Charles Lang: There's a, there's a kind of reasonable explanation for this, which is that online education feels a bit isolating and the students are really looking for assistance when they're trying to learn and that's harder in online spaces and the community can fill that gap. So, that makes sense.
# 
# 36
# 00:13:24.210 --> 00:13:31.020
# Charles Lang: Beyond that, there are a lot. There's lots of conflicting research. So everyone kind of has their own version of this. Whoops.
# 
# 37
# 00:13:34.440 --> 00:13:37.710
# Charles Lang: share the results with you. That's what results look like
# 
# 38
# 00:13:40.290 --> 00:13:47.160
# Charles Lang: So hundred percent, there's one person who's, who's, you know, and actually they own the idea that I don't think communities and one
# 
# 39
# 00:13:48.990 --> 00:13:53.820
# Charles Lang: Good on you for being mean especially and then on the second piece.
# 
# 40
# 00:13:55.290 --> 00:14:06.180
# Charles Lang: What people have tended to find is that learners with very little experience in online education tend to think that the instructor is responsible
# 
# 41
# 00:14:06.570 --> 00:14:20.730
# Charles Lang: For the community building within the online class and then students once they get above about one or one and a half years of online learning experience. They tend to take liability themselves and say that the
# 
# 42
# 00:14:20.730 --> 00:14:30.150
# Charles Lang: Student is most responsible for the online community. Now these aren't necessarily the wrong answer right like learners with little
# 
# 43
# 00:14:33.120 --> 00:14:34.380
# Charles Lang: Yourself, whoever just joined
# 
# 44
# 00:14:42.300 --> 00:14:43.290
# Charles Lang: The third
# 
# 45
# 00:14:46.500 --> 00:14:47.640
# Charles Lang: Okay, I'm gonna mute you
# 
# 46
# 00:14:58.680 --> 00:14:59.130
# Charles Lang: There we go.
# 
# 47
# 00:15:00.540 --> 00:15:14.250
# Charles Lang: So learners with little experience probably need more direction from the, from the instructor. They need direction on how to create community so that kind of makes sense. And then, once people get the hang of it. They kind of feel that they are responsible for
# 
# 48
# 00:15:15.810 --> 00:15:16.560
# Charles Lang: So,
# 
# 49
# 00:15:18.000 --> 00:15:30.450
# Charles Lang: Like I said, there's a lot of conflicting research around how to actually build community. So how there's no kind of recipe for how to build a good community online or otherwise, I suppose, everyone would do it. There are a few
# 
# 50
# 00:15:31.620 --> 00:15:43.920
# Charles Lang: Popular flavors, one of which is advanced is a kind of doing a constructivist view and for people who are who have less experience and education.
# 
# 51
# 00:15:45.180 --> 00:15:53.760
# Charles Lang: Here is a few kind of features of constructivist pedagogy, which is that learners are complex and they bring personal experience to their
# 
# 52
# 00:15:54.180 --> 00:16:07.350
# Charles Lang: To their learning. And that's important that new knowledge and skills have to be incorporated into that previous experience into those into a previous knowledge that students aren't a blank slate, they have
# 
# 53
# 00:16:08.610 --> 00:16:11.790
# Charles Lang: They think things before they turn up in your class. Shocking, I know.
# 
# 54
# 00:16:13.410 --> 00:16:23.880
# Charles Lang: That the constructivists really put the responsibility of learning on the students and the instructor's there to facilitate the student learning not to dictate the learning to the
# 
# 55
# 00:16:25.020 --> 00:16:31.050
# Charles Lang: To the student and therefore when we're thinking about a community of inquiry in a constructivist sense
# 
# 56
# 00:16:32.340 --> 00:16:49.500
# Charles Lang: The, the community needs to be driven by the learners and the kind of social support of learners is important in order to help them bridge between their experience and whatever, then, you know, enjoy new experiences and that community should be there to help them.
# 
# 57
# 00:16:51.210 --> 00:16:55.230
# Charles Lang: So here is a popular model.
# 
# 58
# 00:16:56.460 --> 00:17:11.250
# Charles Lang: Of this particular way of thinking about communities, which is called the community of inquiry model and it's made up of three complete pieces, which is that in any community of inquiry in any community of learners, we have
# 
# 59
# 00:17:12.360 --> 00:17:26.070
# Charles Lang: The social presence of the learner. So how does the learner project themselves into the community and then how good or bad are they at developing relationships within that community.
# 
# 60
# 00:17:26.760 --> 00:17:33.480
# Charles Lang: Then we have the cognitive piece which is really shorthand. In this sense, it's not so much the psychology, cognitive
# 
# 61
# 00:17:35.520 --> 00:17:36.510
# Charles Lang: Thing but it's really
# 
# 62
# 00:17:38.070 --> 00:17:46.140
# Charles Lang: Putting a lot of emphasis on understanding. So how well is the learner understanding both content and other learners in that space.
# 
# 63
# 00:17:46.560 --> 00:17:58.530
# Charles Lang: And then the third piece is the teaching presence, which is really not just like how good is the teacher teaching, but how well is the whole environment set up. How well is it designed, is it well thought out.
# 
# 64
# 00:17:59.640 --> 00:18:13.680
# Charles Lang: From that perspective I've put the link to a nice reasonably short review article on the webpage for the course. So it's just below the slides, which are also up on the website.
# 
# 65
# 00:18:16.230 --> 00:18:25.950
# Charles Lang: So we can look at a community experience with that of course by looking at the comments that I had you make on the most recent video
# 
# 66
# 00:18:26.460 --> 00:18:38.520
# Charles Lang: We can think about this in these three different ways which are so think about this in terms of social presence, well, maybe this isn't a great way of
# 
# 67
# 00:18:39.300 --> 00:18:46.320
# Charles Lang: Creating connections between students because it if you read through these and I would encourage you to go and read through them.
# 
# 68
# 00:18:46.980 --> 00:18:51.900
# Charles Lang: They are very unidirectional, they are. This is what I do. And there isn't much
# 
# 69
# 00:18:52.470 --> 00:18:59.280
# Charles Lang: Back and forth between people. So there's no relationship building there, but there is a sense that you have to own your presence. So we've got something there.
# 
# 70
# 00:19:00.090 --> 00:19:06.690
# Charles Lang: In terms of understanding. Well, most of that will be based around whether you've translated what you've watched in the video into a comment.
# 
# 71
# 00:19:09.030 --> 00:19:15.480
# Charles Lang: And so we might look at some kind of measurement around that. And then the third piece is like
# 
# 72
# 00:19:15.990 --> 00:19:26.700
# Yuan Chang: Excuse me, Dr. Brown. Yes, I'm just want to make sure, just want to confirm, you're not presenting anything on your screen or are you because we're not
# 
# 73
# 00:19:26.730 --> 00:19:30.150
# Charles Lang: Seeing your screen. Oh, it's telling me that I am I just
# 
# 74
# 00:19:32.040 --> 00:19:40.470
# Charles Lang: That's not good. Hold on, let me stop and start again. Thanks for telling me. Yeah. Jumping anything if it's not matching up definitely
# 
# 75
# 00:19:41.070 --> 00:19:46.200
# Malik Muftau: And I think you need to close. You need to stop sharing the poll and then you're going to be able to share the screen.
# 
# 76
# 00:19:47.160 --> 00:19:47.940
# Charles Lang: Am I still sharing
# 
# 77
# 00:19:51.240 --> 00:19:52.800
# Charles Lang: Um, it's still sharing posts.
# 
# 78
# 00:19:54.420 --> 00:19:56.160
# Yuan Chang: I don't think so. No we're not. So,
# 
# 79
# 00:19:58.500 --> 00:19:59.910
# Charles Lang: Okay. So now, how we doing
# 
# 80
# 00:20:02.220 --> 00:20:03.090
# Paolo Rivas: No good. No.
# 
# 81
# 00:20:03.150 --> 00:20:09.360
# Charles Lang: No. Okay, perfect. Thank you very much. Yet, don't ya don't just listen to me ramble on dear Lord.
# 
# 82
# 00:20:11.100 --> 00:20:21.570
# Charles Lang: Please feel free to jump in. Thank you very much. Okay, so if you were looking at it before. This is the just a screenshot from the YouTube
# 
# 83
# 00:20:22.290 --> 00:20:28.800
# Charles Lang: Where I had you make some comments underneath where you were explaining and if you haven't done that, I would encourage you to do it and also read other people's.
# 
# 84
# 00:20:29.280 --> 00:20:39.180
# Charles Lang: And then if we want to go back for a second and look at if we wanted to kind of analyze this activity in terms of a community of inquiry.
# 
# 85
# 00:20:39.720 --> 00:20:48.090
# Charles Lang: We would look at social presence, which maybe you might think, you know, there is an opportunity to project yourself but there's not a lot of opportunity to develop relationships.
# 
# 86
# 00:20:48.870 --> 00:20:55.800
# Charles Lang: And then cognitive presence. Well, it's all basically around the understanding that you might be drawing from the video and then trying to
# 
# 87
# 00:20:56.460 --> 00:21:09.540
# Charles Lang: Reach out say how that relates to yourself and then the teaching presence, really, that would be not just like whether this is a well constructed video, but it will also be like is this design.
# 
# 88
# 00:21:10.620 --> 00:21:19.680
# Charles Lang: Does it make sense to develop a community. So is having a linear list of comments the best way. How are those comments ordered
# 
# 89
# 00:21:21.810 --> 00:21:31.770
# Charles Lang: Does it. Is it a kind of an upvote system which there is. But we're not really using it, that kind of thing. So that's how you would kind of analyze one particular flavor.
# 
# 90
# 00:21:32.430 --> 00:21:44.520
# Charles Lang: Of community and how you might think about an at large framework. Like I said, I put that article up, but there is many others. There's probably four or five dominant ways of
# 
# 91
# 00:21:44.940 --> 00:21:58.710
# Charles Lang: Analyzing communities in online spaces just in education. And then if you move out into marketing or other fields and there's a bunch there too so huge area, not a huge amount of agreement on what best practice should be
# 
# 92
# 00:22:00.000 --> 00:22:05.520
# Charles Lang: But what I want you to do today is to apply a framework that I'm about to give you, and that is
# 
# 93
# 00:22:08.430 --> 00:22:12.330
# Charles Lang: I don't want to understand how important it is, if you are going to go forward and do this work.
# 
# 94
# 00:22:12.810 --> 00:22:25.110
# Charles Lang: This is what we're going to consistently work on for the rest of the semester because it really makes the difference between whether you will be successful or not in this field, and it is a workflow.
# 
# 95
# 00:22:26.910 --> 00:22:41.190
# Charles Lang: A way of thinking about a data specific problem. You can be a very good you can be very good at building models, but if you are not on top of this workflow, you are going to struggle in the workplace, because
# 
# 96
# 00:22:41.610 --> 00:22:49.350
# Charles Lang: The kind of application of sophisticated models to very large datasets ends up being a very small part of your job.
# 
# 97
# 00:22:49.830 --> 00:22:54.630
# Charles Lang: In any place in education. Now if you are a super hot shot and you've got
# 
# 98
# 00:22:55.050 --> 00:23:03.360
# Charles Lang: A PhD in particular in convolutional neural networks, then maybe you get to do it all day in a very large company, but for the most part, you need to be involved in a full
# 
# 99
# 00:23:03.960 --> 00:23:20.340
# Charles Lang: Data Science workflow. So that's what we're going to look at today. So the workflow starts with data and starts with a very simple question that trips people up constantly, which is what am I trying to count.
# 
# 100
# 00:23:21.600 --> 00:23:36.390
# Charles Lang: It turns out that's actually a really hard thing to answer. If, when I go to work with schools in particular, people think of obvious things but they can't really think of the things that they care about and how to count them. So everyone will say,
# 
# 101
# 00:23:37.950 --> 00:23:46.170
# Charles Lang: Test scores, everybody will say disciplinary data. But then, but that's not necessarily what they care about. So figuring out the thing to count is hard.
# 
# 102
# 00:23:46.680 --> 00:23:54.210
# Charles Lang: So here's an example, maybe I want to count the number of raised hands. Next thing you need to do is to process that data somehow so
# 
# 103
# 00:23:54.690 --> 00:24:05.790
# Charles Lang: How, what, how do I make meaning from that count, like is half the class raising their hands a good sign or a bad sign, is no one raising their hand a good sign or a bad sign
# 
# 104
# 00:24:06.900 --> 00:24:15.600
# Charles Lang: And then, so I'm going to process that count somehow. And now I'm going to decide, well, what does that processed count mean
# 
# 105
# 00:24:16.410 --> 00:24:26.700
# Charles Lang: So what information is it giving me is it giving me information that the class understands exactly what I'm saying is that what I need to infer
# 
# 106
# 00:24:27.330 --> 00:24:36.030
# Charles Lang: And then the last thing is. So what's the action that comes in response to that inference and this seems super simple. I know, but
# 
# 107
# 00:24:36.690 --> 00:24:41.220
# Charles Lang: When you go into a workplace or into a research project. Maybe you're interested in
# 
# 108
# 00:24:41.850 --> 00:24:45.600
# Charles Lang: Getting on top of this. If you haven't got on top of this, you are going to struggle
# 
# 109
# 00:24:45.960 --> 00:24:55.860
# Charles Lang: I promise. So this is what we're going to practice this is what we're going to practice today and we're going to keep practicing as we go forward. We're going to get more and more sophisticated in the ways that we deal with each of these buckets.
# 
# 110
# 00:24:57.120 --> 00:25:10.980
# Charles Lang: So we kind of come at an interesting time because also at the same time, people are making decisions about whether a human or machine should be doing each of these things and companies are turning up
# 
# 111
# 00:25:12.750 --> 00:25:23.640
# Charles Lang: With products that can replace people in each of these buckets. So we have to make a decision. Should we replace this this teacher or this administrator or this
# 
# 112
# 00:25:24.960 --> 00:25:26.700
# Charles Lang: You know caretaker with a machine.
# 
# 113
# 00:25:30.990 --> 00:25:40.980
# Charles Lang: What does this look like. So these buckets, as they don't come from nowhere. These are actual actually verticals within which companies seem to operate. Now there are companies that go across.
# 
# 114
# 00:25:41.340 --> 00:25:57.660
# Charles Lang: But there are companies, companies tend to try to pick a lane. So they are either predominantly a data generating and data keeping operation like the LMS, which is really a, it's really a database, right, with an interface on top of
# 
# 115
# 00:25:59.370 --> 00:26:09.840
# Charles Lang: A company, Clever, you may be familiar with. Their whole idea is to make single sign-on easy for education so that you can
# 
# 116
# 00:26:11.460 --> 00:26:20.040
# Charles Lang: like you do with your unique sign into one website and then be able to access a lot of other products. That's what, that's the service that they provide.
# 
# 117
# 00:26:20.850 --> 00:26:33.660
# Charles Lang: On the process side on the product turning, turning counts into a fame. If you think of something like Turnitin, which is trying to count plagiarism. For example,
# 
# 118
# 00:26:34.710 --> 00:26:41.580
# Charles Lang: Or this was originally, if you know Knewton. Knewton. This kind of almost closed down. But it was trying to
# 
# 119
# 00:26:42.720 --> 00:26:45.330
# Charles Lang: Use neural networks, mostly to
# 
# 120
# 00:26:46.440 --> 00:26:52.500
# Charles Lang: Figure out things in LMS and other data sets that they amassed
# 
# 121
# 00:26:53.520 --> 00:27:02.130
# Charles Lang: And then on the knowledge piece. This is where you might find one of your traditional publishing companies like Pearson, they are really trying to
# 
# 122
# 00:27:02.820 --> 00:27:15.270
# Charles Lang: Figure out what the counts mean and create an inference. And then lastly, there are a bunch of people who can offer you some kind of automated process at the end.
# 
# 123
# 00:27:15.840 --> 00:27:23.490
# Charles Lang: So that's the thing that I want you to work on today, I'm going to put you into groups. I'm gonna have to take five minutes to do that so
# 
# 124
# 00:27:24.510 --> 00:27:31.920
# Charles Lang: In a minute, you can go grab some tea or a cookie or something. And I'll make sure you're allocated to the right groups and then
# 
# 125
# 00:27:32.460 --> 00:27:40.920
# Charles Lang: What I want you to do is think of an activity that promotes community in this class. It could be one that already exists. So you can make one up totally fine.
# 
# 126
# 00:27:41.700 --> 00:27:50.880
# Charles Lang: I want you to find the data that generates that that activity generates and how you would count it, and then how you would process it.
# 
# 127
# 00:27:51.360 --> 00:27:57.600
# Charles Lang: And then what inference you are going to make from it. And then, what action you can take in response to your inference
# 
# 128
# 00:27:58.290 --> 00:28:05.130
# Charles Lang: And you can create two or three slides to explain your ideas. And then in 45 minutes I'm going to have you present to another group.
# 
# 129
# 00:28:05.970 --> 00:28:24.360
# Charles Lang: Okay, so this is practice. I know this does seem simple until you actually try to do it. And then it is hard. And it really is the backbone of this work. So if you give me five minutes, I'll make sure you're allocated to groups and then we can come back and start off. Okay.
# 
# 130
# 00:28:26.970 --> 00:28:29.610
# Charles Lang: But if you have any questions don't hesitate to yell out
# 
# 131
# 00:29:27.300 --> 00:29:29.550
# Rong Sang: Hey doctor. I have a question.
# 
# 132
# 00:29:33.870 --> 00:29:49.830
# Rong Sang: So I'm wondering about the rows of data in our activity because I don't know. So if, if we need to use the data to to do the interpretation. So are we supposed to use the data to
# 
# 133
# 00:29:51.030 --> 00:29:57.930
# Rong Sang: Make hypothesis of what we're going to do. Or do we use the data to test what we're going to do.
# 
# 134
# 00:30:00.480 --> 00:30:13.320
# Charles Lang: That's a good question. So what. So what I want you to do is get out of the frame that you are doing a psychology experiment, you are not trying to find the truth.
# 
# 135
# 00:30:16.080 --> 00:30:30.510
# Charles Lang: What you're trying to do is find something that might be actionable and it's a slightly different but very important distinction. So you were looking for some kind of pattern. You're not trying to prove that that pattern is true for
# 
# 136
# 00:30:31.680 --> 00:30:40.890
# Charles Lang: An average student, it is that the pattern exists, and we can get. We will get to more kind of sophisticated statistical measures.
# 
# 137
# 00:30:42.150 --> 00:30:46.110
# Charles Lang: But at the moment, what I want you to do is find, think of something that you could count.
# 
# 138
# 00:30:47.160 --> 00:30:56.940
# Charles Lang: And then think about what kind of patterns would come out of that count. And then how you would have to process that count in order to see that. Does that make sense.
# 
# 139
# 00:30:58.620 --> 00:31:01.140
# Rong Sang: Yeah. So what's the connection between
# 
# 140
# 00:31:02.790 --> 00:31:07.770
# Rong Sang: Like the suggestions we will come up with about the community building.
# 
# 141
# 00:31:09.150 --> 00:31:16.260
# Rong Sang: Is to help us prove our like just give us guide to design
# 
# 142
# 00:31:17.430 --> 00:31:18.300
# Charles Lang: The second one.
# 
# 143
# 00:31:19.950 --> 00:31:20.460
# Rong Sang: Okay.
# 
# 144
# 00:31:23.910 --> 00:31:24.540
# Charles Lang: Does that make sense.
# 
# 145
# 00:31:25.530 --> 00:31:26.160
# Rong Sang: Make sense
# 
# 146
# 00:31:26.940 --> 00:31:27.780
# Charles Lang: Okay, great.
# 
# 147
# 00:39:43.410 --> 00:39:44.880
# Charles Lang: Right, time to go.

4.2 Code Workout 3

Code Workout

  • Passcode: 6^hDausG
  • Closed captioning is enabled for the Code Workout
  • Files used in this Code Work available here

4.3 Assignment 2

Assignment 2 involves data wrangling and visualization. Please complete Assignment 2 by Monday October 5 at midnight EDT. If you run into any problems don’t hesitate to sign up for office hours.

Week 5 Data Wrangling

  • Code Workout Thursday 8:30am EDT/5:30am PST/8:30pm CST/5:00pm IST

5.1 Data Wrangling I

Data Wrangling I

Video Transcript

# Good morning, good afternoon, good evening. This week I am providing a few quick videos on data wrangling, hopefully to provide some context for your second assignment.

# Data wrangling is the processing of data for use in analysis. There are many ways we might need to wrangle data depending on how it was acquired, its original format, how complete it is, how reliable it is and what purpose we plan to use it for.

# Many of you may have heard the term “data cleaning”. Data cleaning refers to the act of characterizing corrupt, inaccurate, or missing data and replacing or deleting this “dirty” data. Data wrangling involves these processes, as well as restructuring the data for a specific purpose, such as an analysis, or so that a computer can make sense of it. An extreme example is “unstructured data” such as audio files containing speech. This data needs to be structured in a way that a machine can analyze. This might involve transcribing the audio to words, extracting volume data, or parsing the different people speaking on a single track. In all cases the data needs to be converted into a data structure that can be analyzed, such as a matrix of words or rows and columns of data in a data frame.

# Another term used to describe data that is formatted specifically into rows and columns is “tidy”. Tidy data is in a structure that is “intuitive” to R. R is a “vectorized” language, which is a complex set of ideas to do with how computations are made and prioritized in R and this makes it different to other languages. The simplest explanation is that languages like Python prioritize row-by-row calculations whereas R prioritizes column-by-column calculations. 
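To make the column-by-column idea concrete, here is a minimal sketch in base R. The data frame and its column names (student, quiz1, quiz2) are made up for illustration and are not from the assignment.

```r
# Hypothetical tidy data: one row per student, one column per variable
scores <- data.frame(
  student = c("s1", "s2", "s3"),
  quiz1   = c(8, 6, 9),
  quiz2   = c(7, 10, 5)
)

# Vectorized arithmetic: the two columns are added element by element
# in a single expression, with no explicit loop over the rows
scores$total <- scores$quiz1 + scores$quiz2

# Functions such as mean() also take a whole column at once
quiz1_mean <- mean(scores$quiz1)

scores
```

Because each variable sits in its own column, a whole-column operation like this applies to every student at once.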

# When we are wrangling, cleaning or tidying up data we want to preserve an accurate record of the actions we have taken so that we can reproduce them in the future and so it doesn’t look like we have made the data up. That’s why you need to learn to do all this in R, so there is a reproducible record of your work in code. It is always tempting to go back to Excel and mess with the data if you are more familiar with that software, but you need to hang in there and figure it out in R. GUI buttons are not easily reproduced.
# The issue, as always in R, is that there are 100 different ways to do anything, and there are several ways to reformat data. There are base R ways: you have learned the matrix notation to select columns and rows, and there are also base R commands such as aggregate() and rbind(), but these are somewhat piecemeal and do not have consistent syntax. To deal with this issue, several packages have been developed to put all the commands you may need when data wrangling into one place. The two main options are data.table and the tidyverse. We will concentrate on the tidyverse in this class as it has more developed documentation and more people behind it.

# The tidyverse started as a series of packages created by the prolific R developer Hadley Wickham. The tidyverse is made up of eight main packages, including two that are important for wrangling: tidyr and dplyr. tidyr contains a set of functions specifically for creating tidy data, data that is interpretable by R. dplyr provides a set of functions specifically for manipulating data in R; you can think of it as SQL in R, if that makes any sense.

# In the next video I am going to go over a few important functions that you will need in assignment 2. 

5.2 Data Wrangling II

Data Wrangling II

Video Transcript

# The first command that you should be familiar with is %>%, the percent-greater than-percent command. These three characters together are called the “pipe”, and that is what it does: it “pipes” whatever is on the left hand side of the pipe to the right hand side of the pipe. If you have looked at the assignment then you will have seen this in action in the group_by() statement.

# Here we take a data frame D1 and pipe it into the group_by() statement.
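The transcript does not include the code shown on screen at this point. As a rough sketch of what piping D1 into group_by() could look like, assuming the dplyr package is installed: the D1 below is a small made-up data frame (hypothetical columns student, grade and score), not the assignment data.

```r
library(dplyr)

# Hypothetical stand-in for D1; the assignment data will look different
D1 <- data.frame(
  student = c("s1", "s2", "s3", "s4"),
  grade   = c(9, 9, 10, 10),
  score   = c(12, 8, 15, 11)
)

# Without the pipe: D1 is written as the first argument
grouped_a <- group_by(D1, grade)

# With the pipe: D1 on the left is passed in as the first argument
# of group_by() on the right
grouped_b <- D1 %>% group_by(grade)

# Both versions are grouped by the same variable
group_vars(grouped_a)
group_vars(grouped_b)
```

The two forms do the same thing; the pipe mainly pays off when several steps are chained, because the code then reads left to right in the order the steps happen.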

# The group_by() command is the next command I would like you to be familiar with. Group_by() groups rows of data together according to the variable you assign as the grouping variable. For example, we can group students according to their grade with the command group_by(D1, grade). If you want to ungroup the data frame you can use the command ungroup(D1). 
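A small sketch of grouping and ungrouping, again assuming dplyr is installed and using a made-up D1 with hypothetical columns. One detail worth noticing is that group_by() and ungroup() return new objects, so the result has to be assigned to a name if you want to keep it.

```r
library(dplyr)

# Hypothetical data; the real D1 from the assignment has different columns
D1 <- data.frame(
  student = c("s1", "s2", "s3", "s4"),
  grade   = c(9, 9, 10, 10),
  score   = c(12, 8, 15, 11)
)

# group_by() returns a grouped copy; D1 itself is not changed
D1_grouped <- group_by(D1, grade)
is_grouped_df(D1_grouped)  # does the result carry groups?
n_groups(D1_grouped)       # number of distinct grades

# ungroup() removes the grouping again; rows and columns stay the same
D1_flat <- ungroup(D1_grouped)
is_grouped_df(D1_flat)
```

Grouping on its own does not change what the data looks like; it changes how later commands such as summarize() behave.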

# Previously we selected columns and rows using matrix notation, but you can also use the tidyverse command select() to select columns from a data frame. There are many modifications to this command that can make your life a little easier so you don’t have to count column numbers. You can use the variable names, or helper functions such as contains(), which selects the columns with specific characters in their name, ends_with(), which selects the columns with names ending in specific characters, or starts_with(), which selects columns whose names start with specific characters.
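The select() helpers described above can be sketched as follows, assuming dplyr is installed. The data frame and its column names are invented so that each helper has something to match.

```r
library(dplyr)

# Hypothetical data with column names chosen to show the helpers
D1 <- data.frame(
  student_id = c("s1", "s2", "s3"),
  quiz_score = c(8, 6, 9),
  exam_score = c(70, 85, 90),
  quiz_time  = c(12, 15, 9)
)

# Select by variable name
by_name <- D1 %>% select(student_id, exam_score)

# Columns whose names contain "quiz" anywhere
has_quiz <- D1 %>% select(contains("quiz"))

# Columns whose names end in "score"
end_score <- D1 %>% select(ends_with("score"))

# Columns whose names start with "quiz"
start_quiz <- D1 %>% select(starts_with("quiz"))

names(end_score)
```

In this made-up data, contains("quiz") and starts_with("quiz") pick the same columns; they would differ if a column such as final_quiz existed.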

# To select rows you can use the filter() command. The filter command selects rows according to some logical criteria such as score > 10. To select specific rows by position you can use the command slice() if you know the rows that you are interested in. It is important to note that the numbers along the left hand side of your data frame in RStudio are called the index numbers. They are attached to the rows to keep track of them, and they are what you are referencing when you use the slice command.
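The difference between filter() and slice() can be sketched as follows, again on a made-up data frame (student and score are assumed column names):

```r
library(dplyr)

# Hypothetical data frame for illustration
D1 <- data.frame(
  student = c("a", "b", "c", "d"),
  score   = c(12, 8, 15, 9)
)

# filter() keeps the rows that meet a logical condition
high <- D1 %>% filter(score > 10)

# slice() keeps rows by position: here the first two rows
first_two <- D1 %>% slice(1:2)

# Positions do not need to be consecutive: here rows 1 and 4
picked <- D1 %>% slice(c(1, 4))
```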

# Finally we will take a look at a more complex command called summarize(). Summarize collapses a data frame into a single row. Usually you want to do this to make calculations down a column. For example, if I want to know the average of class scores I can summarize the class scores column. Keep in mind the new data frame will only contain the variable you mentioned in the command. This command is very useful when used in combination with the group_by() command for finding group averages or sums. 
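A minimal sketch of summarize() on its own and combined with group_by(), assuming a made-up data frame with grade and score columns:

```r
library(dplyr)

# Hypothetical data frame for illustration
D1 <- data.frame(
  grade = c(9, 9, 10, 10),
  score = c(12, 8, 15, 11)
)

# Collapse the whole data frame into one row holding the average score
overall <- D1 %>% summarize(mean_score = mean(score))

# Grouping first gives one row per group instead of one row overall
by_grade <- D1 %>%
  group_by(grade) %>%
  summarize(mean_score = mean(score))
```

In the grouped version the result keeps the grouping variable alongside the new summary column, so each average can be matched to its grade.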

5.3 Code Workout 4

Code Workout

  • Passcode: 0Rc4s^eC
  • Closed captioning is enabled for the Code Workout
  • Files used in this Code Workout are available here

Week 6 Social Network Analysis I

6.1 Hackathon 2: Intro to SNA

  • The second hackathon for the semester will be held Tuesday 10/6 8pm EDT/5:00pm PST/Wednesday 10/7 8:00am CST/5:30am IST
  • The Zoom link has been changed, please refer to the most recent email for the Zoom link. You must be logged into MyTC to access the Hackathon - you will be allocated to breakout groups based on your TC email

Hackathon 2

6.2 Code Workout 5

Code Workout 5

  • Passcode: q4^Ue81$

Week 7 Social Network Analysis II

7.1 Code Workout 6

Code Workout 6

  • Passcode: Wyl.3z^G

7.2 Adjacency Matrices

Adjacency Matrices

Video Transcript

# It looks like you have all made a lot of progress on assignment 3 over the weekend. Nice work. So we need to talk a little about data structures so you can complete the final stretch.

# Igraph requires data to be structured in two main ways. The first is the vertex and edge list that you are familiar with, the second is a little more complex and is called an adjacency matrix. 
 
# Before we look at adjacency matrices it is worth discussing the matrix format in R. A matrix in R is an extension of a vector into an extra dimension. It is like a dataframe with columns and rows but in a matrix all cells must be of the same type - columns cannot be different types as is possible in a dataframe. So if one cell in a matrix is numeric then all cells are. This is to allow matrices to perform linear algebra - a very powerful tool and particularly useful in doing large computations such as those used in social network analysis. Limits to computational power prevented the study of large networks for a long time and so saving computational energy through linear algebra was essential. 
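The single-type rule can be seen in a small base R sketch (the values are arbitrary and chosen only for illustration):

```r
# A numeric matrix: values fill the matrix column by column
m_num <- matrix(1:6, nrow = 2, ncol = 3)
dim(m_num)
typeof(m_num)

# Mixing a string in with numbers forces every cell to become character
m_mixed <- matrix(c(1, "a", 3, 4), nrow = 2)
typeof(m_mixed)

# A data frame, by contrast, can hold a different type in each column
df <- data.frame(n = c(1, 2), s = c("a", "b"))
sapply(df, class)
```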

# In particular, matrix multiplication allows for the conversion of matrices from one node dimension to another. For example, if we have a matrix of students by classes we can convert that to a matrix of class by class using matrix multiplication. We just need to transpose our matrix first, so that the class by student matrix is multiplied by the student by class matrix. Following the dot product rule, this converts our student-class matrix into a class-class matrix. The R code for transposing matrices is t() and the code to multiply matrices is %*%.
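As a small worked sketch, assume a made-up student by class matrix in which 1 means the student takes the class (the names s1 to s3 and classA to classC are hypothetical):

```r
# Hypothetical student-by-class matrix: 1 = student is enrolled in the class
student_class <- matrix(
  c(1, 1, 0,
    1, 0, 1,
    1, 1, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("s1", "s2", "s3"), c("classA", "classB", "classC"))
)

# (class x student) %*% (student x class) gives class x class:
# each cell counts the students two classes have in common
class_class <- t(student_class) %*% student_class

# Reversing the order gives student x student:
# each cell counts the classes two students share
student_student <- student_class %*% t(student_class)

class_class
student_student
```

The diagonal of each result pairs a node with itself, so in class_class it holds the number of students in each class.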
 
# In social network analysis the resulting data structure is called an adjacency matrix - it stores information about the relationships between nodes in the cells of the matrix. In this case, how many students each class has in common. Igraph will generate a graph from this information, plotting each class as a node and each non-zero cell as an edge. This can be accomplished by using the code: graph_from_adjacency_matrix() or graph.adjacency(). An important option to provide to igraph is what to do with the diagonal, that is, the combinations of a class with itself. In many cases the combination of a node and itself has no meaning, so the option diag = FALSE will ignore them.
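A minimal sketch of turning an adjacency matrix into a graph, assuming a small made-up class by class matrix. The options mode = "undirected" and weighted = TRUE are choices made for this example, not requirements of the assignment.

```r
library(igraph)

# Hypothetical class-by-class adjacency matrix:
# off-diagonal cells count the students two classes share
classes <- c("classA", "classB", "classC")
class_class <- matrix(
  c(3, 2, 2,
    2, 2, 1,
    2, 1, 2),
  nrow = 3, byrow = TRUE,
  dimnames = list(classes, classes)
)

# diag = FALSE ignores the diagonal, so no class is linked to itself;
# weighted = TRUE stores each cell value as an edge weight
g <- graph_from_adjacency_matrix(
  class_class,
  mode = "undirected",
  weighted = TRUE,
  diag = FALSE
)

vcount(g)    # number of nodes (classes)
ecount(g)    # number of edges (pairs of classes that share students)
E(g)$weight  # shared-student counts attached to the edges

plot(g)
```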
 
# The takeaways here are that matrices are a convenient way to store network data and that they make linear algebra available to us to help manipulate data structures, for example, converting between dimensions. In assignment 3 you will need to use this tool to create your student by student matrix. Good luck!

7.3 Code Workout 7

Code Workout 7

  • Passcode: t*T11#SJ

7.5 Assignment 3

Please fork and clone assignment 3 from the Github Repo. Due date is 10/21. A code workout will discuss this assignment at 8:30am EDT/5:30am PST/8:30pm CST/5:00pm IST on Thursday 10/15 (see Google calendar for details).

7.6 New Swirl Lessons Available

Instructions for installation:

First run:

uninstall_all_courses()

And then reinstall the new courses:

install_course_github("core-methods-in-edm", "swirl", multi = TRUE)

If your internet connection is limited due to bandwidth or VPN:
  • Download this version
  • Unzip the file and then run:

install_course()

And find the file on your computer.

Week 8 Clustering

8.1 Code Workout 8

Code Workout 8

  • Passcode: ^jdit%7Y

8.2 Assignment 4

Please fork and clone assignment 4 from the Github Repo. Due date is 11/05 at 5:00pm EDT. A code workout will discuss this assignment at 8:30pm EDT/5:30pm PST/8:30am CST/5:00am IST on Tuesday 11/03/Wednesday 11/04 (see Google calendar for details).

8.3 Code Workout 9

Code Workout 9

  • Passcode: 5Q+5wd*t

8.4 Hackathon 4

Hackathon 4

  • Passcode: 6E=fzdPy

Task 1

Task 2

Task 3

Task 4

Week 9 Principal Component Analysis

9.1 Code Workout 10

Code Workout 10

  • Access Passcode: qEt8+*nU

9.2 Code Workout 11

Code Workout 11

  • Access Passcode: 8L0Nx$$g