ISTA 321 - Data Mining
Welcome to ISTA 321 - Data Mining! The goal of this class is to teach you how to use R to make informed inferences and predictions from large datasets using a variety of methods. This requires a mixture of many skills, including programming, data exploration and visualization, statistics, algorithms, machine learning, model validation, and general data wrangling. We don’t do these things in isolation; we do them with the goal of answering a question. Being able to apply this knowledge to make a data-driven decision is therefore equally critical.
1.1 Critical resources and information
The syllabus is posted on Slack and in the announcements on D2L. Please make sure you read it!
The main book that you’ll use is the website you’re on right now. I wrote this to provide a text that teaches the key coding, algorithm, and conceptual ideas concurrently. When I refer to ‘my book’ I’m talking about this website.
We’ll also be using the textbook An Introduction to Statistical Learning with Applications in R, which is available free at that link. It is a wonderfully written book, and if you’re going to be staying in the data world I highly suggest buying a paper copy.
I will assign chapters from the textbook that are meant to supplement the lessons in my book. My lessons here are already quite long, so they tend to skip some of the deeper conceptual issues that the textbook covers. You will be tested on both, so be sure you’re actually reading!
1.1.3 R and RStudio
We use the R programming language in this class, and I highly suggest using the RStudio IDE. If you already have them installed, then please take a few minutes to update both to the latest version.
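If you’re not sure which version of R you currently have, a quick check along these lines should tell you. This is only a minimal sketch: the 4.0.0 threshold is an arbitrary example I picked for illustration, not a course requirement.

```r
# Store and print the version string of the R session that is currently running
r_version <- R.version.string
print(r_version)

# getRversion() returns the version as an object that can be compared.
# "4.0.0" is an arbitrary example threshold, not an official requirement.
is_recent <- getRversion() >= "4.0.0"
print(is_recent)
```

Run it once before updating and once after to confirm the update actually took.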
You only need to have taken one class that used R to do well in this class. Alongside most weekly lessons I provide supplementary programming lessons. For example, there are lessons on for loops and on data visualization. If you have already taken a bunch of programming classes you might not need these. But for those of you who haven’t, they are there to get you up to speed.
If you want to learn more R basics, I suggest checking out Garrett Grolemund and Hadley Wickham’s book R for Data Science, which is free online.
1.1.4 Copy-Pasting Code
A persistent issue that I encounter in this class is that students will just copy and paste code from the lessons into R and then try to tweak it for their homework. I can’t stress enough how much I don’t want you to do this. What normally happens with these students is that they never learn how each function and the arguments within it interact. So although they got the lesson to run, when they go to tweak the code for their homework they make one small mistake (spelling error, indexing error, data type error, etc.) and then everything breaks. Predictably, they don’t do well on their homework.
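To make those small mistakes concrete, here is a rough sketch of what each one can look like. The `scores` data and all of the variable names are made up for illustration and don’t come from any lesson.

```r
# Hypothetical data, used only for illustration
scores <- c(90, 85, 77)

# Spelling error: R is case-sensitive, so `Scores` is a different name than `scores`
found_lower <- exists("scores")   # TRUE
found_upper <- exists("Scores")   # FALSE; using Scores directly would throw an error

# Indexing error: asking for a 4th element of a 3-element vector gives NA, not an error
fourth <- scores[4]

# Data type error: the same numbers stored as text cannot be averaged.
# Without suppressWarnings(), R also prints a warning here.
scores_chr <- c("90", "85", "77")
bad_mean <- suppressWarnings(mean(scores_chr))   # NA

# Converting to numeric first gives the intended result
good_mean <- mean(as.numeric(scores_chr))        # 84
```

Notice that two of the three mistakes return NA quietly rather than stopping with an error, which is part of why they are hard to spot in code you didn’t type yourself.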
Also, even if you’re a moderately experienced coder and don’t make the small mistakes, going through the process is critical to developing your knowledge of the language.
Point being, please code from scratch for both the lessons and the homework. You will do much better on the exams and walk away a significantly better programmer and data scientist if you do this.
1.1.5 Slack and Getting Help
Although I’m the professor of this course, I should not be the first person you go to with a content question. Instead, you should post questions about coding, understanding a concept, how to interpret an output, etc. to the appropriate channel on the course Slack. If you are nailing all these things, then take time to answer those questions! 10% of everyone’s grade comes from just participating in this way on Slack. It’s easy points, and it better represents how you should go about getting help in the real world… use your coworkers first, not your boss.
When it comes to actually posting code on Slack, please use Slack’s formatting feature. If you put three backticks (```) before and after the code, Slack will format it as code. Also, please don’t post full scripts. Trace the error back to where it occurs and post just that line, plus the couple of preceding lines if they’re needed.
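As a rough illustration of narrowing an error down before posting, the sketch below captures an error and pulls out the two pieces worth sharing: the call that failed and the message R gave. The data frame `df` and its `height` column are hypothetical, and this is just one possible approach.

```r
# Hypothetical data: heights were read in as text rather than numbers
df <- data.frame(height = c("150", "160", "170"), stringsAsFactors = FALSE)

# tryCatch() captures the error object instead of stopping the script
err <- tryCatch(sum(df$height), error = function(e) e)

# The two pieces worth posting: the call that failed and what R said about it
failing_call <- deparse(conditionCall(err))
error_text <- conditionMessage(err)

print(failing_call)
print(error_text)
```

Those two lines, plus the line that created `df`, are all anyone needs to see in order to help. In an interactive session, calling `traceback()` right after an error is another way to find which call failed.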
1.1.6 Working Together
I don’t want to stop anyone from working together, although overall I encourage people to work alone. The lessons are structured in a linear way where they build continuously. They’re not just a bunch of unrelated questions where you can divide and conquer. As a result, when people work together on the homework it tends to be one person doing all the coding while the other just follows. Predictably, the follower then tends to do really poorly on the exam.
If you do work in a pair, it needs to be clear that what you submit is your own work. This means you must make sure your annotation, answers to application, etc. clearly show that although you coded some stuff together, you didn’t just straight copy the assignment.