1 Introduction

Statistics 240 is a first course in data science and statistical modeling at the University of Wisconsin–Madison. The course aims to enable you to gain insight into real-world problems from messy data using the methods of data science. These notes chart an initial path toward the knowledge and skills needed to become a data scientist.

1.1 Case Studies

The course is structured as a series of case studies. In each one, you will analyze data sets to answer interesting questions about engaging real-world issues from a variety of domains, such as climate change, travel, health, animal behavior, the election, the search for exoplanets, and more. Each case study takes you through a common approach to working with new data: import, clean and tidy, transform, visualize, model, gain insight and understanding, and then communicate. In early case studies, many of the steps are provided so that you can focus on learning a single stage of the data analysis process in depth. In subsequent case studies, you will practice previously learned skills on new data while learning the details of a different stage of the process. By the end of the semester, you will have gained a level of mastery that allows you to carry out all steps of a basic analysis with novel data. You will have gained a new power to learn about the world on your own.

1.2 Reproducibility, Scalability, and Writing Code

Reproducible data analysis requires writing code that can be shared with others, including your future self. These notes will help teach you to write R code for the practice of data science, using the RStudio integrated development environment (IDE) as your interface. The R language and the RStudio IDE are not the only ways to approach data science, but they are common, and the skills developed in this course are widely used in more advanced statistics courses and valued in the workplace. If you have little previous experience writing code or using a computer for purposes beyond web browsing, email, and playing games, you might wonder after a few days why we cannot just use an easier point-and-click software package. There are many reasons, but two important ones are reproducibility and scalability. When you write good code, you can redo your analysis or share the code and data with someone else so they can replicate it. If you come back to a project you worked on six months earlier, you can run the code again to replicate the analysis without needing to remember which sequence of mouse clicks you used. You might be able to reuse your code for new data, such as when a data set is updated with data from another year, or when you get data of a similar structure from another source. The kind of hand-editing that is feasible for a few dozen or a few hundred data values is not feasible for data with hundreds of thousands of records. And you might want to repeat an analysis you did on one data set on hundreds of similar ones. The effort you put into learning to write code for each step of an analysis pipeline will pay off in the future, and even in the short term in this class.
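As a rough illustration of the reuse described above, the following sketch in base R writes the steps of a small analysis once, as a function, and then applies it to several data sets of the same structure. The function name `summarize_temps`, the column `temp`, and the data values are all made up for this example; they are not taken from any course case study.

```r
# Hypothetical helper: the same summary steps, written once.
summarize_temps <- function(df) {
  data.frame(
    n         = nrow(df),                      # number of records
    mean_temp = mean(df$temp, na.rm = TRUE),   # average, ignoring missing values
    max_temp  = max(df$temp, na.rm = TRUE)     # largest observed value
  )
}

# Made-up data standing in for files collected in different years.
datasets <- list(
  year_2020 = data.frame(temp = c(10, 12, NA, 15)),
  year_2021 = data.frame(temp = c(11, 13, 14, 16))
)

# Apply the identical steps to every data set and stack the results.
results <- do.call(rbind, lapply(datasets, summarize_temps))
print(results)
```

Adding data from another year only requires adding one more element to `datasets`; the analysis code itself does not change, which is the sense in which code scales where hand-editing does not.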

1.3 Course Structure

Your journey to mastery in this course will involve traveling along several parallel paths.

Case Study Lectures
- The course presents several case studies that motivate learning through the discovery of answers to questions of interest. Each case study includes a complete data analysis process: reading, cleaning, and wrangling data; exploring it through visualization and summarization; modeling to reveal structure; and finally interpretation and inference. At the beginning of the course, much of the work is done for you so that you can concentrate on mastering one key concept at a time. By the end of the course, you will be able to start from scratch with a new data set and conduct a thorough analysis independently. You will learn from the case studies by reading the course notes and attending lecture.
Concept Mastery Lectures
- The case studies introduce new ideas as they are needed to answer the specific questions of each study. The concept lectures focus instead on broader details of concept, theory, and method, using both simple examples and data from the case studies. These lectures aim to increase breadth of understanding rather than to answer questions about a specific situation.
Reading
- In addition to learning by watching your teacher explain things, you will also have regular reading assignments. There are two primary sources: the online textbook R for Data Science and these online course notes.
Discussion Sessions and Assignments
- Each week during the semester, you will have a short group assignment that focuses on one specific skill. These assignments are intended to be completed with your group during the discussion session, where a teaching assistant is available to answer questions immediately.
Individual Homework Assignments
- Most weeks, there will be a longer assignment with a mix of short and longer problems that ask you to practice skills learned from all of the preceding paths of learning.
Final Project
- In the second half of the course, you will work with a group to carry out a project where you exhibit your mastery of course topics on a case study with data chosen by your group to address the questions you wish to answer.
Exams
- There are two opportunities to show mastery of material on exams. First, midway through the semester, a midterm exam with in-person and take-home parts allows you to demonstrate mastery of the concepts and practical skills for working with data that you have built from the R for Data Science book and these course notes. Second, at the end of the semester, a comprehensive final exam tests both the first part of the course, on data management, and the second part, on statistical modeling and inference.

1.4 Computer and Software Requirements

To participate in the course, you will need a computer capable of running the R language through the RStudio environment. Both of these software packages are free and easy to install on the Windows, macOS, or Linux operating systems. In-person discussion sessions require access to a laptop computer.

1.5 Next Steps

We want you to get to the stage where you can begin doing data analysis and writing code on your own computer as soon as possible. But to do so, we need some preliminary steps to set your computer up for these tasks. That is the aim of the next chapter.

Statistics 240 Course Notes