Data Science in R: A Gentle Introduction
Welcome!
Part I: Data exploration
1
Getting started in R
1.1
Download R and RStudio
1.2
First steps
Interacting with R
How you’ll get feedback
R as a calculator
R is case sensitive
1.3
Objects
1.4
Scripts
Creating and running scripts
A slightly more interesting script
Why can’t I just point and click?
1.5
Getting help
My advice
1.6
Libraries
Installing a library
Loading a library
Dealing with installation errors
2
Data
2.1
Data frames, cases, and variables
2.2
Samples versus populations
2.3
The unit of analysis
Example 1: politics
Example 2: baseball salaries
Example 3: politics again
Example 4: getting older
2.4
Importing a data set
2.5
Analyzing data: a short example
2.6
Importing data from the command line
3
Counting
3.1
Getting started: ACL Fest
3.2
Simple probabilities
Using xtabs alone
Using prop.table
Using pipes
3.3
Joint probabilities
P(A, B)
P(A or B) and the addition rule
3.4
Conditional probabilities
Calculating P(A | B) from data
Checking independence
Study questions
4
Plots
4.1
The grammar of graphics
4.2
The five basic plots
Scatter plots
Line graphs
Histograms
Boxplots
Bar plots
4.3
Customizing plots
Changing titles and labels
Color scales
Font size
Flipping the \(x\) and \(y\) axes
Plotting cheat sheet
Study questions
5
Summaries
5.1
The typical value
5.2
Variation
5.3
Extremes and quantiles
5.4
z-scores
6
Data wrangling
6.1
Key data verbs
group_by
filter
select
mutate
arrange
6.2
Complex summaries
Example 1: the five coldest months
Example 2: survival on the Titanic
Example 3: toy imports
6.3
Summary shortcuts
7
Fitting equations
7.1
What is a regression model?
7.2
Fitting regression models
7.3
Using and interpreting regression models
Summarizing a relationship
Making predictions
Making fair comparisons
Decomposing variation
7.4
Beyond straight lines
Exponential models
Power laws
Summary
Part II: Statistical inference
8
Statistical uncertainty
8.1
Sources of uncertainty
8.2
Sampling distributions
Real-world vs. statistical uncertainty
Example 1: dessert
What the sampling distribution tells us
Example 2: fishing
Summary
8.3
The truth about statistical uncertainty
Example 1: commuting
Example 2: dessert again
When is statistical inference useful?
9
The bootstrap
9.1
The bootstrap sampling distribution
9.2
Bootstrapping summaries
Example 1: sample mean
Bootstrap standard errors and confidence intervals
The biggest bootstrapping gotcha
Example 2: sample proportion
9.3
Bootstrapping differences
Example 1: sleep hours by gender
Example 2: smoking and depression
9.4
Bootstrapping regression models
Example 1: sleep versus age
Statistical vs. practical significance
Example 2: West Campus rents
9.5
Bootstrapping usually, but not always, works
What “confidence” means
Example 1: sample mean
Example 2: sample minimum
Closing advice
Study questions: the bootstrap
10
p-values
10.1
Example 1: did the Patriots cheat?
10.2
The four steps of hypothesis testing
10.3
Example 2: a disease cluster?
10.4
Interpreting p-values
11
Large-sample inference
11.1
The Central Limit Theorem
The normal distribution
de Moivre’s equation
The CLT in action
11.2
Confidence intervals for a mean
Example 1: sleep, revisited
t.test shortcut
Example 2: cheese
11.3
Beyond de Moivre’s equation
The basic recipe of large-sample inference
Example 1: sample proportion
Example 2: difference of means
Example 3: difference of proportions
Example 4: regression model
Summary
Part III: Models
12
Experiments
12.1
Causal vs. statistical questions
12.2
The importance of a control group
12.3
Randomization
12.4
Blocking
12.5
Example: labor market discrimination
Study questions: experiments
13
Observational studies
13.1
Natural experiments
13.2
Matching
13.3
Cohort studies
14
Grouped data
14.1
Baseline/offset form
Example 1: heights
Example 2: cheese
Why do this?
14.2
Models with one dummy variable
Example: cheese
R can create dummy variables for you
14.3
Models with multiple dummy variables
The data
Model 1: FarAway + Littered
Using factor
Model 2: adding subject-level effects
14.4
Interactions
What’s an interaction?
Example 1: Back to video games
When to include interactions
Example 2: Recidivism and employment
14.5
ANOVA: the analysis of variance
15
Regression
15.1
Numerical and grouping variables together
Example 1: causal confusion in house prices
Partial vs. overall relationships
Example 2: cheese sales, revisited
15.2
Interactions of numerical and grouping variables
Example 1: used Mercedes cars
Interaction vs. correlation
Example 2: SAT scores and GPA
15.3
Multiple numerical variables
Example 1: gas mileage
Partial slopes
Example 2: education spending
Example 3: Airbnb prices
15.4
Summarizing regression models
The regression table
Standardized coefficients
What about ANOVA?
15.5
Regression in the real world
The data
Model building
Model checking and further refinement
Statistical vs. practical significance, revisited
15.6
“What variables should I include?”
16
Prediction
16.1
Evaluating predictive models
Example: predicting the price of a house
Split, train, test
Overfitting
16.2
Feature engineering
Example 1: Capital Metro bus boardings
Example 2: Electricity demand in Texas
Summary
16.3
Feature selection
The problem: combinatorial explosion
One possible solution: stepwise selection
An example
17
Probability models
17.1
What is a probability model?
Describing randomness
A simple example
Expected value
17.2
Models for discrete outcomes
The binomial distribution
The Poisson distribution
17.3
The normal distribution, revisited
Some history
When is the normal distribution an appropriate model?
Application: modeling long-term asset returns
17.4
The bivariate normal distribution
Modeling correlation
Application: stocks and bonds
References