13.3 Experimental Estimates of the Effect of Class Size Reductions
Experimental Design and the Data Set
The Project Student-Teacher Achievement Ratio (STAR) was a large randomized controlled experiment designed to assess whether class size reductions are effective in improving education outcomes. It was conducted by the Tennessee State Department of Education in 80 elementary schools over a period of four years during the 1980s.
In the first year, about 6400 students were randomly assigned to one of three interventions: small class (13 to 17 students per teacher), regular class (22 to 25 students per teacher), and regular-with-aide class (22 to 25 students with a full-time teacher’s aide). Teachers were also randomly assigned to the classes they taught. The interventions were initiated as the students entered school in kindergarten and continued through third grade. Control and treatment groups across grades are summarized in Table 13.1.
|             | K                    | 1                    | 2                    | 3                    |
|-------------|----------------------|----------------------|----------------------|----------------------|
| Treatment 1 | Small class          | Small class          | Small class          | Small class          |
| Treatment 2 | Regular class + aide | Regular class + aide | Regular class + aide | Regular class + aide |
| Control     | Regular class        | Regular class        | Regular class        | Regular class        |

Table 13.1: Control and treatment groups in Project STAR
Each year, the students’ learning progress was assessed using the sum of the points scored on the math and reading parts of a standardized test (the Stanford Achievement Test).
The STAR data set is part of the package AER.
# load the package AER and the STAR dataset
library(AER)
data(STAR)
head(STAR) shows that there is a variety of factor variables describing student and teacher characteristics as well as various school indicators, all of which are recorded separately for the four different grades. The data set is in wide format: each variable has its own column, and for each student the rows contain observations on these variables. Using dim(STAR) we find that there are a total of 11598 observations on 47 variables.
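As an aside, the wide structure becomes apparent when the class type columns are stacked into long format, with one row per student-grade combination. A minimal sketch using base R’s reshape() (this step is purely illustrative and not needed for the analysis below; the object name star_long is ours):

# illustrative only: stack the four class type columns into long format,
# yielding one row per student-grade combination
star_long <- reshape(STAR[, c("stark", "star1", "star2", "star3")],
                     varying = c("stark", "star1", "star2", "star3"),
                     v.names = "star",
                     times = c("k", "1", "2", "3"),
                     direction = "long")
head(star_long)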
# get an overview
head(STAR, 2)
## gender ethnicity birth stark star1 star2 star3 readk read1 read2
## 1122 female afam 1979 Q3 <NA> <NA> <NA> regular NA NA NA
## 1137 female cauc 1980 Q1 small small small small 447 507 568
## read3 mathk math1 math2 math3 lunchk lunch1 lunch2 lunch3 schoolk
## 1122 580 NA NA NA 564 <NA> <NA> <NA> free <NA>
## 1137 587 473 538 579 593 non-free free non-free free rural
## school1 school2 school3 degreek degree1 degree2 degree3 ladderk
## 1122 <NA> <NA> suburban <NA> <NA> <NA> bachelor <NA>
## 1137 rural rural rural bachelor bachelor bachelor bachelor level1
## ladder1 ladder2 ladder3 experiencek experience1 experience2
## 1122 <NA> <NA> level1 NA NA NA
## 1137 level1 apprentice apprentice 7 7 3
## experience3 tethnicityk tethnicity1 tethnicity2 tethnicity3 systemk
## 1122 30 <NA> <NA> <NA> cauc <NA>
## 1137 1 cauc cauc cauc cauc 30
## system1 system2 system3 schoolidk schoolid1 schoolid2 schoolid3
## 1122 <NA> <NA> 22 <NA> <NA> <NA> 54
## 1137 30 30 30 63 63 63 63
dim(STAR)
## [1] 11598 47
# get variable names
names(STAR)
## [1] "gender" "ethnicity" "birth" "stark" "star1"
## [6] "star2" "star3" "readk" "read1" "read2"
## [11] "read3" "mathk" "math1" "math2" "math3"
## [16] "lunchk" "lunch1" "lunch2" "lunch3" "schoolk"
## [21] "school1" "school2" "school3" "degreek" "degree1"
## [26] "degree2" "degree3" "ladderk" "ladder1" "ladder2"
## [31] "ladder3" "experiencek" "experience1" "experience2" "experience3"
## [36] "tethnicityk" "tethnicity1" "tethnicity2" "tethnicity3" "systemk"
## [41] "system1" "system2" "system3" "schoolidk" "schoolid1"
## [46] "schoolid2" "schoolid3"
A majority of the variable names contain a suffix (k, 1, 2 or 3) indicating the grade to which the respective variable refers. This facilitates regression analysis because it allows us to adjust the formula argument in lm() for each grade by simply changing the variables’ suffixes accordingly.
The output produced by head() shows that some recorded values are NA, i.e., there is no observation of the respective variable for the student at hand. For example, consider the first observation, STAR[1, ]. In the output of head(STAR, 2) we find that this student entered the experiment in third grade in a regular class, which is why the class size is recorded in star3 and the class type indicators for the other grades are NA. The non-missing entries for this student can be displayed as follows:
# drop NA recordings for the first observation and print to the console
STAR[1, !is.na(STAR[1, ])]
## gender ethnicity birth star3 read3 math3 lunch3 school3 degree3
## 1122 female afam 1979 Q3 regular 580 564 free suburban bachelor
## ladder3 experience3 tethnicity3 system3 schoolid3
## 1122 level1 30 cauc 22 54
The subsetting works because is.na(STAR[1, ]) returns a logical vector with TRUE at the positions corresponding to NA entries for the first observation, so negating it with ! selects the columns with observed values.
In general it is not necessary to remove rows with missing data because lm() does so by default. Missing data may imply a small sample size and thus may lead to imprecise estimation and invalid inference. This is, however, not an issue for the study at hand: as we will see below, sample sizes exceed 5000 observations for each regression conducted.
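This claim is easy to verify by counting complete cases on the variables entering each grade’s regression (our own quick check; the counts match the Observations row of Table 13.2 below):

# count complete observations for each grade's regression variables
sapply(c("k", "1", "2", "3"), function(s) {
  sum(complete.cases(STAR[, paste0(c("read", "math", "star"), s)]))
})
##    k    1    2    3
## 5786 6379 6049 5967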
Analysis of the STAR Data
As can be seen from Table 13.1, there are two treatment groups in each grade: small classes with only 13 to 17 students, and regular classes with 22 to 25 students plus a teaching aide. Thus, two binary variables, each an indicator for the respective treatment group, are introduced so that the differences estimator captures the treatment effect of each group separately. This yields the population regression model \[\begin{align} Y_i = \beta_0 + \beta_1 SmallClass_i + \beta_2 RegAide_i + u_i, \tag{13.3} \end{align}\] with test score \(Y_i\), the small class indicator \(SmallClass_i\) and \(RegAide_i\), the indicator for a regular class with aide.
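Note that we do not need to construct these indicators manually: the class type is stored in the single factor variable stark (and analogously in star1 to star3), and lm() generates the two dummies automatically, treating the first factor level as the reference category. A quick way to see this:

# the order of the factor levels determines the dummy coding;
# 'regular' comes first and hence serves as the reference category
levels(STAR$stark)
## [1] "regular"      "small"        "regular+aide"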
We reproduce the results presented in Table 13.1 of the book by performing the regression (13.3) for each grade separately. For each student, the dependent variable is the sum of the points scored on the math and reading parts of the test. It is constructed within the formula using I(), which ensures that the + is treated as arithmetic addition rather than as the formula operator.
# compute differences estimates for each grade
fmk <- lm(I(readk + mathk) ~ stark, data = STAR)
fm1 <- lm(I(read1 + math1) ~ star1, data = STAR)
fm2 <- lm(I(read2 + math2) ~ star2, data = STAR)
fm3 <- lm(I(read3 + math3) ~ star3, data = STAR)
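The four calls differ only in the grade suffix, so the models could equally be generated in a loop. A compact sketch (the objects suffixes and fits are our own naming):

# the same four regressions, generated by looping over the grade suffixes
suffixes <- c("k", "1", "2", "3")
fits <- lapply(suffixes, function(s) {
  lm(as.formula(sprintf("I(read%s + math%s) ~ star%s", s, s, s)), data = STAR)
})
names(fits) <- suffixes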
# obtain coefficient matrix using robust standard errors
coeftest(fmk, vcov = vcovHC, type= "HC1")
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 918.04289 1.63339 562.0473 < 2.2e-16 ***
## starksmall 13.89899 2.45409 5.6636 1.554e-08 ***
## starkregular+aide 0.31394 2.27098 0.1382 0.8901
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
coeftest(fm1, vcov = vcovHC, type= "HC1")
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1039.3926 1.7846 582.4321 < 2.2e-16 ***
## star1small 29.7808 2.8311 10.5190 < 2.2e-16 ***
## star1regular+aide 11.9587 2.6520 4.5093 6.62e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
coeftest(fm2, vcov = vcovHC, type= "HC1")
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1157.8066 1.8151 637.8820 < 2.2e-16 ***
## star2small 19.3944 2.7117 7.1522 9.55e-13 ***
## star2regular+aide 3.4791 2.5447 1.3672 0.1716
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
coeftest(fm3, vcov = vcovHC, type= "HC1")
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1228.50636 1.68001 731.2483 < 2.2e-16 ***
## star3small 15.58660 2.39604 6.5051 8.393e-11 ***
## star3regular+aide -0.29094 2.27271 -0.1280 0.8981
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We gather the results and present them in a table using stargazer().
# compute robust standard errors for each model and gather them in a list
rob_se_1 <- list(sqrt(diag(vcovHC(fmk, type = "HC1"))),
                 sqrt(diag(vcovHC(fm1, type = "HC1"))),
                 sqrt(diag(vcovHC(fm2, type = "HC1"))),
                 sqrt(diag(vcovHC(fm3, type = "HC1"))))
library(stargazer)
stargazer(fmk, fm1, fm2, fm3,
          title = "Project STAR: Differences Estimates",
          header = FALSE,
          type = "latex",
          model.numbers = F,
          omit.table.layout = "n",
          digits = 3,
          column.labels = c("K", "1", "2", "3"),
          dep.var.caption = "Dependent Variable: Grade",
          dep.var.labels.include = FALSE,
          se = rob_se_1)
Dependent Variable: Grade

|                     | K                       | 1                       | 2                       | 3                       |
|---------------------|-------------------------|-------------------------|-------------------------|-------------------------|
| starksmall          | 13.90***                |                         |                         |                         |
|                     | (2.45)                  |                         |                         |                         |
| starkregular+aide   | 0.31                    |                         |                         |                         |
|                     | (2.27)                  |                         |                         |                         |
| star1small          |                         | 29.78***                |                         |                         |
|                     |                         | (2.83)                  |                         |                         |
| star1regular+aide   |                         | 11.96***                |                         |                         |
|                     |                         | (2.65)                  |                         |                         |
| star2small          |                         |                         | 19.39***                |                         |
|                     |                         |                         | (2.71)                  |                         |
| star2regular+aide   |                         |                         | 3.48                    |                         |
|                     |                         |                         | (2.54)                  |                         |
| star3small          |                         |                         |                         | 15.59***                |
|                     |                         |                         |                         | (2.40)                  |
| star3regular+aide   |                         |                         |                         | -0.29                   |
|                     |                         |                         |                         | (2.27)                  |
| Constant            | 918.04***               | 1,039.39***             | 1,157.81***             | 1,228.51***             |
|                     | (1.63)                  | (1.78)                  | (1.82)                  | (1.68)                  |
| Observations        | 5,786                   | 6,379                   | 6,049                   | 5,967                   |
| R2                  | 0.01                    | 0.02                    | 0.01                    | 0.01                    |
| Adjusted R2         | 0.01                    | 0.02                    | 0.01                    | 0.01                    |
| Residual Std. Error | 73.49 (df = 5783)       | 90.50 (df = 6376)       | 83.69 (df = 6046)       | 72.91 (df = 5964)       |
| F Statistic         | 21.26*** (df = 2; 5783) | 56.34*** (df = 2; 6376) | 28.71*** (df = 2; 6046) | 30.25*** (df = 2; 5964) |
Table 13.2: Project STAR - Differences Estimates
The estimates presented in Table 13.2 suggest that the class size reduction improves student performance. Except for grade 1, the estimates of the coefficient on \(SmallClass\) are roughly of the same magnitude (they lie between 13.90 and 19.39 points) and statistically significant at the \(1\%\) level. Furthermore, a teaching aide appears to have little, possibly zero, effect on the performance of the students.
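Robust 95% confidence intervals convey the same information. A sketch for the kindergarten model using coefci() from the package lmtest (attached with AER); the interval for starksmall follows from the table as roughly 13.9 ± 1.96 × 2.45:

# robust 95% confidence intervals for the kindergarten model;
# for starksmall this is about [9.1, 18.7] points
coefci(fmk, vcov. = vcovHC, type = "HC1")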
Following the book, we augment the regression model (13.3) by different sets of regressors for two reasons:
- If the additional regressors explain some of the observed variation in the dependent variable, we obtain more efficient estimates of the coefficients of interest.
- If the treatment is not received at random due to failures to follow the treatment protocol (see Chapter 13.3 of the book), the estimates obtained using (13.3) may be biased. Adding additional regressors may solve or mitigate this problem.
In particular, we consider the following student and teacher characteristics
- \(experience\) — Teacher’s years of experience
- \(boy\) — Student is a boy (dummy)
- \(lunch\) — Free lunch eligibility (dummy)
- \(black\) — Student is African-American (dummy)
- \(race\) — Student’s race is black or white (dummy; see the definition in the code below)
- \(\text{schoolid}\) — School indicator variables
Prior to estimation, we perform some subsetting and data wrangling using functions from the packages dplyr and tidyr. Both are part of the tidyverse, a collection of R packages designed for data science (see the official site for more on the tidyverse packages). The functions %>%, transmute() and mutate() are sufficient for our purposes here:
- %>% allows us to chain function calls.
- transmute() allows us to subset the data set by naming the variables to be kept.
- mutate() is convenient for adding new variables based on existing ones while preserving the existing ones.
The regression models (13.4) to (13.7) require the variables gender, ethnicity, stark, readk, mathk, lunchk, experiencek and schoolidk. After dropping the remaining variables using transmute(), we use mutate() to add three binary variables derived from existing ones: black, race and boy. They are generated using logical statements within the function ifelse().
# load packages 'dplyr' and 'tidyr' for data wrangling functionalities
library(dplyr)
library(tidyr)

# generate subset with kindergarten data
STARK <- STAR %>%
  transmute(gender,
            ethnicity,
            stark,
            readk,
            mathk,
            lunchk,
            experiencek,
            schoolidk) %>%
  mutate(black = ifelse(ethnicity == "afam", 1, 0),
         race = ifelse(ethnicity == "afam" | ethnicity == "cauc", 1, 0),
         boy = ifelse(gender == "male", 1, 0))
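For comparison, the same data frame can be constructed without the tidyverse. A base R sketch (the object names keep and STARK_base are ours):

# base R equivalent: keep the required columns, then add the derived dummies
keep <- c("gender", "ethnicity", "stark", "readk",
          "mathk", "lunchk", "experiencek", "schoolidk")
STARK_base <- STAR[, keep]
STARK_base$black <- ifelse(STARK_base$ethnicity == "afam", 1, 0)
STARK_base$race <- ifelse(STARK_base$ethnicity %in% c("afam", "cauc"), 1, 0)
STARK_base$boy <- ifelse(STARK_base$gender == "male", 1, 0)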
# estimate the models
gradeK1 <- lm(I(mathk + readk) ~ stark + experiencek,
              data = STARK)
gradeK2 <- lm(I(mathk + readk) ~ stark + experiencek + schoolidk,
              data = STARK)
gradeK3 <- lm(I(mathk + readk) ~ stark + experiencek + boy + lunchk
              + black + race + schoolidk,
              data = STARK)
For brevity, we exclude the coefficients for the indicator dummies in coeftest()’s output by subsetting the matrices.
# obtain robust inference on the significance of coefficients
coeftest(gradeK1, vcov. = vcovHC, type = "HC1")
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 904.72124 2.22235 407.1020 < 2.2e-16 ***
## starksmall 14.00613 2.44704 5.7237 1.095e-08 ***
## starkregular+aide -0.60058 2.25430 -0.2664 0.7899
## experiencek 1.46903 0.16929 8.6778 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
coeftest(gradeK2, vcov. = vcovHC, type = "HC1")[1:4, ]
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 925.6748750 7.6527218 120.9602155 0.000000e+00
## starksmall 15.9330822 2.2411750 7.1092540 1.310324e-12
## starkregular+aide 1.2151960 2.0353415 0.5970477 5.504993e-01
## experiencek 0.7431059 0.1697619 4.3773429 1.222880e-05
coeftest(gradeK3, vcov. = vcovHC, type = "HC1")[1:7, ]
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 937.6831330 14.3726687 65.2407117 0.000000e+00
## starksmall 15.8900507 2.1551817 7.3729516 1.908960e-13
## starkregular+aide 1.7869378 1.9614592 0.9110247 3.623211e-01
## experiencek 0.6627251 0.1659298 3.9940097 6.578846e-05
## boy -12.0905123 1.6726331 -7.2284306 5.533119e-13
## lunchkfree -34.7033021 1.9870366 -17.4648529 1.437931e-66
## black -25.4305130 3.4986918 -7.2685776 4.125252e-13
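Although their coefficients are excluded from the output, the school indicators can be tested for joint significance with a heteroskedasticity-robust F-test. A sketch using linearHypothesis() and matchCoefs() from the package car (attached with AER), where matchCoefs() collects all coefficients whose names contain "schoolidk":

# robust F-test of the joint hypothesis that all school fixed effects are zero
linearHypothesis(gradeK2,
                 matchCoefs(gradeK2, "schoolidk"),
                 vcov. = vcovHC(gradeK2, type = "HC1"))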
We now use stargazer() to gather all relevant information in a structured table. The coefficients on the school indicators are suppressed via stargazer()’s omit argument.
# compute robust standard errors for each model and gather them in a list
rob_se_2 <- list(sqrt(diag(vcovHC(fmk, type = "HC1"))),
                 sqrt(diag(vcovHC(gradeK1, type = "HC1"))),
                 sqrt(diag(vcovHC(gradeK2, type = "HC1"))),
                 sqrt(diag(vcovHC(gradeK3, type = "HC1"))))
stargazer(fmk, gradeK1, gradeK2, gradeK3,
          title = "Project STAR - Differences Estimates with
                   Additional Regressors for Kindergarten",
          header = FALSE,
          type = "latex",
          model.numbers = F,
          omit = "schoolidk",
          omit.table.layout = "n",
          digits = 3,
          column.labels = c("(1)", "(2)", "(3)", "(4)"),
          dep.var.caption = "Dependent Variable: Test Score in Kindergarten",
          dep.var.labels.include = FALSE,
          se = rob_se_2)
Dependent Variable: Test Score in Kindergarten

|                     | (1)                      | (2)                      | (3)                       | (4)                       |
|---------------------|--------------------------|--------------------------|---------------------------|---------------------------|
| starksmall          | 13.899***                | 14.006***                | 15.933***                 | 15.890***                 |
|                     | (2.454)                  | (2.447)                  | (2.241)                   | (2.155)                   |
| starkregular+aide   | 0.314                    | -0.601                   | 1.215                     | 1.787                     |
|                     | (2.271)                  | (2.254)                  | (2.035)                   | (1.961)                   |
| experiencek         |                          | 1.469***                 | 0.743***                  | 0.663***                  |
|                     |                          | (0.169)                  | (0.170)                   | (0.166)                   |
| boy                 |                          |                          |                           | -12.091***                |
|                     |                          |                          |                           | (1.673)                   |
| lunchkfree          |                          |                          |                           | -34.703***                |
|                     |                          |                          |                           | (1.987)                   |
| black               |                          |                          |                           | -25.431***                |
|                     |                          |                          |                           | (3.499)                   |
| race                |                          |                          |                           | 8.501                     |
|                     |                          |                          |                           | (12.520)                  |
| Constant            | 918.043***               | 904.721***               | 925.675***                | 937.683***                |
|                     | (1.633)                  | (2.222)                  | (7.653)                   | (14.373)                  |
| School indicators?  | no                       | no                       | yes                       | yes                       |
| Observations        | 5,786                    | 5,766                    | 5,766                     | 5,748                     |
| R2                  | 0.007                    | 0.020                    | 0.234                     | 0.291                     |
| Adjusted R2         | 0.007                    | 0.020                    | 0.223                     | 0.281                     |
| Residual Std. Error | 73.490 (df = 5783)       | 73.085 (df = 5762)       | 65.075 (df = 5684)        | 62.663 (df = 5662)        |
| F Statistic         | 21.263*** (df = 2; 5783) | 39.861*** (df = 3; 5762) | 21.413*** (df = 81; 5684) | 27.364*** (df = 85; 5662) |
Table 13.3: Project STAR - Differences Estimates with Additional Regressors for Kindergarten
The results in column (1) of Table 13.3 simply repeat the results obtained for (13.3). Columns (2) to (4) reveal that adding student characteristics and school fixed effects does not lead to substantially different estimates of the treatment effects. This makes it more plausible that the estimates obtained using model (13.3) do not suffer from a failure of random assignment. There is some decrease in the standard errors and some increase in \(\bar{R}^2\), implying that the estimates are more precise.
Because teachers were randomly assigned to classes, the inclusion of school fixed effects allows us to estimate the causal effect of a teacher’s experience on the test scores of students in kindergarten. Regression (3) predicts the average effect of 10 years of experience on test scores to be \(10 \cdot 0.74 = 7.4\) points. Be aware that the other estimates on student characteristics in regression (4) do not have a causal interpretation due to nonrandom assignment (see Chapter 13.3 of the book for a detailed discussion).
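The robust standard error scales the same way, so this prediction comes with a standard error of about 1.7 points. A quick computation:

# predicted effect of 10 additional years of teacher experience (model (3))
10 * coef(gradeK2)["experiencek"]                               # ~ 7.4 points
10 * sqrt(diag(vcovHC(gradeK2, type = "HC1")))["experiencek"]   # SE ~ 1.7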
Are the estimated effects presented in Table 13.3 large or small in a practical sense? Let us translate the predicted changes in test scores to units of standard deviation in order to allow for a comparison (see Section 9.4 for a similar argument).
# compute the sample standard deviations of test scores
SSD <- c("K" = sd(na.omit(STAR$readk + STAR$mathk)),
         "1" = sd(na.omit(STAR$read1 + STAR$math1)),
         "2" = sd(na.omit(STAR$read2 + STAR$math2)),
         "3" = sd(na.omit(STAR$read3 + STAR$math3)))

# translate the effects of small classes to standard deviations
Small <- c("K" = as.numeric(coef(fmk)[2] / SSD[1]),
           "1" = as.numeric(coef(fm1)[2] / SSD[2]),
           "2" = as.numeric(coef(fm2)[2] / SSD[3]),
           "3" = as.numeric(coef(fm3)[2] / SSD[4]))

# adjust the standard errors
SmallSE <- c("K" = as.numeric(rob_se_1[[1]][2] / SSD[1]),
             "1" = as.numeric(rob_se_1[[2]][2] / SSD[2]),
             "2" = as.numeric(rob_se_1[[3]][2] / SSD[3]),
             "3" = as.numeric(rob_se_1[[4]][2] / SSD[4]))

# translate the effects of regular classes with aide to standard deviations
RegAide <- c("K" = as.numeric(coef(fmk)[3] / SSD[1]),
             "1" = as.numeric(coef(fm1)[3] / SSD[2]),
             "2" = as.numeric(coef(fm2)[3] / SSD[3]),
             "3" = as.numeric(coef(fm3)[3] / SSD[4]))

# adjust the standard errors
RegAideSE <- c("K" = as.numeric(rob_se_1[[1]][3] / SSD[1]),
               "1" = as.numeric(rob_se_1[[2]][3] / SSD[2]),
               "2" = as.numeric(rob_se_1[[3]][3] / SSD[3]),
               "3" = as.numeric(rob_se_1[[4]][3] / SSD[4]))

# gather the results in a data.frame and round
df <- t(round(data.frame(Small, SmallSE, RegAide, RegAideSE, SSD),
              digits = 2))
It is fairly easy to turn the data.frame df into a table.
# generate a simple table using stargazer
stargazer(df,
          title = "Estimated Class Size Effects
                   (in Units of Standard Deviations)",
          type = "html",
          summary = FALSE,
          header = FALSE)
|           | K      | 1      | 2      | 3      |
|-----------|--------|--------|--------|--------|
| Small     | 0.190  | 0.330  | 0.230  | 0.210  |
| SmallSE   | 0.030  | 0.030  | 0.030  | 0.030  |
| RegAide   | 0      | 0.130  | 0.040  | 0      |
| RegAideSE | 0.030  | 0.030  | 0.030  | 0.030  |
| SSD       | 73.750 | 91.280 | 84.080 | 73.270 |
Table 13.4: Estimated Class Size Effects (in Units of Standard Deviations)
The estimated effect of a small class is largest for grade 1. As pointed out in the book, this is probably because students in the control group for grade 1 did poorly on the test for some unknown reason, or simply due to random variation. The difference between the estimated effect of being in a small class and the estimated effect of being in a regular class with an aide is roughly 0.2 standard deviations for all grades. This leads to the conclusion that the effect of being in a regular-sized class with an aide is zero and the effect of being in a small class is roughly the same for all grades.
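This difference is straightforward to compute from the unrounded vectors constructed above:

# gap between the small class and regular+aide effects, in units of SD
round(Small - RegAide, 2)  # close to 0.2 for every grade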
The remainder of Chapter 13.3 in the book discusses to what extent these experimental estimates are comparable with observational estimates obtained using data on school districts in California and Massachusetts in Chapter 9. It turns out that the estimates are indeed very similar. Please refer to the aforementioned section in the book for a more detailed discussion.