Chapter 8 Step Functions
8.1 Step Functions
Step functions use cut-points c1,c2,…,cK in the range of X in order to construct K+1 dummy variables in the following way: C0(X)=I(X≤c1)C1(X)=I(c1<X≤c2) ⋮CK−1(X)=I(cK−1<X≤cK)CK(X)=I(cK<X) Above I(⋅) is the indicator function. For example I(c1<X≤c2)=1 if c1<X≤c2 and equals 0 otherwise.
Note: trivially we have that: C0(X)+C1(X)+…+CK(X)=1 since X must be in exactly one of the K+1 intervals.
After defining appropriate intervals and constructing the corresponding dummy variables we fit LS to the following model: y=f(x)=β0+β1C1(x)+β2C2(x)+…+βKCK(x)+ϵ. Inclusion of C0(x) is not needed because this effect is represented by β0.
This means that if X≤c1 our prediction is β0, while if cj<X≤cj+1 we predict β0+βj for j=1,…,K. In other words, β0 can be interpreted as the mean value of Y for X≤c1, and βj represents the average increase in the response for X in cj<X≤cj+1 relative to X≤c1.
8.1.1 Example: Wage Data
We apply step-function regression to model the Wage
data introduced in Section 7.2.3, with different numbers of step functions (Figure 8.1.
We can see that this modelling approach is local: new samples within a sampling interval affect only the predicted effect within that interval.
We see that this approach creates a bumpy fit to the data. This is desirable sometimes in applications in biomedicine or social sciences where the goal is to report overall summaries within specific age groups.
In other applications smoothness is preferable.

Figure 8.1: Step function regression with different numbers of cutpoints applied to the Wage data.
8.2 Practical Demonstration
We once again make use of the Boston
dataset in library MASS
.
library("MASS")
= Boston$medv
y = Boston$lstat
x = 'Median Property Value'
y.lab = 'Lower Status (%)' x.lab
For step function regression we can make use of the command cut()
, which automatically assigns samples to intervals given a specific number of intervals. We can check how this works by executing the following line of code.
table(cut(x, 2))
##
## (1.69,19.9] (19.9,38]
## 430 76
What we see is that cut(x, 2)
automatically created a factor with two levels, corresponding to the intervals (1.69,19.9] and (19.9,38], and assigned each entry in x to one of these factors depending on which interval it was in. The command table()
tells us that 430 samples of x
fall within the first interval and that 76 samples fall within the second interval. Note that cut(x, 2)
generated 2 intervals, but this means there is only 1 cutpoint (at 19.9). The number of cutpoints is naturally one less than the number of intervals, but it is important to be aware that cut
requires specification of the number of required intervals.
So, we can use cut()
within lm()
to easily fit regression models with step functions. Below we consider 4 models with 1, 2, 3 and 4 cutpoints (2, 3, 4 and 5 intervals) respectively.
= lm(y ~ cut(x, 2))
step2 = lm(y ~ cut(x, 3))
step3 = lm(y ~ cut(x, 4))
step4 = lm(y ~ cut(x, 5)) step5
The analysis then is essentially the same as previously. The following code produces the 2 by 2 plot of the fitted lines of the four models, along with approximate 95% confidence intervals for the mean predictions.
= predict(step2, newdata = list(x = sort(x)), se = TRUE)
pred2 = predict(step3, newdata = list(x = sort(x)), se = TRUE)
pred3 = predict(step4, newdata = list(x = sort(x)), se = TRUE)
pred4 = predict(step5, newdata = list(x = sort(x)), se = TRUE)
pred5
= cbind(pred2$fit + 2*pred2$se.fit, pred2$fit-2*pred2$se.fit)
se.bands2 = cbind(pred3$fit + 2*pred3$se.fit, pred3$fit-2*pred3$se.fit)
se.bands3 = cbind(pred4$fit + 2*pred4$se.fit, pred4$fit-2*pred4$se.fit)
se.bands4 = cbind(pred5$fit + 2*pred5$se.fit, pred5$fit-2*pred5$se.fit)
se.bands5
par(mfrow = c(2,2))
plot(x, y, cex.lab = 1.1, col="darkgrey", xlab = x.lab, ylab = y.lab,
main = "1 cutpoint", bty = 'l')
lines(sort(x), pred2$fit, lwd = 2, col = "red")
matlines(sort(x), se.bands2, lwd = 1.4, col = "red", lty = 3)
plot(x, y, cex.lab = 1.1, col="darkgrey", xlab = x.lab, ylab = y.lab,
main = "2 cutpoints", bty = 'l')
lines(sort(x), pred3$fit, lwd = 2, col = "darkviolet")
matlines(sort(x), se.bands3, lwd = 1.4, col = "darkviolet", lty = 3)
plot(x, y, cex.lab = 1.1, col="darkgrey", xlab = x.lab, ylab = y.lab,
main = "3 cutpoints", bty = 'l')
lines(sort(x), pred4$fit, lwd = 2, col = "blue")
matlines(sort(x), se.bands4, lwd = 1.4, col = "blue", lty = 3)
plot(x, y, cex.lab = 1.1, col="darkgrey", xlab = x.lab, ylab = y.lab,
main = "4 cutpoints", bty = 'l')
lines(sort(x), pred5$fit, lwd = 2, col = "black")
matlines(sort(x), se.bands5, lwd = 1.4, col = "black", lty = 3)
Note that we do not necessarily need to rely on the automatic selections of cutpoints used by cut()
. We can define the intervals if we want to. For instance, if we want cutpoints at 10, 20 and 30 we can do the following
= c(min(x), 10, 20, 30, max(x))
breaks4 table(cut(x, breaks = breaks4))
##
## (1.73,10] (10,20] (20,30] (30,38]
## 218 213 62 12
Note that when defining the sequence of breaks, we included min(x)
and max(x)
at the start and end in order to ensure the intervals covered the entire range of x
.
Then our model would be:
= lm(y ~ cut(x, breaks = breaks4))
step.new4 summary(step.new4)
##
## Call:
## lm(formula = y ~ cut(x, breaks = breaks4))
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.4803 -4.6239 -0.4239 2.8968 20.6197
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.3803 0.4415 66.540 <2e-16 ***
## cut(x, breaks = breaks4)(10,20] -10.4563 0.6281 -16.648 <2e-16 ***
## cut(x, breaks = breaks4)(20,30] -16.6770 0.9383 -17.773 <2e-16 ***
## cut(x, breaks = breaks4)(30,38] -18.6886 1.9331 -9.668 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.519 on 501 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.4925, Adjusted R-squared: 0.4895
## F-statistic: 162.1 on 3 and 501 DF, p-value: < 2.2e-16
Although somewhat trivial, we can make predictions at new data points using the constructed linear model as usual.
<- c(10.56, 5.89)
newx = predict(step.new4, newdata = list(x = newx), se = TRUE)
preds preds
## $fit
## 1 2
## 18.92394 29.38028
##
## $se.fit
## 1 2
## 0.4466955 0.4415432
##
## $df
## [1] 501
##
## $residual.scale
## [1] 6.519307