3.1 Review on parametric regression
We now review a couple of useful parametric regression models that will be used in the construction of nonparametric regression models.
3.1.1 Linear regression
Model formulation and least squares
The multiple linear regression model employs multiple predictors $X_1,\ldots,X_p$⁹ to explain a single response $Y$ by assuming that a linear relation of the form
$$Y=\beta_0+\beta_1X_1+\cdots+\beta_pX_p+\varepsilon \tag{3.1}$$
holds between the predictors $X_1,\ldots,X_p$ and the response $Y$. In (3.1), $\beta_0$ is the intercept, $\beta_1,\ldots,\beta_p$ are the slopes, and $\varepsilon$ is a random variable with mean zero that is independent of $X_1,\ldots,X_p$. Another way of looking at (3.1) is
$$\mathbb{E}[Y\,|\,X_1=x_1,\ldots,X_p=x_p]=\beta_0+\beta_1x_1+\cdots+\beta_px_p, \tag{3.2}$$
since $\mathbb{E}[\varepsilon\,|\,X_1=x_1,\ldots,X_p=x_p]=0$. Therefore, the mean of $Y$ changes linearly with the values of $X_1,\ldots,X_p$. Hence the interpretation of the coefficients:
- $\beta_0$: the mean of $Y$ when $X_1=\cdots=X_p=0$.
- $\beta_j$, $1\leq j\leq p$: the additive increment in the mean of $Y$ for an increment of one unit in $X_j=x_j$, provided that the remaining variables do not change.
Figure 3.1 illustrates the geometrical interpretation of a multiple linear model: a plane in the $(p+1)$-dimensional space. If $p=1$, the plane is the regression line for simple linear regression. If $p=2$, then the plane can be visualized in a three-dimensional plot.

Figure 3.1: The regression plane (blue) of $Y$ on $X_1$ and $X_2$, and its relation with the regression lines (green lines) of $Y$ on $X_1$ (left) and $Y$ on $X_2$ (right). The red points represent the sample for $(X_1,X_2,Y)$ and the black points the subsamples for $(X_1,X_2)$ (bottom), $(X_1,Y)$ (left), and $(X_2,Y)$ (right). Note that the regression plane is not a direct extension of the marginal regression lines.
The estimation of $\beta_0,\beta_1,\ldots,\beta_p$ is done by minimizing the so-called Residual Sum of Squares (RSS). First we need to introduce some helpful matrix notation:
A sample of $(X_1,\ldots,X_p,Y)$ is denoted by $(X_{11},\ldots,X_{1p},Y_1),\ldots,(X_{n1},\ldots,X_{np},Y_n)$, where $X_{ij}$ denotes the $i$-th observation of the $j$-th predictor $X_j$. We denote by $\mathbf{X}_i=(X_{i1},\ldots,X_{ip})$ the $i$-th observation of $(X_1,\ldots,X_p)$, so the sample simplifies to $(\mathbf{X}_1,Y_1),\ldots,(\mathbf{X}_n,Y_n)$.
The design matrix contains all the information of the predictors and a column of ones:
$$\mathbf{X}=\begin{pmatrix}1 & X_{11} & \cdots & X_{1p}\\ \vdots & \vdots & \ddots & \vdots\\ 1 & X_{n1} & \cdots & X_{np}\end{pmatrix}_{n\times(p+1)}.$$
The vector of responses $\mathbf{Y}$, the vector of coefficients $\boldsymbol{\beta}$, and the vector of errors are, respectively,
$$\mathbf{Y}=\begin{pmatrix}Y_1\\ \vdots\\ Y_n\end{pmatrix}_{n\times 1},\quad \boldsymbol{\beta}=\begin{pmatrix}\beta_0\\ \beta_1\\ \vdots\\ \beta_p\end{pmatrix}_{(p+1)\times 1},\quad\text{and}\quad \boldsymbol{\varepsilon}=\begin{pmatrix}\varepsilon_1\\ \vdots\\ \varepsilon_n\end{pmatrix}_{n\times 1}.$$
Thanks to the matrix notation, we can turn the sample version of the multiple linear model, namely
$$Y_i=\beta_0+\beta_1X_{i1}+\cdots+\beta_pX_{ip}+\varepsilon_i,\quad i=1,\ldots,n,$$
into something as compact as
$$\mathbf{Y}=\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\varepsilon}.$$
The RSS for the multiple linear regression is
$$\mathrm{RSS}(\boldsymbol{\beta}):=\sum_{i=1}^n(Y_i-\beta_0-\beta_1X_{i1}-\cdots-\beta_pX_{ip})^2=(\mathbf{Y}-\mathbf{X}\boldsymbol{\beta})'(\mathbf{Y}-\mathbf{X}\boldsymbol{\beta}). \tag{3.3}$$
The RSS aggregates the squared vertical distances from the data to a regression plane given by $\boldsymbol{\beta}$. Note that vertical distances are considered because we want to minimize the error in the prediction of $Y$. Thus, the treatment of the variables is not symmetrical¹⁰; see Figure 3.2. The least squares estimators are the minimizers of the RSS:
$$\hat{\boldsymbol{\beta}}:=\arg\min_{\boldsymbol{\beta}\in\mathbb{R}^{p+1}}\mathrm{RSS}(\boldsymbol{\beta}).$$
Luckily, thanks to the matrix form of (3.3), it is simple to compute a closed-form expression for the least squares estimates:
$$\hat{\boldsymbol{\beta}}=(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}. \tag{3.4}$$
Exercise 3.1 $\hat{\boldsymbol{\beta}}$ can be obtained by differentiating (3.3). Prove it using that $\frac{\partial \mathbf{A}\mathbf{x}}{\partial \mathbf{x}}=\mathbf{A}$ and $\frac{\partial f(\mathbf{x})'g(\mathbf{x})}{\partial \mathbf{x}}=f(\mathbf{x})'\frac{\partial g(\mathbf{x})}{\partial \mathbf{x}}+g(\mathbf{x})'\frac{\partial f(\mathbf{x})}{\partial \mathbf{x}}$ for two vector-valued functions $f$ and $g$.
Figure 3.2: The least squares regression plane $y=\hat\beta_0+\hat\beta_1x_1+\hat\beta_2x_2$ and its dependence on the kind of squared distance considered. Application also available here.
Let’s check that indeed the coefficients given by R’s lm are the ones given by (3.4) in a toy linear model.
# Create the data employed in Figure 3.1

# Generates 50 points from a N(0, 1): predictors and error
set.seed(34567)
x1 <- rnorm(50)
x2 <- rnorm(50)
x3 <- x1 + rnorm(50, sd = 0.05) # Make variables dependent
eps <- rnorm(50)

# Responses
yLin <- -0.5 + 0.5 * x1 + 0.5 * x2 + eps
yQua <- -0.5 + x1^2 + 0.5 * x2 + eps
yExp <- -0.5 + 0.5 * exp(x2) + x3 + eps

# Data
dataAnimation <- data.frame(x1 = x1, x2 = x2, yLin = yLin,
                            yQua = yQua, yExp = yExp)
# Call lm
# lm employs formula = response ~ predictor1 + predictor2 + ...
# (names according to the data frame names) for denoting the regression
# to be done
mod <- lm(yLin ~ x1 + x2, data = dataAnimation)
summary(mod)
##
## Call:
## lm(formula = yLin ~ x1 + x2, data = dataAnimation)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -2.37003 -0.54305  0.06741  0.75612  1.63829
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -0.5703     0.1302  -4.380 6.59e-05 ***
## x1            0.4833     0.1264   3.824 0.000386 ***
## x2            0.3215     0.1426   2.255 0.028831 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9132 on 47 degrees of freedom
## Multiple R-squared: 0.276, Adjusted R-squared: 0.2452
## F-statistic: 8.958 on 2 and 47 DF, p-value: 0.0005057
# mod is a list with a lot of information
# str(mod) # Long output
# Coefficients
mod$coefficients
## (Intercept)          x1          x2
##  -0.5702694   0.4832624   0.3214894
# Application of formula (3.4)
# Matrix X
X <- cbind(1, x1, x2)

# Vector Y
Y <- yLin

# Coefficients
beta <- solve(t(X) %*% X) %*% t(X) %*% Y
beta
##          [,1]
##    -0.5702694
## x1  0.4832624
## x2  0.3214894
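As a complementary check, not in the original code, one can also verify numerically that $\hat{\boldsymbol{\beta}}$ is indeed the minimizer of the RSS (3.3), for example with R's general-purpose optimizer optim. The following is only a sketch that reuses X, Y, and beta from the previous chunk.
# Numerical minimization of the RSS (3.3) -- a sketch reusing X, Y, and beta
rss <- function(b) sum((Y - X %*% b)^2)
opt <- optim(par = c(0, 0, 0), fn = rss, method = "BFGS")
rbind(optim = opt$par, exact = as.vector(beta)) # Should agree up to tolerance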
Exercise 3.2 Compute $\hat{\boldsymbol{\beta}}$ for the regressions yLin ~ x1 + x2, yQua ~ x1 + x2, and yExp ~ x2 + x3 using equation (3.4) and the function lm. Check that the fitted plane and the coefficient estimates are coherent.
Once we have the least squares estimates $\hat{\boldsymbol{\beta}}$, we can define the next two concepts:
The fitted values $\hat{Y}_1,\ldots,\hat{Y}_n$, where
$$\hat{Y}_i:=\hat\beta_0+\hat\beta_1X_{i1}+\cdots+\hat\beta_pX_{ip},\quad i=1,\ldots,n.$$
They are the vertical projections of $Y_1,\ldots,Y_n$ onto the fitted plane (see Figure 3.2). In matrix form, inputting (3.4),
$$\hat{\mathbf{Y}}=\mathbf{X}\hat{\boldsymbol{\beta}}=\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}=\mathbf{H}\mathbf{Y},$$
where $\mathbf{H}:=\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'$ is called the hat matrix because it “puts the hat on $\mathbf{Y}$.” What it does is to project $\mathbf{Y}$ onto the regression plane (see Figure 3.2).
The residuals (or estimated errors) $\hat\varepsilon_1,\ldots,\hat\varepsilon_n$, where
$$\hat\varepsilon_i:=Y_i-\hat{Y}_i,\quad i=1,\ldots,n.$$
They are the vertical distances between the actual data and the fitted data; both quantities are computed explicitly in the short check below.
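As a quick numerical illustration, and assuming that mod, X, and Y from the previous chunks are still in the workspace, the hat matrix, fitted values, and residuals can be computed by hand and compared with the ones stored by lm. This is only a sketch complementing the original code.
# Hat matrix, fitted values, and residuals by hand -- a sketch reusing X, Y, mod
H <- X %*% solve(t(X) %*% X) %*% t(X) # Hat matrix
YHat <- H %*% Y # Fitted values
head(cbind(YHat, mod$fitted.values)) # Both columns coincide
head(cbind(Y - YHat, mod$residuals)) # Same for the residuals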
Model assumptions
Up to now, we have not made any probabilistic assumption on the data generation process. $\hat{\boldsymbol{\beta}}$ was derived from purely geometrical arguments, not probabilistic ones. However, some probabilistic assumptions are required for inferring the unknown population coefficients $\boldsymbol{\beta}$ from the sample $(\mathbf{X}_1,Y_1),\ldots,(\mathbf{X}_n,Y_n)$.
The assumptions of the multiple linear model are:
- Linearity: $\mathbb{E}[Y\,|\,X_1=x_1,\ldots,X_p=x_p]=\beta_0+\beta_1x_1+\cdots+\beta_px_p$.
- Homoscedasticity: $\mathrm{Var}[\varepsilon_i]=\sigma^2$, with $\sigma^2$ constant for $i=1,\ldots,n$.
- Normality: $\varepsilon_i\sim\mathcal{N}(0,\sigma^2)$ for $i=1,\ldots,n$.
- Independence of the errors: $\varepsilon_1,\ldots,\varepsilon_n$ are independent (or uncorrelated, $\mathbb{E}[\varepsilon_i\varepsilon_j]=0$, $i\neq j$, since they are assumed to be normal).
A good one-line summary of the linear model is the following (independence is assumed):
$$Y\,|\,(X_1=x_1,\ldots,X_p=x_p)\sim\mathcal{N}(\beta_0+\beta_1x_1+\cdots+\beta_px_p,\sigma^2). \tag{3.5}$$

Figure 3.3: The key concepts of the simple linear model. The blue densities denote the conditional density of $Y$ for each cut in the $X$ axis. The yellow band denotes where 95% of the data is, according to the model. The red points represent data following the model.
Inference on the parameters $\boldsymbol{\beta}$ and $\sigma$ can be done, conditionally¹¹ on $\mathbf{X}_1,\ldots,\mathbf{X}_n$, from (3.5). We do not explore this further and refer the interested reader to these notes. Instead, we highlight the connection between least squares estimation and the maximum likelihood estimator derived from (3.5).
First, note that (3.5) is the population version of the linear model (it is expressed in terms of the random variables, not in terms of their samples). The sample version that summarizes assumptions i–iv is
$$\mathbf{Y}\,|\,(X_1,\ldots,X_p)\sim\mathcal{N}_n(\mathbf{X}\boldsymbol{\beta},\sigma^2\mathbf{I}).$$
Using this result, it is easy to obtain the log-likelihood function of $Y_1,\ldots,Y_n$ conditionally¹² on $\mathbf{X}_1,\ldots,\mathbf{X}_n$ as
$$\ell(\boldsymbol{\beta})=\log\phi_{\sigma^2\mathbf{I}}(\mathbf{Y}-\mathbf{X}\boldsymbol{\beta})=\sum_{i=1}^n\log\phi_\sigma(Y_i-(\mathbf{X}\boldsymbol{\beta})_i). \tag{3.6}$$
Finally, the next result justifies the consideration of the least squares estimate: it equals the maximum likelihood estimator derived under assumptions i–iv.
Theorem 3.1 Under assumptions i–iv, the maximum likelihood estimate of $\boldsymbol{\beta}$ is the least squares estimate (3.4):
$$\hat{\boldsymbol{\beta}}_{\mathrm{ML}}=\arg\max_{\boldsymbol{\beta}\in\mathbb{R}^{p+1}}\ell(\boldsymbol{\beta})=(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}.$$
Proof. Expanding the first equality in (3.6) gives (using $|\sigma^2\mathbf{I}|^{1/2}=\sigma^n$)
$$\ell(\boldsymbol{\beta})=-\log\left((2\pi)^{n/2}\sigma^n\right)-\frac{1}{2\sigma^2}(\mathbf{Y}-\mathbf{X}\boldsymbol{\beta})'(\mathbf{Y}-\mathbf{X}\boldsymbol{\beta}).$$
Optimizing $\ell$ does not require knowledge of $\sigma^2$, since differentiating with respect to $\boldsymbol{\beta}$ and equating to zero gives (see Exercise 3.1) $\frac{1}{\sigma^2}(\mathbf{Y}-\mathbf{X}\boldsymbol{\beta})'\mathbf{X}=\mathbf{0}$. Solving for $\boldsymbol{\beta}$ yields the result.
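This equivalence can also be checked numerically. The following sketch, not part of the original code, evaluates the Gaussian log-likelihood (3.6) at the least squares estimate, plugging in the maximum likelihood variance estimate $\hat\sigma^2=\mathrm{RSS}(\hat{\boldsymbol{\beta}})/n$, and compares it with the value reported by logLik for the lm fit; both numbers should coincide.
# Gaussian log-likelihood at the least squares estimate -- a sketch reusing
# X, Y, beta, and mod from the previous chunks
res <- Y - X %*% beta
sigma2Hat <- sum(res^2) / length(Y) # ML estimate of sigma^2
sum(dnorm(res, mean = 0, sd = sqrt(sigma2Hat), log = TRUE))
logLik(mod) # Same value, as given by lm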
3.1.2 Logistic regression
Model formulation
When the response $Y$ can take only two values, codified for convenience as $1$ (success) and $0$ (failure), it is called a binary variable. A binary variable, also known as a Bernoulli variable, is a $\mathrm{B}(1,p)$. Recall that $\mathbb{E}[\mathrm{B}(1,p)]=\mathbb{P}[\mathrm{B}(1,p)=1]=p$.
If $Y$ is a binary variable and $X_1,\ldots,X_p$ are predictors associated with $Y$, the purpose of logistic regression is to estimate
$$p(x_1,\ldots,x_p):=\mathbb{P}[Y=1\,|\,X_1=x_1,\ldots,X_p=x_p]=\mathbb{E}[Y\,|\,X_1=x_1,\ldots,X_p=x_p], \tag{3.7}$$
that is, how the probability of $Y=1$ changes according to particular values, denoted by $x_1,\ldots,x_p$, of the predictors $X_1,\ldots,X_p$. A tempting possibility is to consider a linear model for (3.7), $p(x_1,\ldots,x_p)=\beta_0+\beta_1x_1+\cdots+\beta_px_p$. However, such a model inevitably runs into serious problems: negative probabilities and probabilities larger than one will arise.
A solution is to consider a function that encapsulates the value of $z=\beta_0+\beta_1x_1+\cdots+\beta_px_p$, in $\mathbb{R}$, and maps it back to $[0,1]$. There are several alternatives to do so, based on distribution functions $F:\mathbb{R}\longrightarrow[0,1]$ that deliver $y=F(z)\in[0,1]$. Different choices of $F$ give rise to different models, the most common being the logistic distribution function:
$$\mathrm{logistic}(z):=\frac{e^z}{1+e^z}=\frac{1}{1+e^{-z}}.$$
Its inverse, $F^{-1}:[0,1]\longrightarrow\mathbb{R}$, known as the logit function, is
$$\mathrm{logit}(p):=\mathrm{logistic}^{-1}(p)=\log\frac{p}{1-p}.$$
This is a link function, that is, a function that maps a given space (in this case $[0,1]$) into $\mathbb{R}$. The term link function is employed in generalized linear models, which follow exactly the same philosophy as logistic regression – mapping the domain of $Y$ into $\mathbb{R}$ in order to apply there a linear model. As said, different link functions are possible, but we will concentrate here exclusively on the logit link.
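In R, the logistic and logit functions are directly available as the distribution and quantile functions of the standard logistic distribution, plogis and qlogis. A quick sketch, complementary to the text, verifies that they match the formulas above and are inverses of each other:
# logistic() and logit() correspond to plogis() and qlogis() in R -- a sketch
z <- seq(-3, 3, by = 0.5)
p <- plogis(z) # logistic(z) = 1 / (1 + exp(-z))
max(abs(p - 1 / (1 + exp(-z)))) # Essentially zero
max(abs(qlogis(p) - z)) # logit(logistic(z)) = z, up to rounding error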
The logistic model is defined as the following parametric form for (3.7):
$$p(x_1,\ldots,x_p)=\mathrm{logistic}(\beta_0+\beta_1x_1+\cdots+\beta_px_p)=\frac{1}{1+e^{-(\beta_0+\beta_1x_1+\cdots+\beta_px_p)}}. \tag{3.8}$$
The linear form inside the exponent has a clear interpretation:
- If $\beta_0+\beta_1x_1+\cdots+\beta_px_p=0$, then $p(x_1,\ldots,x_p)=\frac{1}{2}$ ($Y=1$ and $Y=0$ are equally likely).
- If $\beta_0+\beta_1x_1+\cdots+\beta_px_p<0$, then $p(x_1,\ldots,x_p)<\frac{1}{2}$ ($Y=1$ is less likely).
- If $\beta_0+\beta_1x_1+\cdots+\beta_px_p>0$, then $p(x_1,\ldots,x_p)>\frac{1}{2}$ ($Y=1$ is more likely).
To be more precise about the interpretation of the coefficients $\beta_0,\ldots,\beta_p$ we need to introduce the concept of odds. The odds is an equivalent way of expressing the distribution of probabilities of a binary variable. Since $\mathbb{P}[Y=1]=p$ and $\mathbb{P}[Y=0]=1-p$, both the success and failure probabilities can be inferred from $p$. Instead of using $p$ to characterize the distribution of $Y$, we can use
$$\mathrm{odds}(Y)=\frac{p}{1-p}=\frac{\mathbb{P}[Y=1]}{\mathbb{P}[Y=0]}. \tag{3.9}$$
The odds is the ratio between the probability of success and the probability of failure. It is extensively used in betting due to its better interpretability. For example, if a horse $Y$ has a probability $p=2/3$ of winning a race ($Y=1$), then the odds of the horse is
$$\mathrm{odds}=\frac{p}{1-p}=\frac{2/3}{1/3}=2.$$
This means that the probability of the horse winning is twice as large as the probability of losing. This is sometimes written as $2:1$ or $2\times 1$ (spelled “two-to-one”). Conversely, if the odds of $Y$ is given, we can easily recover the probability of success $p$ using the inverse of (3.9):
$$p=\mathbb{P}[Y=1]=\frac{\mathrm{odds}(Y)}{1+\mathrm{odds}(Y)}.$$
For example, if the odds of the horse were $5$, that would correspond to a probability of winning $p=5/6$.
Remark. Recall that the odds is a number in $[0,+\infty]$. The $0$ and $+\infty$ values are attained for $p=0$ and $p=1$, respectively. The log-odds (or logit) is a number in $[-\infty,+\infty]$.
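The conversions between probabilities and odds are easy to reproduce in R. A minimal sketch of the horse example above:
# Converting between probability and odds -- a sketch of the horse example
p <- 2 / 3
odds <- p / (1 - p) # Equals 2, i.e., "two-to-one"
odds / (1 + odds) # Back to p = 2/3
5 / (1 + 5) # An odds of 5 corresponds to p = 5/6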
We can rewrite (3.8) in terms of the odds (3.9). If we do so, we have
$$\mathrm{odds}(Y\,|\,X_1=x_1,\ldots,X_p=x_p)=\frac{p(x_1,\ldots,x_p)}{1-p(x_1,\ldots,x_p)}=e^{\beta_0+\beta_1x_1+\cdots+\beta_px_p}=e^{\beta_0}e^{\beta_1x_1}\cdots e^{\beta_px_p}.$$
This provides the following interpretation of the coefficients:
- $e^{\beta_0}$: the odds of $Y=1$ when $X_1=\cdots=X_p=0$.
- $e^{\beta_j}$, $1\leq j\leq p$: the multiplicative increment of the odds for an increment of one unit in $X_j=x_j$, provided that the remaining variables do not change. If the increment in $X_j$ is of $r$ units, then the multiplicative increment in the odds is $(e^{\beta_j})^r$; see the short numerical check after this list.
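The multiplicative effect on the odds can be checked numerically. The sketch below uses hypothetical coefficients $\beta_0=-0.5$ and $\beta_1=0.5$ for a single predictor (chosen only for illustration, not taken from any fit in these notes):
# One-unit increments in x1 multiply the odds by exp(beta1) -- a sketch with
# hypothetical coefficients beta0 = -0.5 and beta1 = 0.5
beta0 <- -0.5
beta1 <- 0.5
pLog <- function(x1) 1 / (1 + exp(-(beta0 + beta1 * x1)))
oddsLog <- function(x1) pLog(x1) / (1 - pLog(x1))
oddsLog(2) / oddsLog(1) # Equals exp(beta1)
exp(beta1)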
Model assumptions and estimation
Some probabilistic assumptions are required for performing inference on the model parameters $\boldsymbol{\beta}$ from the sample $(\mathbf{X}_1,Y_1),\ldots,(\mathbf{X}_n,Y_n)$. These assumptions are somewhat simpler than the ones for linear regression.

Figure 3.4: The key concepts of the logistic model. The blue bars represent the conditional probability distribution of $Y$ for each cut in the $X$ axis. The red points represent data following the model.
The assumptions of the logistic model are the following:
- Linearity in the logit¹³: $\mathrm{logit}(p(\mathbf{x}))=\log\frac{p(\mathbf{x})}{1-p(\mathbf{x})}=\beta_0+\beta_1x_1+\cdots+\beta_px_p$.
- Binariness: $Y_1,\ldots,Y_n$ are binary variables.
- Independence: $Y_1,\ldots,Y_n$ are independent.
A good one-line summary of the logistic model is the following (independence is assumed):
$$Y\,|\,(X_1=x_1,\ldots,X_p=x_p)\sim\mathrm{Ber}\big(\mathrm{logistic}(\beta_0+\beta_1x_1+\cdots+\beta_px_p)\big)=\mathrm{Ber}\left(\frac{1}{1+e^{-(\beta_0+\beta_1x_1+\cdots+\beta_px_p)}}\right).$$
Since $Y_i\sim\mathrm{Ber}(p(\mathbf{X}_i))$, $i=1,\ldots,n$, the log-likelihood of $\boldsymbol{\beta}$ is
$$\ell(\boldsymbol{\beta})=\sum_{i=1}^n\log\big(p(\mathbf{X}_i)^{Y_i}(1-p(\mathbf{X}_i))^{1-Y_i}\big)=\sum_{i=1}^n\big\{Y_i\log(p(\mathbf{X}_i))+(1-Y_i)\log(1-p(\mathbf{X}_i))\big\}.$$
Unfortunately, due to the nonlinearity of the optimization problem, there are no explicit expressions for $\hat{\boldsymbol{\beta}}$. They have to be obtained numerically by means of an iterative procedure, which may run into problems in low-sample situations with perfect classification. Unlike in the linear model, inference is not exact from the assumptions, but approximate in terms of maximum likelihood theory. We do not explore this further and refer the interested reader to these notes.
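To make the iterative-estimation idea concrete, here is a minimal sketch that codes the log-likelihood above and maximizes it with R's general-purpose optimizer optim on hypothetical simulated data (the names xSim, ySim, and logLikLogistic are ours, not part of the original code). glm uses a more refined iterative scheme (Fisher scoring), so this is only an illustration of the principle.
# Direct numerical maximization of the logistic log-likelihood -- a sketch on
# hypothetical simulated data, not part of the original example
set.seed(1)
xSim <- rnorm(100)
ySim <- rbinom(100, size = 1, prob = 1 / (1 + exp(-(-0.5 + 2 * xSim))))
XSim <- cbind(1, xSim) # Design matrix with a column of ones
logLikLogistic <- function(beta) {
  p <- 1 / (1 + exp(-(XSim %*% beta)))
  sum(ySim * log(p) + (1 - ySim) * log(1 - p))
}
# optim() minimizes, hence the sign flip
opt <- optim(par = c(0, 0), fn = function(beta) -logLikLogistic(beta),
             method = "BFGS")
opt$par # Close to coef(glm(ySim ~ xSim, family = "binomial"))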
Figure 3.5: The logistic regression fit and its dependence on $\beta_0$ (horizontal displacement) and $\beta_1$ (steepness of the curve). Recall the effect of the sign of $\beta_1$ on the curve: if positive, the logistic curve has an ‘s’ form; if negative, the form is a reflected ‘s.’ Application also available here.
Figure 3.5 shows how the log-likelihood changes with respect to the values of $(\beta_0,\beta_1)$ for three data patterns.
The data for the illustration have been generated with the following code:
# Create the data employed in Figure 3.4

# Data
set.seed(34567)
x <- rnorm(50, sd = 1.5)
y1 <- -0.5 + 3 * x
y2 <- 0.5 - 2 * x
y3 <- -2 + 5 * x
y1 <- rbinom(50, size = 1, prob = 1 / (1 + exp(-y1)))
y2 <- rbinom(50, size = 1, prob = 1 / (1 + exp(-y2)))
y3 <- rbinom(50, size = 1, prob = 1 / (1 + exp(-y3)))

# Data
dataMle <- data.frame(x = x, y1 = y1, y2 = y2, y3 = y3)
Let’s check that indeed the coefficients given by R’s glm are the ones that maximize the likelihood of the animation of Figure 3.5. We do so for y1 ~ x.
# Call glm
# glm employs formula = response ~ predictor1 + predictor2 + ...
# (names according to the data frame names) for denoting the regression
# to be done. We need to specify family = "binomial" to make a
# logistic regression
mod <- glm(y1 ~ x, family = "binomial", data = dataMle)
summary(mod)
##
## Call:
## glm(formula = y1 ~ x, family = "binomial", data = dataMle)
##
## Deviance Residuals:
##      Min        1Q    Median        3Q       Max
## -2.47853  -0.40139   0.02097   0.38880   2.12362
##
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)  -0.1692     0.4725  -0.358 0.720274
## x             2.4282     0.6599   3.679 0.000234 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 69.315 on 49 degrees of freedom
## Residual deviance: 29.588 on 48 degrees of freedom
## AIC: 33.588
##
## Number of Fisher Scoring iterations: 6
# mod is a list with a lot of information
# str(mod) # Long output
# Coefficients
mod$coefficients
## (Intercept)           x
##  -0.1691947   2.4281626
# Plot the fitted regression curve
xGrid <- seq(-5, 5, l = 200)
yGrid <- 1 / (1 + exp(-(mod$coefficients[1] + mod$coefficients[2] * xGrid)))
plot(xGrid, yGrid, type = "l", col = 2, xlab = "x", ylab = "y")
points(x, y1)
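As a final check, and a sketch not present in the original code, the log-likelihood at the fitted coefficients can be computed directly from the Bernoulli expression above and compared with the value that glm reports through logLik; for a binary response, both should coincide.
# Log-likelihood of the fitted logistic model computed by hand -- a sketch
pHat <- fitted(mod) # Fitted probabilities p(X_i)
sum(y1 * log(pHat) + (1 - y1) * log(1 - pHat))
logLik(mod) # Same value, as reported by glm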
9. Not to confuse with a sample!↩︎
10. If that were the case, we would consider perpendicular distances, which lead to Principal Component Analysis (PCA).↩︎
11. We assume that the randomness is on the response only.↩︎
12. We assume that the randomness is on the response only.↩︎
13. An equivalent way of stating this assumption is $p(\mathbf{x})=\mathrm{logistic}(\beta_0+\beta_1x_1+\cdots+\beta_px_p)$.↩︎