6 Nonparametric Regression
Consider the general regression problem $Y = m(X) + \epsilon$, where $Y$ (response) and $X$ (covariate) are real-valued random variables, and we assume that we have data $(Y_1, X_1), \dots, (Y_n, X_n)$ such that $(Y_i, X_i) \sim F_{X,Y}$. Observe that (as long as $Y$ has finite expectation) this model always holds: we can simply set $m(x) = E[Y \mid X = x]$ and $\epsilon = Y - m(X)$, which gives $E[\epsilon \mid X] = 0$.
6.1 Polynomial Regression
In linear regression, we make the assumption that $m$ is linear, that is, $m(x) = \beta_0 + \beta_1 x$ for some unknown $\beta_0, \beta_1 \in \mathbb{R}$. If this relation holds, we can find $\hat\beta_0$ and $\hat\beta_1$ that minimize the least-squares error $\sum_{i=1}^n (Y_i - \beta_0 - \beta_1 X_i)^2$, that is, the OLS estimator. Very often, however, the a priori assumed linear relation is too simple.
Example. We want to infer on the effect of fertilizer ($X$) on crop yield ($Y$). We observe the following data (fertilizer in kg/ha):
x <- c(801, 913, 501, 767, 98, 8, 484)                   ## amount of fertilizer x
y <- c(68.13, 85.26, 68.01, 73.30, 51.14, 37.15, 74.29)  # Crop Yield y
The OLS estimates for $\hat\beta_0$ and $\hat\beta_1$ are (approximately)
lm(y~x)$coefficients
## (Intercept) x
## 44.07862674 0.04163763
One can also plot the corresponding regression line
grid <- seq(0, 1000, by = 10)
plot(x = grid, y = (44.08 + 0.04*grid), type = "l", ylab = "Crop Yield", xlab = "Fertilizer")
points(x=x, y=y)
This model predicts that the more fertilizer we use, the better the crop yield will be. However, this is not realistic: at some point we would expect the crop yield to deteriorate. Assume we also observe the additional data:
x.add <- c(x, 1435, 1165, 1255, 1125, 1308)        ## amount of fertilizer x.add
y.add <- c(y, 58.12, 78.29, 57.44, 73.38, 80.44)   # Crop Yield y.add
The fitted regression taking into account the new data is plotted below as the solid line and the old one is depicted as the dashed line.
###new linear regression
lm(y.add~x.add)$coefficients
## (Intercept) x.add
## 54.52784570 0.01527544
grid.add <- seq(0, 1500, by = 10)
plot(x = grid.add, y = (44.08 + 0.04*grid.add), type = "l", ylab = "crop yield", xlab = "amount of fertilizer", lty = 2)
lines(x=grid.add, y=(54.53+ 0.015*grid.add), type = "l")
points(x=x.add, y=y.add)
The regression line using only the old data looks quite misspecified, while the one that uses all the data looks somewhat better. However, especially for low and high amounts of fertilizer, the prediction is still not particularly good. The problem is that the data do not really indicate a linear relation between $Y$ and $X$.
A possible way out is to still use linear models, but to allow the regression curve to have shapes other than a straight line. For instance, one might fit a quadratic relation and assume $m(x) = \beta_0 + \beta_1 x + \beta_2 x^2$. In that case, we can still use the OLS estimator to derive
lm(y.add~poly(x.add,2,raw=TRUE))$coefficients
## (Intercept) poly(x.add, 2, raw = TRUE)1 poly(x.add, 2, raw = TRUE)2
## 39.528880345 0.085859379 -0.000049017
Especially at the ends of the curves, the prediction seems to be better than for the regression line:
plot(x = grid.add, y = (39.52 + 0.085*grid.add - 0.00005*grid.add^2), ylim = c(30, 90), type = "l", ylab = "crop yield", xlab = "amount of fertilizer")
lines(x=grid.add, y=(54.53+ 0.015*grid.add), type = "l", lty =2)
points(x=x.add, y=y.add)
The previous example indicates that increasing the complexity of the model can sometimes help to achieve better predictions. So why don't we just assume a very high polynomial degree $p$, so that the model can fit very intricate patterns?
What happens if we assume that the degree of the polynomial is $p = 10$, that is, $m(x) = \beta_{10} x^{10} + \dots + \beta_1 x + \beta_0$? In this case, the model could capture any polynomial relationship up to order 10, but the fit does not look very pleasing.
LM.10 <- lm(y.add ~ poly(x.add, 10, raw = TRUE))$coefficients
length(grid.add)
## [1] 151
fit <- numeric(151) + as.numeric(LM.10[1])
for (i in 1:10){
  fit <- fit + as.numeric(LM.10[i+1]) * (grid.add^i)
}
plot(x=grid.add, y=fit, type = "l", ylab = "crop yield", xlab = "amount of fertilizer")
points(x=x.add, y=y.add)
Obviously, the model overfits the data. Indeed, if we had left just one point out of the regression, the fitted curve would look very different.
LM.10.Lastleftout <- lm(y.add[-12] ~ poly(x.add[-12], 10, raw = TRUE))$coefficients
fit.Lastleftout <- numeric(151) + as.numeric(LM.10.Lastleftout[1])
for (i in 1:10){
  fit.Lastleftout <- fit.Lastleftout + as.numeric(LM.10.Lastleftout[i+1]) * (grid.add^i)
}
plot(x = grid.add, y = fit.Lastleftout, type = "l", ylab = "crop yield", xlab = "amount of fertilizer")
points(x=x.add, y=y.add)
The predicted value for the point that was left out is approximately -70.55, while the observed value is 80.44. Arguably, using a polynomial of degree 10 would lead to very unreliable predictions.
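For completeness, this prediction can be recovered by evaluating the leave-one-out fit at the omitted covariate value; the following minimal sketch just reuses LM.10.Lastleftout, x.add and y.add from the code above.
## evaluate the leave-one-out degree-10 polynomial at the left-out point x.add[12]
sum(as.numeric(LM.10.Lastleftout) * x.add[12]^(0:10))   # predicted crop yield (approx. -70.55)
y.add[12]                                               # observed crop yield (80.44)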
How can we decide, on the basis of the data, which model we should choose? This is quite difficult in general. One thing we could do is determine an adequate degree of the polynomial by cross-validation, that is, we choose $p \in \mathbb{N}$ such that it minimizes $$CV(p) = \sum_{i=1}^n \bigl(Y_i - \hat{m}_{p,-i}(X_i)\bigr)^2,$$ where $\hat{m}_{p,-i}$ is the regression polynomial of order $p$ fitted (via OLS) to the data $(Y_1, X_1), \dots, (Y_{i-1}, X_{i-1}), (Y_{i+1}, X_{i+1}), \dots, (Y_n, X_n)$.
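As an illustration, here is a minimal sketch of this leave-one-out criterion in R; the helper cv.poly and the maximal candidate degree p.max are illustrative choices, not prescribed by the text.
## leave-one-out cross-validation for the polynomial degree (illustrative sketch)
cv.poly <- function(x, y, p.max = 4) {
  n  <- length(y)
  cv <- numeric(p.max)
  for (p in 1:p.max) {
    sq.err <- numeric(n)
    for (i in 1:n) {
      dat.i     <- data.frame(x = x[-i], y = y[-i])              # leave observation i out
      fit.i     <- lm(y ~ poly(x, p, raw = TRUE), data = dat.i)  # fit degree-p polynomial
      pred.i    <- predict(fit.i, newdata = data.frame(x = x[i]))
      sq.err[i] <- (y[i] - pred.i)^2
    }
    cv[p] <- sum(sq.err)                                         # CV(p)
  }
  cv
}
## e.g. cv.poly(x.add, y.add) returns CV(1),...,CV(4); choose the degree that minimizes it
Wrapping the training data in a data.frame ensures that predict() can evaluate the fitted polynomial at the left-out covariate value.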
While in practice cross-validation yields a reasonable criterion for choosing the polynomial degree, one can also understand theoretically why choosing a degree that is unnecessarily high leads to bad results. Indeed, if the true regression function is itself a polynomial of degree at most $p$, one can show that the mean averaged squared error (MASE) of the OLS estimator is $$MASE = \frac{1}{n}\, E\!\left[\sum_{i=1}^n \bigl(m_p(X_i) - \hat{m}_p(X_i)\bigr)^2\right] = \frac{(p+1)\sigma^2}{n},$$ where $\sigma^2$ is the variance of the error term: every unnecessary degree adds another $\sigma^2/n$ to the estimation error.
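To see this formula at work, one can check it by simulation. The sketch below uses an ad hoc setup (a fixed design on [0, 1], a true quadratic regression function and a noise level sigma, none of which are taken from the text); since the true $m$ is a polynomial of degree at most $p$, the OLS fit is unbiased and the Monte Carlo average should be close to $(p+1)\sigma^2/n$.
## Monte Carlo check of MASE = (p+1) * sigma^2 / n (illustrative setup)
set.seed(1)
n      <- 50
p      <- 5
sigma  <- 2
x.des  <- seq(0, 1, length.out = n)           # fixed design points
m.true <- function(x) 1 + 2*x - 3*x^2         # true regression function (degree 2 <= p)
ase <- replicate(2000, {
  y.sim <- m.true(x.des) + rnorm(n, sd = sigma)
  fit   <- lm(y.sim ~ poly(x.des, p, raw = TRUE))
  mean((fitted(fit) - m.true(x.des))^2)       # averaged squared error over the design
})
mean(ase)               # Monte Carlo estimate of the MASE
(p + 1) * sigma^2 / n   # theoretical value: 0.48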
Often, however, a polynomial relation is not an adequate assumption, and polynomials are not adequate basis functions for a series estimator. An alternative is to use purely nonparametric kernel estimators.
6.2 Nadaraya-Watson Estimators
The idea of local kernel estimators is to estimate $m(x)$ for different $x$ by a weighted average of the sampled response values, giving higher weight to observations whose $X_i$-value is close to $x$. Precisely, the Nadaraya-Watson estimator takes the form $$\hat{m}_h(x) = \frac{\sum_{i=1}^n K\!\left(\frac{x - X_i}{h}\right) Y_i}{\sum_{j=1}^n K\!\left(\frac{x - X_j}{h}\right)},$$ where $K$ is a kernel function and $h > 0$ is the bandwidth. There are many possible choices of kernel, but popular ones include the Gaussian, the Epanechnikov, and the biweight kernel. Kernel estimators that use the Gaussian kernel take into account all available data when estimating $m(x)$ at a point, while estimators using the Epanechnikov or the biweight kernel only take into account those $X_i$ whose distance to $x$ is less than $h$. While an adequate choice of kernel is important, it is crucial to choose an appropriate bandwidth $h$: if the bandwidth is chosen too small, kernel estimators tend to overfit the data, while a bandwidth that is too large oversmooths.
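To illustrate, here is a minimal sketch of the Nadaraya-Watson estimator with a Gaussian kernel applied to the fertilizer data from above; the function nw.estimate and the bandwidth value h = 150 are illustrative choices, not given in the text.
## Nadaraya-Watson estimate with a Gaussian kernel (illustrative sketch)
nw.estimate <- function(x0, x, y, h) {
  sapply(x0, function(z) {
    w <- dnorm((z - x) / h)      # kernel weights K((x0 - X_i)/h)
    sum(w * y) / sum(w)          # weighted average of the responses
  })
}
plot(x = x.add, y = y.add, xlab = "amount of fertilizer", ylab = "crop yield")
lines(x = grid.add, y = nw.estimate(grid.add, x.add, y.add, h = 150))
Base R also offers ksmooth(), which implements this type of estimator for the box and the normal kernel (note that its bandwidth argument uses a different scaling); re-running the plot with several values of h is a quick way to see the trade-off between overfitting and oversmoothing.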