4.2 Asymptotic properties
The asymptotic properties of the local polynomial estimator give us valuable insights into its performance. In particular, they allow us to answer the following questions precisely:
What affects the performance of the local polynomial estimator? Is local linear estimation better than local constant estimation? What is the effect of \(h\) on the estimates?
The asymptotic analysis of the local linear and local constant estimators131 is achieved, as done in Sections 2.3 and 3.3, by examining the asymptotic bias and variance.
In order to establish a framework for the analysis, we consider the so-called location-scale model for \(Y\) and its predictor \(X\):
\[\begin{align*} Y=m(X)+\sigma(X)\varepsilon, \end{align*}\]
where
\[\begin{align*} \sigma^2(x):=\mathbb{V}\mathrm{ar}[Y| X=x] \end{align*}\]
is the conditional variance of \(Y\) given \(X,\) and \(\varepsilon\) is such that \(\mathbb{E}[\varepsilon]=0\) and \(\mathbb{V}\mathrm{ar}[\varepsilon]=1.\) Recall that, since the conditional variance is not forced to be constant, we are implicitly allowing for heteroskedasticity.132
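The following minimal sketch simulates a sample from a location-scale model. The particular \(m,\) \(\sigma,\) and distribution of \(X\) below are arbitrary illustrative choices, not prescribed by the model.

```r
# Minimal simulation from the location-scale model Y = m(X) + sigma(X) * eps
# (m, sigma, and the distribution of X are arbitrary illustrative choices)
set.seed(42)
n <- 500
X <- rnorm(n)                             # predictor with marginal pdf f = dnorm
m <- function(x) x^2 * cos(x)             # hypothetical regression function
sigma <- function(x) 0.5 + 0.25 * abs(x)  # heteroskedastic conditional sd
eps <- rnorm(n)                           # E[eps] = 0, Var[eps] = 1
Y <- m(X) + sigma(X) * eps
plot(X, Y, pch = 16, cex = 0.5, col = "gray")
curve(m(x), add = TRUE, col = 2, lwd = 2)  # true regression function
```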
Note that for the derivation of the Nadaraya–Watson estimator and the local polynomial estimator we did not make any particular assumption, beyond the (implicit) differentiability of \(m\) up to order \(p\) for the local polynomial estimator. The following assumptions133 are the only requirements to perform the asymptotic analysis of the estimator:
- A1.134 \(m\) is twice continuously differentiable.
- A2.135 \(\sigma^2\) is continuous and positive.
- A3.136 \(f,\) the marginal pdf of \(X,\) is continuously differentiable and bounded away from zero.137
- A4.138 The kernel \(K\) is a symmetric and bounded pdf with finite second moment and is square integrable.
- A5.139 \(h=h_n\) is a deterministic sequence of bandwidths such that, when \(n\to\infty,\) \(h\to0\) and \(nh\to\infty.\)
The bias and variance are studied in their conditional versions given the predictor’s sample \(X_1,\ldots,X_n.\) The reason for analyzing the conditional instead of the unconditional versions is to avoid the technical difficulties that integration with respect to the unknown predictor’s density may pose. This is in the spirit of what was done in parametric inference (see Sections B.1.2 and B.2.2).
The main result follows. It provides useful insights into the effect of \(p,\) \(m,\) \(f\) (standing from now on for the marginal pdf of \(X\)), and \(\sigma^2\) on the performance of \(\hat{m}(\cdot;p,h)\) for \(p=0,1.\)
Theorem 4.1 Under A1–A5, the conditional bias and variance of the local constant (\(p=0\)) and local linear (\(p=1\)) estimators are
\[\begin{align} \mathrm{Bias}[\hat{m}(x;p,h)| X_1,\ldots,X_n]&=B_p(x)h^2+o_\mathbb{P}(h^2),\tag{4.16}\\ \mathbb{V}\mathrm{ar}[\hat{m}(x;p,h)| X_1,\ldots,X_n]&=\frac{R(K)}{nhf(x)}\sigma^2(x)+o_\mathbb{P}((nh)^{-1}),\tag{4.17} \end{align}\]
where
\[\begin{align*} B_p(x):=\begin{cases} \frac{\mu_2(K)}{2}\left\{m''(x)+2\frac{m'(x)f'(x)}{f(x)}\right\},&\text{ if }p=0,\\ \frac{\mu_2(K)}{2}m''(x),&\text{ if }p=1. \end{cases} \end{align*}\]
Remark. The little-\(o_\mathbb{P}\)s in (4.16) and (4.17) appear (instead of little-\(o\)s as in Theorem 2.1) because \(\mathrm{Bias}[\hat{m}(x;p,h)| X_1,\ldots,X_n]\) and \(\mathbb{V}\mathrm{ar}[\hat{m}(x;p,h)| X_1,\ldots,X_n]\) are random variables. Then, the asymptotic expansions of these random variables have stochastic remainders that converge to zero in probability at specific rates.
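Before interpreting (4.16) and (4.17), it is instructive to check them numerically. For a fixed design and a constant conditional variance, the conditional bias and variance of the local constant estimator can be computed exactly and compared with the leading terms \(B_0(x)h^2\) and \(R(K)\sigma^2(x)/(nhf(x)).\) The sketch below does so; the choices of \(m,\) \(\sigma^2,\) design, \(h,\) and \(x\) are illustrative assumptions.

```r
# Numerical check of the leading terms in (4.16)-(4.17) for the local constant
# (p = 0) estimator, conditionally on a fixed design (illustrative choices)
set.seed(1)
n <- 500; h <- 0.1; x <- 0.25
m <- function(x) sin(2 * pi * x)
sigma2 <- 0.25                       # constant conditional variance
X <- runif(n)                        # design with f = 1 on (0, 1), hence f' = 0

# Nadaraya-Watson weights at x for a normal kernel
W <- dnorm(x, mean = X, sd = h)
W <- W / sum(W)

# Exact conditional bias and variance given X_1, ..., X_n
bias_exact <- sum(W * m(X)) - m(x)
var_exact <- sigma2 * sum(W^2)

# Leading terms of Theorem 4.1: mu_2(K) = 1 and R(K) = 1 / (2 * sqrt(pi)) for
# the normal kernel; since f' = 0 here, B_0(x) = m''(x) / 2
B0 <- -0.5 * (2 * pi)^2 * sin(2 * pi * x)
c(bias_exact = bias_exact, bias_leading = B0 * h^2)
c(var_exact = var_exact, var_leading = sigma2 / (2 * sqrt(pi) * n * h))
```

The exact conditional quantities and the leading terms should be close, with discrepancies of the smaller orders \(o_\mathbb{P}(h^2)\) and \(o_\mathbb{P}((nh)^{-1}).\)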
The bias and variance expressions (4.16) and (4.17) yield very interesting insights:
Bias:
The bias decreases quadratically with \(h\) for both \(p=0,1.\) That means that small bandwidths \(h\) give estimators with low bias, whereas large bandwidths give heavily biased estimators.
For \(p=1,\) the bias at \(x\) is directly proportional to \(m''(x).\) Therefore:
- The bias is negative in regions where \(m\) is concave, i.e., \(\{x\in\mathbb{R}:m''(x)<0\}.\) These regions correspond to peaks and local maxima of \(m.\)
- Conversely, the bias is positive in regions where \(m\) is convex, i.e., \(\{x\in\mathbb{R}:m''(x)>0\}.\) These regions correspond to valleys and local minima of \(m.\)
- All in all, the “wilder” the curvature of \(m,\) the larger the bias and the harder it is to estimate \(m.\)
For \(p=0,\) the bias at \(x\) is more involved and is affected by \(m''(x),\) \(m'(x),\) \(f'(x),\) and \(f(x)\):
- The quantities \(m'(x),\) \(f'(x),\) and \(f(x)\) are not present in the bias when \(p=1.\) Precisely, for the local constant estimator, the lower the density \(f(x),\) the larger the bias (in absolute value). Also, the faster \(m\) and \(f\) change at \(x\) (larger derivatives), the larger the bias. Thus, the bias of the local constant estimator is much more sensitive to \(m'(x),\) \(f'(x),\) and \(f(x)\) than that of the local linear estimator (which is sensitive to \(m''(x)\) only). In particular, the dependence on \(f'(x)\) and \(f(x)\) is referred to as the design bias, since it stems merely from the predictor’s distribution (see the worked example after this list).
- As for \(p=1,\) \(m''(x)\) also contributes to the bias when \(p=0;\) this contribution is negative in regions corresponding to peaks and local maxima of \(m,\) and positive in the valleys and local minima of \(m.\) In general, the “wilder” the curvature of \(m,\) the larger its contribution to the bias and the harder it is to estimate \(m.\)
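To make the design bias concrete, consider, as a purely illustrative example, a linear regression function and a standard normal design. Since \(m''(x)=0,\) \(m'(x)=a,\) and \(f'(x)/f(x)=-x,\)

\[\begin{align*} m(x)=ax+b,\quad X\sim\mathcal{N}(0,1)\implies B_1(x)=0,\quad B_0(x)=\frac{\mu_2(K)}{2}\left\{0+2a(-x)\right\}=-\mu_2(K)ax. \end{align*}\]

Hence the local linear estimator has no first-order bias (indeed, by Exercise 4.10 below it is exactly unbiased for linear \(m\)), whereas the local constant estimator flattens the fit towards the center of the design, the more so the further \(x\) is from the bulk of the data.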
Variance:
- The main term of the variance is the same for \(p=0,1\). In addition, it depends directly on \(\frac{\sigma^2(x)}{f(x)}.\) As a consequence, the lower the density, the more variable \(\hat{m}(x;p,h)\) is.140 Also, the larger the conditional variance at \(x,\) \(\sigma^2(x),\) the more variable \(\hat{m}(x;p,h)\) is.141
- The variance decreases at the rate \((nh)^{-1}\). This is related to the so-called effective sample size \(nh,\) which can be thought of as the amount of data in the neighborhood of \(x\) that is employed for performing the regression.142 The resulting bias-variance trade-off in \(h\) is made explicit right after this list.
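Combining (4.16) and (4.17) makes the role of \(h\) explicit: the conditional mean squared error of \(\hat{m}(x;p,h)\) is

\[\begin{align*} \mathrm{MSE}[\hat{m}(x;p,h)| X_1,\ldots,X_n]=B_p(x)^2h^4+\frac{R(K)\sigma^2(x)}{nhf(x)}+o_\mathbb{P}(h^4+(nh)^{-1}), \end{align*}\]

so decreasing \(h\) reduces the bias but inflates the variance, and increasing \(h\) does the opposite. Minimizing the two leading terms in \(h\) (assuming \(B_p(x)\neq0\)) gives the pointwise optimal bandwidth

\[\begin{align*} h_{\mathrm{AMSE}}(x)=\left[\frac{R(K)\sigma^2(x)}{4B_p(x)^2f(x)n}\right]^{1/5}\propto n^{-1/5}. \end{align*}\]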
All in all, the main takeaway of the analysis of \(p=0\) vs. \(p=1\) is:
\(p=1\) has, in general, smaller bias than that of \(p=0\) (but of the same order) while keeping the same variance as \(p=0\).
An extended version of Theorem 4.1, given in Theorem 3.1 in Fan and Gijbels (1996), shows that this phenomenon extends to higher orders: odd-order (\(p=2\nu+1,\) \(\nu\in\mathbb{N}\)) polynomial fits introduce an extra coefficient that allows them to reduce the bias while maintaining the same variance as the preceding143 even-order fits (\(p=2\nu\)). So, for example, local cubic fits are preferred to local quadratic fits. This motivates the claim that local polynomial fitting is an odd world (Fan and Gijbels (1996)).
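The following minimal sketch illustrates the \(p=0\) vs. \(p=1\) takeaway in the design-bias scenario of the worked example above (linear \(m,\) standard normal design). The local constant and local linear estimators are coded directly from their definitions with a normal kernel; the data-generating choices and the bandwidth are illustrative assumptions.

```r
# Local constant (p = 0) vs. local linear (p = 1) fits for a linear m and a
# normal design: the local constant fit exhibits the design bias of order h^2,
# while the local linear fit tracks the line (illustrative choices throughout)
set.seed(2)
n <- 500; h <- 0.5
X <- rnorm(n)
Y <- X + 0.5 * rnorm(n)  # m(x) = x with homoskedastic noise

# Local constant (Nadaraya-Watson) estimator with a normal kernel
nw <- function(x, X, Y, h) sapply(x, function(x0) {
  w <- dnorm(X, mean = x0, sd = h)
  sum(w * Y) / sum(w)
})

# Local linear estimator: intercept of a kernel-weighted least squares fit
ll <- function(x, X, Y, h) sapply(x, function(x0) {
  w <- dnorm(X, mean = x0, sd = h)
  coef(lm(Y ~ I(X - x0), weights = w))[1]
})

x_grid <- seq(-3, 3, l = 121)
plot(X, Y, pch = 16, cex = 0.5, col = "gray")
abline(a = 0, b = 1, lwd = 2)                         # true m
lines(x_grid, nw(x_grid, X, Y, h), col = 2, lwd = 2)  # p = 0: bends towards the mean
lines(x_grid, ll(x_grid, X, Y, h), col = 4, lwd = 2)  # p = 1: follows the line
legend("topleft", legend = c("m", "p = 0", "p = 1"), col = c(1, 2, 4), lwd = 2)
```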
Finally, we have the asymptotic pointwise normality of the estimator, a result analogous to Theorem 2.2 that is helpful for obtaining pointwise confidence intervals for \(m\) afterwards.
Theorem 4.2 Assume that \(\mathbb{E}[(Y-m(x))^{2+\delta}\vert X=x]<\infty\) for some \(\delta>0.\) Then, under A1–A5,
\[\begin{align} &\sqrt{nh}(\hat m(x;p,h)-\mathbb{E}[\hat m(x;p,h)|X_1,\ldots,X_n])\stackrel{d}{\longrightarrow}\mathcal{N}\left(0,\frac{R(K)\sigma^2(x)}{f(x)}\right),\tag{4.18}\\ &\sqrt{nh}\left(\hat m(x;p,h)-m(x)-B_p(x)h^2\right)\stackrel{d}{\longrightarrow}\mathcal{N}\left(0,\frac{R(K)\sigma^2(x)}{f(x)}\right).\tag{4.19} \end{align}\]
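As a rough illustration of how (4.18) can be turned into pointwise confidence intervals, the sketch below builds an asymptotic \((1-\alpha)\) interval about the Nadaraya–Watson estimate by plugging in ad hoc estimates of \(\sigma^2(x)\) (a locally weighted residual variance) and \(f(x)\) (a kde), and by ignoring the bias term \(B_p(x)h^2\) (so the interval is really for \(\mathbb{E}[\hat{m}(x;p,h)|X_1,\ldots,X_n]\)). The function nw_ci and its plug-ins are illustrative choices, not necessarily the construction pursued afterwards.

```r
# Asymptotic 95% pointwise confidence intervals based on (4.18) for the
# Nadaraya-Watson estimator with a normal kernel (illustrative plug-ins,
# bias term ignored)
nw_ci <- function(x, X, Y, h, alpha = 0.05) {

  # Kernel evaluations K((x - X_j) / h): length(x) x length(X) matrix
  K <- dnorm(outer(x, X, "-") / h)
  W <- K / rowSums(K)                # Nadaraya-Watson weights at each x
  m_hat <- drop(W %*% Y)             # estimates at each x

  # Ad hoc plug-ins: kde of f at x and locally weighted residual variance
  f_hat <- rowMeans(K) / h
  sigma2_hat <- drop(W %*% Y^2) - m_hat^2

  # Asymptotic standard error from (4.18); R(K) = 1 / (2 * sqrt(pi)) for dnorm
  se <- sqrt(sigma2_hat / (2 * sqrt(pi) * length(X) * h * f_hat))
  z <- qnorm(1 - alpha / 2)
  cbind(est = m_hat, lwr = m_hat - z * se, upr = m_hat + z * se)
}

# Usage on a sample simulated from the location-scale model sketch above
set.seed(42)
n <- 500
X <- rnorm(n)
Y <- X^2 * cos(X) + (0.5 + 0.25 * abs(X)) * rnorm(n)
x_grid <- seq(-2, 2, l = 100)
ci <- nw_ci(x = x_grid, X = X, Y = Y, h = 0.25)
matplot(x_grid, ci, type = "l", lty = c(1, 2, 2), col = c(1, 2, 2),
        xlab = "x", ylab = "m(x)")
```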
Exercise 4.10 Theorem 4.1 gives some additional insights with respect to \(B_p(x),\) the dominating term of the bias:
- If \(m\) is constant,144 then \(B_0(x)=0.\)
- If \(m\) is linear,145 then \(B_1(x)=0.\)
That is, for each of these two cases, \(\mathrm{Bias}[\hat{m}(x;p,h)| X_1,\ldots,X_n]=o_\mathbb{P}(h^2).\) The local constant and local linear estimators are actually exactly unbiased when estimating constant and linear regression functions, respectively. That is, \(\mathbb{E}_c[\hat{m}(x;0,h)| X_1,\ldots,X_n]=c\) and \(\mathbb{E}_{a,b}[\hat{m}(x;1,h)| X_1,\ldots,X_n]=ax+b,\) where \(\mathbb{E}_c[\cdot|X_1,\ldots,X_n]\) and \(\mathbb{E}_{a,b}[\cdot|X_1,\ldots,X_n]\) represent the conditional expectations under the constant and linear models, respectively. Prove these two results.
References

Fan, J. and Gijbels, I. (1996). Local Polynomial Modelling and Its Applications, volume 66 of Monographs on Statistics and Applied Probability. Chapman & Hall, London.
We do not address the analysis of the general case in which \(p\geq1.\) The reader is referred to, for example, Theorem 3.1 in Fan and Gijbels (1996) for the full analysis.↩︎
In linear models, homoscedasticity is one of the key assumptions for performing inference (Section B.1.2).↩︎
Recall that these are the only assumptions made on the model so far. Compared with the ones made by linear models or generalized linear models, they are extremely mild. Recall that \(Y\) is not assumed to be continuous.↩︎
This assumption requires certain smoothness of the regression function, thus allowing Taylor expansions to be performed. Note that the assumption concerns \(m,\) not the estimator: \(\hat{m}(\cdot;p,h)\) is infinitely differentiable if the considered kernels \(K\) are so too.↩︎
It avoids the situation in which \(Y\) is a degenerate random variable.↩︎
It avoids the degenerate situation in which \(m\) is estimated in regions without observations of the predictor (such as holes in the support of \(X\)).↩︎
Meaning that there exists a positive lower bound for \(f.\)↩︎
Mild assumption inherited from the kde.↩︎
Key assumption for reducing the bias and variance of \(\hat{m}(\cdot;p,h)\) simultaneously.↩︎
Recall that this makes perfect sense: low-density regions of \(X\) imply less information available about \(m.\)↩︎
The same happened in the linear model with the error variance \(\sigma^2.\)↩︎
The variance of an unweighted mean is reduced by a factor \(n^{-1}\) when \(n\) observations are employed. To compute \(\hat{m}(x;p,h),\) \(n\) observations are used but in a weighted fashion that roughly amounts to considering \(nh\) unweighted observations.↩︎
Since the variance increases as \(\nu\) does, not as \(p\) does.↩︎
\(m(x)=c\) for all \(x\in\mathbb{R}\) and given \(c\in\mathbb{R}.\)↩︎
\(m(x)=ax+b\) for all \(x\in\mathbb{R}\) and given \(a,b\in\mathbb{R}.\)↩︎