## 4.2 Asymptotic properties

The asymptotic properties of the local polynomial estimator give us valuable insights into its performance. In particular, they allow us to answer, precisely, the following questions:

- What affects the performance of the local polynomial estimator?
- Is local linear estimation better than local constant estimation?
- What is the effect of \(h\) on the estimates?

The asymptotic analysis of the local linear and local constant estimators^{130} is achieved, as done in Sections 2.3 and 3.3, by examining the asymptotic bias and variance.

In order to establish a framework for the analysis, we consider the so-called *location-scale model* for \(Y\) and its predictor \(X\):

\[\begin{align*} Y=m(X)+\sigma(X)\varepsilon, \end{align*}\]

where

\[\begin{align*} \sigma^2(x):=\mathbb{V}\mathrm{ar}[Y| X=x] \end{align*}\]

is the *conditional variance* of \(Y\) given \(X\), and \(\varepsilon\) is such that \(\mathbb{E}[\varepsilon]=0\) and \(\mathbb{V}\mathrm{ar}[\varepsilon]=1\). Recall that, since the conditional variance is not forced to be constant, we are implicitly allowing for *heteroskedasticity*.^{131}
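A quick way to internalize the location-scale model is to simulate from it. The sketch below (Python; the choices of \(m\), \(\sigma\), and the design density are illustrative, not from the text) draws from the model and checks the conditional mean and variance on a thin slice about a point:

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative choices (not from the text): m, sigma, and the design density.
m = lambda x: np.sin(2 * np.pi * x)      # regression function m
sigma = lambda x: 0.5 + 0.5 * x          # conditional std dev -> heteroskedasticity
n = 100_000

X = rng.uniform(0, 1, n)                 # predictor with uniform density f
eps = rng.normal(0, 1, n)                # E[eps] = 0, Var[eps] = 1
Y = m(X) + sigma(X) * eps                # location-scale model

# Check the model's moments on a thin slice around x0 = 0.5:
x0 = 0.5
slice_ = np.abs(X - x0) < 0.02
cond_mean = Y[slice_].mean()             # ~ m(0.5) = 0
cond_var = Y[slice_].var()               # ~ sigma^2(0.5) = 0.5625
```

The slice averages approximate \(m(x_0)\) and \(\sigma^2(x_0)\) up to Monte Carlo error and the variation of \(m\) within the slice.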

Note that for the derivation of the Nadaraya–Watson estimator and the local polynomial estimator we did not impose any particular assumption, beyond the (implicit) differentiability of \(m\) up to order \(p\) for the local polynomial estimator. The following assumptions^{132} are the only requirements to perform the asymptotic analysis of the estimator:

- **A1**.^{133} \(m\) is twice continuously differentiable.
- **A2**.^{134} \(\sigma^2\) is continuous and positive.
- **A3**.^{135} \(f\), the marginal pdf of \(X\), is continuously differentiable and *bounded away from zero*.^{136}
- **A4**.^{137} The kernel \(K\) is a symmetric and bounded pdf with finite second moment and is square integrable.
- **A5**.^{138} \(h=h_n\) is a deterministic sequence of bandwidths such that, when \(n\to\infty\), \(h\to0\) and \(nh\to\infty\).

The bias and variance are studied in their *conditional* versions on the predictor’s sample \(X_1,\ldots,X_n\). The reason for analyzing the conditional instead of the *unconditional* versions is to avoid technical difficulties that integration with respect to the unknown predictor’s density may pose. This is in the spirit of what was done in parametric inference (see Sections B.1.2 and B.2.2).

The main result follows. It provides useful insights into the effect of \(p\), \(m\), \(f\) (standing from now on for the marginal pdf of \(X\)), and \(\sigma^2\) on the performance of \(\hat{m}(\cdot;p,h)\) for \(p=0,1\).

**Theorem 4.1 **Under **A1**–**A5**, the conditional bias and variance of the local constant (\(p=0\)) and local linear (\(p=1\)) estimators are

\[\begin{align} \mathrm{Bias}[\hat{m}(x;p,h)| X_1,\ldots,X_n]&=B_p(x)h^2+o_\mathbb{P}(h^2),\tag{4.16}\\ \mathbb{V}\mathrm{ar}[\hat{m}(x;p,h)| X_1,\ldots,X_n]&=\frac{R(K)}{nhf(x)}\sigma^2(x)+o_\mathbb{P}((nh)^{-1}),\tag{4.17} \end{align}\]

where

\[\begin{align*} B_p(x):=\begin{cases} \frac{\mu_2(K)}{2}\left\{m''(x)+2\frac{m'(x)f'(x)}{f(x)}\right\},&\text{ if }p=0,\\ \frac{\mu_2(K)}{2}m''(x),&\text{ if }p=1. \end{cases} \end{align*}\]

*Remark. * The little-\(o_\mathbb{P}\)s in (4.16) and (4.17) appear (instead of little-\(o\)s as in Theorem 2.1) because \(\mathrm{Bias}[\hat{m}(x;p,h)| X_1,\ldots,X_n]\) and \(\mathbb{V}\mathrm{ar}[\hat{m}(x;p,h)| X_1,\ldots,X_n]\) are *random variables*. Then, the asymptotic expansions of these random variables have stochastic remainders that converge to zero *in probability* at specific rates.
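Theorem 4.1’s bias prediction can be checked by simulation. The sketch below (Python; the regression function, design, and constants are illustrative choices, not from the text) implements the local linear estimator as a kernel-weighted least squares fit and verifies that the bias at a peak of \(m\) is negative and of the order \(\frac{\mu_2(K)}{2}m''(x)h^2\) (for the Gaussian kernel, \(\mu_2(K)=1\)):

```python
import numpy as np

rng = np.random.default_rng(1)

def local_poly(x, X, Y, h, p):
    """Local polynomial estimate of m(x): weighted least squares with a
    Gaussian kernel (minimal sketch, no boundary or stability safeguards)."""
    w = np.sqrt(np.exp(-0.5 * ((X - x) / h) ** 2))  # sqrt of kernel weights
    return np.polyfit(X - x, Y, deg=p, w=w)[-1]     # intercept = estimate at x

# Illustrative setup (not from the text): peak of m at x = 0, uniform design.
m = lambda x: np.exp(-x**2 / 2)   # m''(0) = -1 < 0: concave, a local maximum
n, h, sigma, M = 500, 0.5, 0.2, 200

est = []
for _ in range(M):
    X = rng.uniform(-2, 2, n)
    Y = m(X) + sigma * rng.normal(size=n)
    est.append(local_poly(0.0, X, Y, h=h, p=1))

bias_p1 = np.mean(est) - m(0.0)
# Theorem 4.1 predicts bias ~ (mu_2(K) / 2) * m''(0) * h^2 = -0.125 here,
# up to the o_P(h^2) remainder (h = 0.5 is only moderately small).
```

Shrinking \(h\) moves the Monte Carlo bias toward zero quadratically, at the price of a larger variance.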

The bias and variance expressions (4.16) and (4.17) yield very interesting insights:

Bias:

- The **bias decreases quadratically with \(h\)** for both \(p=0,1\). That means that small bandwidths \(h\) give estimators with low bias, whereas large bandwidths provide largely biased estimators.
- For \(p=1\), the bias at \(x\) is directly proportional to \(m''(x)\). Therefore:
  - The *bias is negative* in regions where \(m\) is concave, i.e., \(\{x\in\mathbb{R}:m''(x)<0\}\). These regions correspond to *peaks and local maxima of \(m\)*.
  - Conversely, the *bias is positive* in regions where \(m\) is convex, i.e., \(\{x\in\mathbb{R}:m''(x)>0\}\). These regions correspond to *valleys and local minima of \(m\)*.
  - All in all, **the “wilder” the curvature** of \(m\), the larger the bias and **the harder it is to estimate \(m\)**.
- For \(p=0\), the bias at \(x\) is more convoluted and is affected by \(m''(x)\), \(m'(x)\), \(f'(x)\), and \(f(x)\):
  - The quantities \(m'(x)\), \(f'(x)\), and \(f(x)\) are not present in the bias when \(p=1\). Precisely, for the local constant estimator, the lower the density \(f(x)\), the larger the bias (in absolute value). Also, the faster \(m\) and \(f\) change at \(x\) (derivatives), the larger the bias. Thus **the bias of the local constant estimator is much more sensitive to \(m(x)\) and \(f(x)\)** than the local linear one (which is sensitive to \(m''(x)\) only). In particular, the fact that it depends on \(f'(x)\) and \(f(x)\) is referred to as the *design bias*, since it depends merely on the predictor’s distribution.
  - As for \(p=1\), \(m''(x)\) contributes to the bias when \(p=0\), this contribution being *negative* in regions corresponding to peaks and local maxima of \(m\), and *positive* in the valleys and local minima of \(m\). In general, the “wilder” the curvature of \(m\), the larger its contribution to the bias and the harder it is to estimate \(m\).

Variance:

- The main term of the **variance is the same for \(p=0,1\)**. In addition, it depends directly on \(\frac{\sigma^2(x)}{f(x)}\). As a consequence, the lower the density, the more variable \(\hat{m}(x;p,h)\) is.^{139} Also, the larger the conditional variance at \(x\), \(\sigma^2(x)\), the more variable \(\hat{m}(x;p,h)\) is.^{140}
- The **variance decreases at a factor of \((nh)^{-1}\)**. This is related to the so-called *effective sample size* \(nh\), which can be thought of as the amount of data in the neighborhood of \(x\) that is employed for performing the regression.^{141}
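The effective-sample-size heuristic is easy to check empirically: under a uniform design, the number of observations within an \(h\)-neighborhood of a point concentrates around \(2nhf(x)\), which is of the order \(nh\). A sketch with hypothetical numbers:

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical numbers: uniform design on [0, 1], so f(x) = 1 there.
n, h, x0 = 10_000, 0.05, 0.5
X = rng.uniform(0, 1, n)

# Observations within an h-neighborhood of x0: about 2 * n * h * f(x0) = 1000,
# of the same order as the effective sample size nh.
local_n = int(np.sum(np.abs(X - x0) < h))
```

With a smooth kernel the weighting is soft rather than a hard window, but the count above captures the right order of magnitude.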

All in all, the main takeaway of the analysis of \(p=0\) vs. \(p=1\) is:

\(p=1\) has, in general, smaller bias than \(p=0\) (but of the same order) while keeping the same variance as \(p=0\).

An extended version of Theorem 4.1, given in Theorem 3.1 in Fan and Gijbels (1996), shows that this phenomenon extends to higher orders: **odd order** (\(p=2\nu+1\), \(\nu\in\mathbb{N}\)) polynomial fits introduce an extra coefficient for the polynomial fit that allows them to **reduce the bias**, while maintaining the **same variance** of the *precedent*^{142} even order (\(p=2\nu\)). So, for example, local cubic fits are preferred to local quadratic fits. This motivates the claim that *local polynomial fitting is an odd world* (Fan and Gijbels (1996)).

Finally, we have the asymptotic pointwise normality of the estimator, a result analogous to Theorem 2.2 that is helpful for obtaining pointwise confidence intervals for \(m\) afterwards.

**Theorem 4.2 **Assume that \(\mathbb{E}[(Y-m(x))^{2+\delta}\vert X=x]<\infty\) for some \(\delta>0\). Then, under **A1**–**A5**,

\[\begin{align} &\sqrt{nh}(\hat m(x;p,h)-\mathbb{E}[\hat m(x;p,h)|X_1,\ldots,X_n])\stackrel{d}{\longrightarrow}\mathcal{N}\left(0,\frac{R(K)\sigma^2(x)}{f(x)}\right),\tag{4.18}\\ &\sqrt{nh}\left(\hat m(x;p,h)-m(x)-B_p(x)h^2\right)\stackrel{d}{\longrightarrow}\mathcal{N}\left(0,\frac{R(K)\sigma^2(x)}{f(x)}\right).\tag{4.19} \end{align}\]
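Result (4.18) is what licenses plug-in pointwise confidence intervals. The sketch below (Python) computes the asymptotic standard error \(\sqrt{R(K)\hat\sigma^2(x)/(nh\hat f(x))}\) from given plug-in estimates; all the numeric inputs are hypothetical, and \(R(K)=1/(2\sqrt{\pi})\) for the Gaussian kernel:

```python
import numpy as np

def asymptotic_ci(m_hat_x, sigma2_hat_x, f_hat_x, n, h, R_K, z=1.959964):
    """95% plug-in CI from (4.18): m_hat(x) -+ z * sqrt(R(K) s2 / (n h f))."""
    se = np.sqrt(R_K * sigma2_hat_x / (n * h * f_hat_x))
    return m_hat_x - z * se, m_hat_x + z * se

R_K = 1 / (2 * np.sqrt(np.pi))  # R(K) for the Gaussian kernel
# Hypothetical plug-in values, for illustration only:
lo, hi = asymptotic_ci(m_hat_x=1.0, sigma2_hat_x=0.25, f_hat_x=0.5,
                       n=500, h=0.2, R_K=R_K)
```

Note that such an interval is centered at the conditional expectation of the estimator; re-centering it at \(m(x)\) via (4.19) requires the bias term \(B_p(x)h^2\) to be negligible, for example by undersmoothing.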

**Exercise 4.10 **Theorem 4.1 gives some additional insights with respect to \(B_p(x)\), the dominating term of the bias: \(B_0(x)=0\) if \(m\) is constant,^{143} and \(B_1(x)=0\) if \(m\) is linear.^{144}

That is, for each of these two cases, \(\mathrm{Bias}[\hat{m}(x;p,h)| X_1,\ldots,X_n]=o_\mathbb{P}(h^2)\). The local constant and local linear estimators are actually *exactly* unbiased when estimating constant and linear regression functions, respectively. That is, \(\mathbb{E}_c[\hat{m}(x;0,h)| X_1,\ldots,X_n]=c\) and \(\mathbb{E}_{a,b}[\hat{m}(x;1,h)| X_1,\ldots,X_n]=ax+b\), where \(\mathbb{E}_c[\cdot|X_1,\ldots,X_n]\) and \(\mathbb{E}_{a,b}[\cdot|X_1,\ldots,X_n]\) represent the conditional expectations under the constant and linear models, respectively. Prove these two results.
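Since \(\hat{m}(x;p,h)\) is linear in the responses, the exact-unbiasedness claims reduce to an algebraic reproduction property: with noiseless constant or linear data, the fits return the truth exactly. A quick numerical sanity check (Python; Gaussian kernel, illustrative constants):

```python
import numpy as np

rng = np.random.default_rng(3)

def local_poly(x, X, Y, h, p):
    """Local polynomial estimate at x: weighted least squares, Gaussian kernel."""
    w = np.sqrt(np.exp(-0.5 * ((X - x) / h) ** 2))
    return np.polyfit(X - x, Y, deg=p, w=w)[-1]

X = rng.uniform(0, 1, 200)
a, b, c, x0 = 2.0, -1.0, 0.7, 0.37

# Noiseless responses: the fits reproduce the truth exactly (up to rounding).
# By linearity of the estimator in Y, this reproduction property is equivalent
# to exact conditional unbiasedness under the constant and linear models.
m1 = local_poly(x0, X, a * X + b, h=0.1, p=1)         # equals a * x0 + b
m0 = local_poly(x0, X, np.full(200, c), h=0.1, p=0)   # equals c
```

This is a check, not a proof: the exercise asks for the argument via the conditional expectations \(\mathbb{E}_c\) and \(\mathbb{E}_{a,b}\).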

### References

Fan, J., and I. Gijbels. 1996. *Local Polynomial Modelling and Its Applications*. Vol. 66. Monographs on Statistics and Applied Probability. London: Chapman & Hall. https://doi.org/10.1201/9780203748725.

We do not address the analysis of the general case in which \(p>1\). The reader is referred to, for example, Theorem 3.1 in Fan and Gijbels (1996) for the full analysis.↩︎

In linear models, homoscedasticity is one of the key assumptions for performing inference (Section B.1.2).↩︎

Recall that these are the only assumptions made on the model so far. Compared with those made by linear models or generalized linear models, they are extremely mild. Recall that \(Y\) is not assumed to be continuous.↩︎

This assumption requires a certain smoothness of the regression function, thus allowing Taylor expansions to be performed. This assumption is important in practice: \(\hat{m}(\cdot;p,h)\) is infinitely differentiable if the considered kernels \(K\) are so too.↩︎

It avoids the situation in which \(Y\) is a degenerate random variable.↩︎

It avoids the degenerate situation in which \(m\) is estimated at regions without observations of the predictors (such as *holes* in the support of \(X\)).↩︎

Meaning that there exists a positive lower bound for \(f\).↩︎

Mild assumption inherited from the kde.↩︎

Key assumption for reducing the bias and variance of \(\hat{m}(\cdot;p,h)\) *simultaneously*.↩︎

Recall that this makes perfect sense: low-density regions of \(X\) imply less information available about \(m\).↩︎

The same happened in the linear model with the error variance \(\sigma^2\).↩︎

The variance of an unweighted mean is reduced by a factor \(n^{-1}\) when \(n\) observations are employed. To compute \(\hat{m}(x;p,h)\), \(n\) observations are used, but in a *weighted* fashion that roughly amounts to considering \(nh\) *unweighted* observations.↩︎

Since the variance increases as \(\nu\) does, not as \(p\) does.↩︎

\(m(x)=c\) for all \(x\in\mathbb{R}\) and given \(c\in\mathbb{R}\).↩︎

\(m(x)=ax+b\) for all \(x\in\mathbb{R}\) and given \(a,b\in\mathbb{R}\).↩︎