5.1 Kernel regression with mixed multivariate data
5.1.1 Multivariate kernel regression
We start by addressing the first generalization:
How to extend the local polynomial estimator \(\hat{m}(\cdot;q,h)\,\)156 to deal with \(p\) continuous predictors?
Although this can be done for \(q\geq0\), we focus on the local constant and linear estimators (\(q=0,1\)) to avoid excessive technical complications.157
The first step is to state the population object to be estimated, which is now the function \(m:\mathbb{R}^p\longrightarrow\mathbb{R}\) given by
\[\begin{align} m(\mathbf{x}):=\mathbb{E}[Y\vert \mathbf{X}=\mathbf{x}]=\int yf_{Y\vert \mathbf{X}=\mathbf{x}}(y)\,\mathrm{d}y,\tag{5.1} \end{align}\]
where \(\mathbf{X}\) denotes the random vector of the predictors, \(\mathbf{X}=(X_1,\ldots,X_p)'\). Therefore, despite having a form slightly different from (4.1) (the conditioning is now on the random vector \(\mathbf{X}\)), (4.1) and (5.1) are conceptually the same. As a consequence, the density-based view of \(m\) still holds:
\[\begin{align} m(\mathbf{x})=\frac{\int y f(\mathbf{x},y)\,\mathrm{d}y}{f_\mathbf{X}(\mathbf{x})},\tag{5.2} \end{align}\]
where \(f\) is the joint density of \((\mathbf{X},Y)\) and \(f_\mathbf{X}\) is the marginal pdf of \(\mathbf{X}\).
Therefore, given a sample \((\mathbf{X}_1,Y_1),\ldots,(\mathbf{X}_n,Y_n)\), we can estimate \(f\) and \(f_\mathbf{X}\), analogously to how we did in Section 4.1.1, by the kde’s158
\[\begin{align} \hat{f}(\mathbf{x},y;\mathbf{H},h)&=\frac{1}{n}\sum_{i=1}^nK_{\mathbf{H}}(\mathbf{x}-\mathbf{X}_{i})K_{h}(y-Y_{i}),\tag{5.3}\\ \hat{f}_\mathbf{X}(\mathbf{x};\mathbf{H})&=\frac{1}{n}\sum_{i=1}^nK_{\mathbf{H}}(\mathbf{x}-\mathbf{X}_{i}).\tag{5.4} \end{align}\]
Plugging these estimators in (5.2), we readily obtain the Nadaraya–Watson estimator for multivariate predictors:
\[\begin{align} \hat{m}(\mathbf{x};0,\mathbf{H}):=\sum_{i=1}^n\frac{K_\mathbf{H}(\mathbf{x}-\mathbf{X}_i)}{\sum_{j=1}^nK_\mathbf{H}(\mathbf{x}-\mathbf{X}_j)}Y_i=\sum_{i=1}^nW^0_{i}(\mathbf{x})Y_i, \tag{5.5} \end{align}\]
where
\[\begin{align*} W^0_{i}(\mathbf{x}):=\frac{K_\mathbf{H}(\mathbf{x}-\mathbf{X}_i)}{\sum_{j=1}^nK_\mathbf{H}(\mathbf{x}-\mathbf{X}_j)}. \end{align*}\]
To avoid a quick escalation of the number of smoothing bandwidths159, it is customary to consider product kernels for smoothing \(\mathbf{X}\), that is, to consider a diagonal bandwidth matrix \(\mathbf{H}=\mathrm{diag}(\mathbf{h}^2)=\mathrm{diag}(h_1^2,\ldots,h_p^2)\), which gives the estimator implemented by np:
\[\begin{align*} \hat{m}(\mathbf{x};0,\mathbf{h}):=\sum_{i=1}^nW^0_{i}(\mathbf{x})Y_i \end{align*}\]
where
\[\begin{align*} W^0_{i}(\mathbf{x})&=\frac{K_\mathbf{h}(\mathbf{x}-\mathbf{X}_i)}{\sum_{j=1}^nK_\mathbf{h}(\mathbf{x}-\mathbf{X}_j)},\\ K_\mathbf{h}(\mathbf{x}-\mathbf{X}_i)&=K_{h_1}(x_1-X_{i1})\times\stackrel{p}{\cdots}\times K_{h_p}(x_p-X_{ip}). \end{align*}\]
As in the univariate case, the Nadaraya–Watson estimate can be seen as a weighted average of \(Y_1,\ldots,Y_n\) by means of the weights \(\{W_i^0(\mathbf{x})\}_{i=1}^n\). Therefore, the Nadaraya–Watson estimator is a local mean of \(Y_1,\ldots,Y_n\) about \(\mathbf{X}=\mathbf{x}\).
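To fix ideas, the estimator with product kernels can be coded in a few lines. The following is a minimal sketch with normal kernels; the function name nw_multi and the toy sample are illustrative choices, not part of np.
# Minimal sketch of the Nadaraya-Watson estimator with a product of normal
# kernels; nw_multi is just an illustrative name
nw_multi <- function(x, X, Y, h) {
  # Product kernel K_h(x - X_i) for each row X_i of X
  K <- apply(X, 1, function(X_i) prod(dnorm(x - X_i, sd = h)))
  # Local weighted mean of the responses
  sum(K * Y) / sum(K)
}
# Toy check with p = 2 predictors and m(x1, x2) = x1^2 + x2
set.seed(42)
X_toy <- matrix(rnorm(2 * 100), nrow = 100, ncol = 2)
Y_toy <- X_toy[, 1]^2 + X_toy[, 2] + rnorm(100, sd = 0.25)
# Estimate of m(0, 0) = 0
nw_multi(x = c(0, 0), X = X_toy, Y = Y_toy, h = c(0.5, 0.5))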
The derivation of the local linear estimator involves slightly more complex arguments, which are nonetheless analogous to those employed when extending the linear model from univariate to multivariate predictors. Considering the Taylor expansion
\[\begin{align*} m(\mathbf{X}_i)\approx m(\mathbf{x})+\mathrm{D}m(\mathbf{x})'(\mathbf{X}_i-\mathbf{x}) \end{align*}\]
instead of (4.7), it is possible to arrive at the analogue160 of (4.10),
\[\begin{align*} \hat{\boldsymbol{\beta}}_\mathbf{h}:=\arg\min_{\boldsymbol{\beta}\in\mathbb{R}^{p+1}}\sum_{i=1}^n\left(Y_i-\boldsymbol{\beta}'(1,(\mathbf{X}_i-\mathbf{x})')'\right)^2K_\mathbf{h}(\mathbf{x}-\mathbf{X}_i), \end{align*}\]
and then solve the problem in the exact same way but now considering the design matrix161
\[\begin{align*} \mathbf{X}:=\begin{pmatrix} 1 & (\mathbf{X}_1-\mathbf{x})'\\ \vdots & \vdots\\ 1 & (\mathbf{X}_n-\mathbf{x})'\\ \end{pmatrix}_{n\times(p+1)} \end{align*}\]
and
\[\begin{align*} \mathbf{W}:=\mathrm{diag}(K_{\mathbf{h}}(\mathbf{X}_1-\mathbf{x}),\ldots, K_\mathbf{h}(\mathbf{X}_n-\mathbf{x})). \end{align*}\]
The estimate162 for \(m(\mathbf{x})\) is therefore obtained from the solution of the weighted least squares problem
\[\begin{align*} \hat{\boldsymbol{\beta}}_\mathbf{h}&=\arg\min_{\boldsymbol{\beta}\in\mathbb{R}^{p+1}} (\mathbf{Y}-\mathbf{X}\boldsymbol{\beta})'\mathbf{W}(\mathbf{Y}-\mathbf{X}\boldsymbol{\beta})\\ &=(\mathbf{X}'\mathbf{W}\mathbf{X})^{-1}\mathbf{X}'\mathbf{W}\mathbf{Y} \end{align*}\]
as
\[\begin{align} \hat{m}(\mathbf{x};1,\mathbf{h}):=&\,\hat{\beta}_{\mathbf{h},0}\nonumber\\ =&\,\mathbf{e}_1'(\mathbf{X}'\mathbf{W}\mathbf{X})^{-1}\mathbf{X}'\mathbf{W}\mathbf{Y}\tag{5.6}\\ =&\,\sum_{i=1}^nW^1_{i}(\mathbf{x})Y_i,\nonumber \end{align}\]
where
\[\begin{align*} W^1_{i}(\mathbf{x}):=\mathbf{e}_1'(\mathbf{X}'\mathbf{W}\mathbf{X})^{-1}\mathbf{X}'\mathbf{W}\mathbf{e}_i. \end{align*}\]
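Equation (5.6) can be evaluated directly by solving the weighted least squares problem above. The following is a minimal sketch for a single evaluation point, again with a product of normal kernels; loc_lin is an illustrative name and the sketch only covers part of Exercise 5.2 below.
# Minimal sketch of the local linear estimator (5.6) at a single point x;
# loc_lin is an illustrative name
loc_lin <- function(x, X, Y, h) {
  # Design matrix: intercept and centered predictors (X_i - x)'
  X_design <- cbind(1, sweep(X, 2, x, FUN = "-"))
  # Diagonal of the weight matrix W: product of normal kernels
  w <- apply(X, 1, function(X_i) prod(dnorm(x - X_i, sd = h)))
  # Weighted least squares solution; its first entry estimates m(x)
  beta_hat <- solve(crossprod(X_design, w * X_design),
                    crossprod(X_design, w * Y))
  beta_hat[1]
}
# E.g., with the toy sample of the previous sketch:
# loc_lin(x = c(0, 0), X = X_toy, Y = Y_toy, h = c(0.5, 0.5))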
Exercise 5.2 Implement a function to compute \(\hat{m}(\mathbf{x};1,\mathbf{h})\) for an arbitrary dimension \(p\). The function must receive as arguments the sample \((\mathbf{X}_1,Y_1),\ldots,(\mathbf{X}_n,Y_n)\), the bandwidth vector \(\mathbf{h}\), and the evaluation point \(\mathbf{x}\). Implement the function by computing the design matrix \(\mathbf{X}\), the weight matrix \(\mathbf{W}\), the vector of responses \(\mathbf{Y}\), and then applying (5.6). Test the implementation in the example below.
The following code implements the local regression estimators in the bivariate case and exemplifies how the use of np::npregbw and np::npreg is analogous in this case.
# Sample data from a bivariate regression
n <- 300
set.seed(123456)
X <- mvtnorm::rmvnorm(n = n, mean = c(0, 0),
sigma = matrix(c(2, 0.5, 0.5, 1.5), nrow = 2, ncol = 2))
m <- function(x) 0.5 * (x[, 1]^2 + x[, 2]^2)
epsilon <- rnorm(n)
Y <- m(x = X) + epsilon
# Plot sample and regression function
rgl::plot3d(x = X[, 1], y = X[, 2], z = Y, xlim = c(-3, 3), ylim = c(-3, 3),
zlim = c(-4, 10), xlab = "X1", ylab = "X2", zlab = "Y")
lx <- ly <- 50
x_grid <- seq(-3, 3, l = lx)
y_grid <- seq(-3, 3, l = ly)
xy_grid <- as.matrix(expand.grid(x_grid, y_grid))
rgl::surface3d(x = x_grid, y = y_grid,
z = matrix(m(xy_grid), nrow = lx, ncol = ly),
col = "lightblue", alpha = 1, lit = FALSE)
# Local constant fit
# An alternative for calling np::npregbw without formula
bw0 <- np::npregbw(xdat = X, ydat = Y, regtype = "lc")
kre0 <- np::npreg(bws = bw0, exdat = xy_grid) # Evaluation grid is now a matrix
rgl::surface3d(x = x_grid, y = y_grid,
z = matrix(kre0$mean, nrow = lx, ncol = ly),
col = "red", alpha = 0.25, lit = FALSE)
# Local linear fit
bw1 <- np::npregbw(xdat = X, ydat = Y, regtype = "ll")
kre1 <- np::npreg(bws = bw1, exdat = xy_grid)
rgl::surface3d(x = x_grid, y = y_grid,
z = matrix(kre1$mean, nrow = lx, ncol = ly),
col = "green", alpha = 0.25, lit = FALSE)
rgl::rglwidget()
Let’s see an application of multivariate kernel regression for the wine.csv dataset. The objective in this dataset is to explain and predict the quality of a vintage, measured as its Price, by means of predictors associated with the vintage.
# Load the wine dataset
wine <- read.table(file = "datasets/wine.csv", header = TRUE, sep = ",")
# Bandwidth by CV for local linear estimator -- a product kernel with
# 4 bandwidths
# Employs 4 random starts for minimizing the CV surface
bw_wine <- np::npregbw(formula = Price ~ Age + WinterRain + AGST +
HarvestRain, data = wine, regtype = "ll")
bw_wine
##
## Regression Data (27 observations, 4 variable(s)):
##
## Age WinterRain AGST HarvestRain
## Bandwidth(s): 3742889 612250027 0.8725243 106.5079
##
## Regression Type: Local-Linear
## Bandwidth Selection Method: Least Squares Cross-Validation
## Formula: Price ~ Age + WinterRain + AGST + HarvestRain
## Bandwidth Type: Fixed
## Objective Function Value: 0.0873955 (achieved on multistart 4)
##
## Continuous Kernel Type: Second-Order Gaussian
## No. Continuous Explanatory Vars.: 4
# Regression
fit_wine <- np::npreg(bw_wine)
summary(fit_wine)
##
## Regression Data: 27 training points, in 4 variable(s)
## Age WinterRain AGST HarvestRain
## Bandwidth(s): 3742889 612250027 0.8725243 106.5079
##
## Kernel Regression Estimator: Local-Linear
## Bandwidth Type: Fixed
## Residual standard error: 0.2079691
## R-squared: 0.8889649
##
## Continuous Kernel Type: Second-Order Gaussian
## No. Continuous Explanatory Vars.: 4
# Plot the "marginal effects of each predictor" on the response
plot(fit_wine)
# These marginal effects are the p profiles of the estimated regression surface
# \hat{m}(x_1, ..., x_p) that are obtained by fixing the predictors to each of
# their median values. For example, the profile for Age is the curve
# \hat{m}(x, median_WinterRain, median_AGST, median_HarvestRain). The medians
# are:
apply(wine, 2, median)
## Year Price WinterRain AGST HarvestRain Age FrancePop
## 1967.0000 6.9845 600.0000 16.4167 123.0000 16.0000 50650.4060
# Therefore, conditionally on the median values of the predictors:
# - Age is positively related to Price (almost linearly)
# - WinterRain is positively related to Price (with a subtle nonlinearity)
# - AGST is positively related to Price, but now we see what looks like a
#   quadratic pattern
# - HarvestRain is negatively related to Price (almost linearly)
The many options for the plot method for np::npreg are available at ?np::npplot. We illustrate some of them below, with a view to enhancing the interpretability of the marginal effects.
# The argument "xq" controls the conditioning quantile of the predictors, by
# default the median (xq = 0.5). But xq can be a vector of p quantiles, for
# example (0.25, 0.5, 0.25, 0.75) for (Age, WinterRain, AGST, HarvestRain)
plot(fit_wine, xq = c(0.25, 0.5, 0.25, 0.75))
# With plot.behavior = "data", the plot() function returns a list with the data
# for performing the plots
res <- plot(fit_wine, xq = 0.5, plot.behavior = "data")
str(res, 1)
## List of 4
## $ r1:List of 35
## ..- attr(*, "class")= chr "npregression"
## $ r2:List of 35
## ..- attr(*, "class")= chr "npregression"
## $ r3:List of 35
## ..- attr(*, "class")= chr "npregression"
## $ r4:List of 35
## ..- attr(*, "class")= chr "npregression"
# Plot the marginal effect of AGST ($r3) alone
head(res$r3$eval) # All predictors are fixed at their medians, except AGST (V3)
## V1 V2 V3 V4
## 1 16 600 14.98330 123
## 2 16 600 15.03772 123
## 3 16 600 15.09214 123
## 4 16 600 15.14657 123
## 5 16 600 15.20099 123
## 6 16 600 15.25541 123
plot(res$r3$eval$V3, res$r3$mean, type = "l", xlab = "AGST",
ylab = "Marginal effect")
# Plot the marginal effects of AGST for varying quantiles in the rest of
# predictors (all with the same quantile)
tau <- seq(0.1, 0.9, by = 0.1)
res <- plot(fit_wine, xq = tau[1], plot.behavior = "data")
col <- viridis::viridis(length(tau))
plot(res$r3$eval$V3, res$r3$mean, type = "l", xlab = "AGST",
ylab = "Marginal effect", col = col[1], ylim = c(6, 9),
main = "Marginal effects of AGST for varying quantiles in the predictors")
for (i in 2:length(tau)) {
res <- plot(fit_wine, xq = tau[i], plot.behavior = "data")
lines(res$r3$eval$V3, res$r3$mean, col = col[i])
}
legend("topleft", legend = latex2exp::TeX(paste0("$\\tau =", tau, "$")),
col = col, lwd = 2)
# These quantiles are
apply(wine[c("Price", "WinterRain", "HarvestRain")], 2, quantile, prob = tau)
## Price WinterRain HarvestRain
## 10% 6.25594 419.2 65.2
## 20% 6.37630 508.8 86.2
## 30% 6.61312 567.8 94.6
## 40% 6.80224 576.2 114.4
## 50% 6.98450 600.0 123.0
## 60% 7.19044 609.2 156.8
## 70% 7.32512 691.4 173.6
## 80% 7.56964 716.4 187.0
## 90% 7.86230 785.4 255.0
The summary of np::npreg returns an \(R^2\). This statistic is defined as
\[\begin{align*} R^2:=\frac{(\sum_{i=1}^n(Y_i-\bar{Y})(\hat{Y}_i-\bar{Y}))^2}{(\sum_{i=1}^n(Y_i-\bar{Y})^2)(\sum_{i=1}^n(\hat{Y}_i-\bar{Y})^2)} \end{align*}\]
and is neither the squared correlation coefficient between \(Y_1,\ldots,Y_n\) and \(\hat{Y}_1,\ldots,\hat{Y}_n\,\)163 nor “the percentage of variance explained” by the model – this interpretation only makes sense within the linear model context. It is, however, a quantity in \([0,1]\) that attains \(R^2=1\) when the fit is perfect (zero variability about the curve), so it gives an idea of how explanatory the estimated regression function is.
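This definition can be checked numerically for the wine fit. The snippet below is a sketch that assumes the $mean component of fit_wine contains the fitted values at the training sample, analogously to the $mean of the evaluation objects used above.
# Recompute the R^2 reported by summary(fit_wine) from its definition,
# assuming fit_wine$mean contains the fitted values at the training sample
Y_wine <- wine$Price
Y_hat <- fit_wine$mean
sum((Y_wine - mean(Y_wine)) * (Y_hat - mean(Y_wine)))^2 /
  (sum((Y_wine - mean(Y_wine))^2) * sum((Y_hat - mean(Y_wine))^2))
# Not exactly the squared correlation between responses and fitted values,
# since the mean of Y_hat does not need to coincide with that of Y_wine
cor(Y_wine, Y_hat)^2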
5.1.2 Kernel regression with mixed data
Non-continuous predictors can also be taken into account in nonparametric regression. The key to do so is to define a suitable kernel for any random variable \(X\), not just a continuous one. Therefore, we need to find a positive function that is a pdf164 on the support of \(X\) and that assigns more weight to observations of the random variable that are close to a given point. If such a kernel is adequately defined, then it can be readily employed in the weights of the weighted local means that form \(\hat{m}(\cdot;q,\mathbf{h})\). We analyze next the two main possibilities for non-continuous variables.
Categorical or unordered discrete variables. For example, iris$Species and Auto$origin are categorical variables in which ordering does not make any sense. Categorical variables are specified in base R by factor. Due to the lack of ordering, the basic mathematical operation behind a kernel, a distance computation165, is senseless. This motivates the Aitchison and Aitken (1976) kernel. Assume that the categorical random variable \(X_d\) has \(u_d\) different levels, which can be represented as \(U_d:=\{0,1,\ldots,u_d-1\}\). For \(x_d,X_d\in U_d\), the Aitchison and Aitken (1976) unordered discrete kernel is
\[\begin{align*} l_u(x_d,X_d;\lambda):=\begin{cases} 1-\lambda,&\text{ if }x_d=X_d,\\ \frac{\lambda}{u_d-1},&\text{ if }x_d\neq X_d, \end{cases} \end{align*}\]
where \(\lambda\in[0,(u_d-1)/u_d]\) is the bandwidth. Observe that this kernel is constant if \(x_d\neq X_d\): since the levels of the variable are unordered, there is no sense of proximity between them.
Ordinal or ordered discrete variables. For example, wine$Year and Auto$cylinders are discrete variables with clear orderings, but they are not continuous. These variables are specified by ordered (an ordered factor in base R). Despite the existence of an ordering, the possible distances between the observations of these variables are discrete. Assume that the ordered discrete random variable \(X_d\) takes values in a set \(O_d\). For \(x_d,X_d\in O_d\), a possible (Li and Racine 2007) ordered discrete kernel is166
\[\begin{align*} l_o(x_d,X_d;\eta):=\eta^{|x_d-X_d|}, \end{align*}\]
where \(\eta\in[0,1]\) is the bandwidth.
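Both discrete kernels are straightforward to code. The following is a minimal sketch, with l_unord and l_ord as illustrative names; as mentioned below, np employs a variation of these kernels.
# Minimal sketches of the unordered and ordered discrete kernels; l_unord and
# l_ord are illustrative names, not functions of np
l_unord <- function(x_d, X_d, lambda, u_d) {
  # Aitchison-Aitken kernel: 1 - lambda if the levels coincide,
  # lambda / (u_d - 1) otherwise
  ifelse(x_d == X_d, 1 - lambda, lambda / (u_d - 1))
}
l_ord <- function(x_d, X_d, eta) {
  # Li-Racine ordered kernel: geometric decay in the distance between levels
  eta^abs(x_d - X_d)
}
# The unordered kernel sums to one over U_d = {0, ..., u_d - 1}
sum(l_unord(x_d = 0:2, X_d = 1, lambda = 0.25, u_d = 3))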
Exercise 5.3 Show that, for any \(X_d\in U_d\) and any \(\lambda\in[0,(u_d-1)/u_d]\), the kernel \(x\mapsto l_u(x,X_d;\lambda)\) “integrates” to one over \(U_d\), that is, \(\sum_{x\in U_d}l_u(x,X_d;\lambda)=1\).
Once we have defined suitable kernels for ordered and unordered discrete variables, we can define the Nadaraya–Watson estimator for mixed multivariate data. Assume that, among the \(p=p_c+p_u+p_o\) predictors, the first \(p_c\) are continuous, the next \(p_u\) are discrete unordered (or categorical), and the last \(p_o\) are discrete ordered (or ordinal). The estimator is then
\[\begin{align} \hat{m}(\mathbf{x};0,(\mathbf{h}_c,\boldsymbol{\lambda}_u,\boldsymbol{\eta}_o)):=\sum_{i=1}^nW^0_{i}(\mathbf{x})Y_i,\tag{5.7} \end{align}\]
where \(\mathbf{h}_c=(h_1,\ldots,h_{p_c})'\), \(\boldsymbol{\lambda}_u=(\lambda_1,\ldots,\lambda_{p_u})'\), \(\boldsymbol{\eta}_o=(\eta_1,\ldots,\eta_{p_o})'\), and
\[\begin{align} W^0_{i}(\mathbf{x})=&\,\frac{L_\Pi(\mathbf{x}-\mathbf{X}_i)}{\sum_{j=1}^nL_\Pi(\mathbf{x}-\mathbf{X}_j)},\tag{5.8}\\ L_\Pi(\mathbf{x}-\mathbf{X}_i):=&\,\prod_{j=1}^{p_c}K_{h_j}(x_j-X_{ij})\prod_{k=1}^{p_u}l_u(x_k,X_{ik};\lambda_k)\prod_{\ell=1}^{p_o}l_o(x_\ell,X_{i\ell};\eta_\ell).\nonumber \end{align}\]
The adaptation of the local linear estimator is conceptually similar, but more cumbersome since, for example, linear approximations based on an unordered predictor do not make any sense. Therefore, it is skipped in these notes.
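To illustrate (5.7) and (5.8), the following minimal sketch considers the case of one continuous, one unordered, and one ordered predictor. It reuses the l_unord and l_ord sketches above and a normal kernel for the continuous predictor; nw_mixed is just an illustrative name and the discrete levels are assumed to be coded numerically.
# Minimal sketch of the mixed-data Nadaraya-Watson estimator (5.7)-(5.8) with
# one continuous (X_c), one unordered (X_u), and one ordered (X_o) predictor;
# x = c(continuous value, unordered level, ordered level), with the levels of
# X_u and X_o coded numerically
nw_mixed <- function(x, X_c, X_u, X_o, Y, h, lambda, u_d, eta) {
  # Product kernel L_Pi(x - X_i) for each observation
  L <- dnorm(x[1] - X_c, sd = h) *
    l_unord(x_d = x[2], X_d = X_u, lambda = lambda, u_d = u_d) *
    l_ord(x_d = x[3], X_d = X_o, eta = eta)
  # Local weighted mean of the responses
  sum(L * Y) / sum(L)
}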
The np package employs a variation of the previous kernels and implements the local constant and linear estimators for mixed multivariate data.
# Bandwidth by CV for local linear estimator
# Recall that Species is a factor!
bw_iris <- np::npregbw(formula = Petal.Length ~ Sepal.Width + Species,
data = iris)
bw_iris
##
## Regression Data (150 observations, 2 variable(s)):
##
## Sepal.Width Species
## Bandwidth(s): 0.1900717 1.093558e-08
##
## Regression Type: Local-Constant
## Bandwidth Selection Method: Least Squares Cross-Validation
## Formula: Petal.Length ~ Sepal.Width + Species
## Bandwidth Type: Fixed
## Objective Function Value: 0.1564197 (achieved on multistart 2)
##
## Continuous Kernel Type: Second-Order Gaussian
## No. Continuous Explanatory Vars.: 1
##
## Unordered Categorical Kernel Type: Aitchison and Aitken
## No. Unordered Categorical Explanatory Vars.: 1
# Product kernel with 2 bandwidths
# Regression
fit_iris <- np::npreg(bw_iris)
summary(fit_iris)
##
## Regression Data: 150 training points, in 2 variable(s)
## Sepal.Width Species
## Bandwidth(s): 0.1900717 1.093558e-08
##
## Kernel Regression Estimator: Local-Constant
## Bandwidth Type: Fixed
## Residual standard error: 0.3715505
## R-squared: 0.9554132
##
## Continuous Kernel Type: Second-Order Gaussian
## No. Continuous Explanatory Vars.: 1
##
## Unordered Categorical Kernel Type: Aitchison and Aitken
## No. Unordered Categorical Explanatory Vars.: 1
# Plot marginal effects (for quantile 0.5) of each predictor on the response
par(mfrow = c(1, 2))
plot(fit_iris, plot.par.mfrow = FALSE)
# Options for the plot method for np::npreg available at ?np::npplot
# Plot marginal effects (for quantile 0.9) of each predictor on the response
par(mfrow = c(1, 2))
plot(fit_iris, xq = 0.9, plot.par.mfrow = FALSE)
The following is an example from ?np::npreg: the modeling of the GDP growth of a country from economic indicators. The predictors contain a mix of unordered, ordered, and continuous variables.
# Load data
data(oecdpanel, package = "np")
# Bandwidth by CV for local constant -- use only two starts to reduce the
# computation time
bw_OECD <- np::npregbw(formula = growth ~ factor(oecd) + ordered(year) +
initgdp + popgro + inv + humancap, data = oecdpanel,
regtype = "lc", nmulti = 2)
bw_OECD
##
## Regression Data (616 observations, 6 variable(s)):
##
## factor(oecd) ordered(year) initgdp popgro inv humancap
## Bandwidth(s): 0.01206855 0.8794929 0.3329306 0.05972217 0.159463 0.9408607
##
## Regression Type: Local-Constant
## Bandwidth Selection Method: Least Squares Cross-Validation
## Formula: growth ~ factor(oecd) + ordered(year) + initgdp + popgro + inv +
## humancap
## Bandwidth Type: Fixed
## Objective Function Value: 0.0006358702 (achieved on multistart 2)
##
## Continuous Kernel Type: Second-Order Gaussian
## No. Continuous Explanatory Vars.: 4
##
## Unordered Categorical Kernel Type: Aitchison and Aitken
## No. Unordered Categorical Explanatory Vars.: 1
##
## Ordered Categorical Kernel Type: Li and Racine
## No. Ordered Categorical Explanatory Vars.: 1
# Regression
fit_OECD <- np::npreg(bw_OECD)
summary(fit_OECD)
##
## Regression Data: 616 training points, in 6 variable(s)
## factor(oecd) ordered(year) initgdp popgro inv humancap
## Bandwidth(s): 0.01206855 0.8794929 0.3329306 0.05972217 0.159463 0.9408607
##
## Kernel Regression Estimator: Local-Constant
## Bandwidth Type: Fixed
## Residual standard error: 0.01736965
## R-squared: 0.7143626
##
## Continuous Kernel Type: Second-Order Gaussian
## No. Continuous Explanatory Vars.: 4
##
## Unordered Categorical Kernel Type: Aitchison and Aitken
## No. Unordered Categorical Explanatory Vars.: 1
##
## Ordered Categorical Kernel Type: Li and Racine
## No. Ordered Categorical Explanatory Vars.: 1
# Plot marginal effects of each predictor on the response
par(mfrow = c(2, 3))
plot(fit_OECD, plot.par.mfrow = FALSE)
Exercise 5.4 Obtain the local constant estimator of the regression of mpg on cylinders (ordered discrete), horsepower, weight, and origin in the data(Auto, package = "ISLR") dataset. Summarize the model and plot the marginal effects of each predictor on the response (for the quantile \(0.5\)).
References
Aitchison, J., and C. G. G. Aitken. 1976. “Multivariate Binary Discrimination by the Kernel Method.” Biometrika 63 (3): 413–20. https://doi.org/10.1093/biomet/63.3.413.
Li, Q., and J. S. Racine. 2007. Nonparametric Econometrics. Princeton: Princeton University Press. https://press.princeton.edu/books/hardcover/9780691121611/nonparametric-econometrics.
Here \(q\) denotes the order of the polynomial fit, since \(p\) stands for the number of predictors.↩︎
In particular, the consideration of Taylor expansions of \(m:\mathbb{R}^p\longrightarrow\mathbb{R}\) of order larger than one, which involve the vector of partial derivatives \(\mathrm{D}^{\otimes s}m(\mathbf{x})\) formed by \(\frac{\partial^s m(\mathbf{x})}{\partial x_1^{s_1}\cdots \partial x_p^{s_p}}\), where \(s=s_1+\cdots+s_p\); see Section 3.1.↩︎
The fact that the kde for \((\mathbf{X},Y)\) has a product kernel between the variables \(\mathbf{X}\) and \(Y\) is not fundamental for constructing the Nadaraya–Watson estimator, but it simplifies the computation of the integral in the denominator of (5.2). The fact that the bandwidth matrices for estimating the \(\mathbf{X}\)-part of \(f\) and the marginal \(f_\mathbf{X}\) are equal is fundamental to ensure an estimator for \(m\) that is a weighted average (with weights adding up to one) of the responses \(Y_1,\ldots,Y_n\).↩︎
Which increases at the order \(O(p^2)\).↩︎
Observe that \((X_i-x)^j\) are now replaced with \((X_{ij}-x_j)\), since we only consider a linear fit and now the predictors have \(p\) components.↩︎
Observe that, in an abuse of notation, we denote both the random vector \((X_1,\ldots,X_p)\) and the design matrix by \(\mathbf{X}\). However, the specific meaning of \(\mathbf{X}\) should be clear from the context.↩︎
Recall that now the entries of \(\hat{\boldsymbol{\beta}}_\mathbf{h}\) are estimating \(\boldsymbol{\beta}=\begin{pmatrix}m(\mathbf{x})\\\mathrm{D} m(\mathbf{x})\end{pmatrix}\). So, \(\hat{\boldsymbol{\beta}}_\mathbf{h}\) gives also an estimate of the gradient \(\mathrm{D} m\) evaluated at \(\mathbf{x}\).↩︎
Because it is not guaranteed that \(\bar{Y}=\bar{\hat{Y}}\), which is what is required to turn \(R^2\) into that squared correlation coefficient.↩︎
Although in practice it is not required for the kernel to integrate to one for performing regression, as the constants of the kernels cancel out in the quotients in \(\hat{m}(\cdot;q,\mathbf{h})\).↩︎
Recall \(K_h(x-X_i)\).↩︎
Notice that this kernel does not integrate to one! To normalize it, we should take the specific support of \(X_d\) into account, which, e.g., could be \(\{0, 1, 2, 3\}\) or \(\mathbb{N}\). This is slightly cumbersome, so we avoid it because there is no difference between normalized and unnormalized kernels for regression: the kernel normalizing constants cancel out in the computation of the weights (5.8).↩︎