3.3 Asymptotic properties

The asymptotic results for the multivariate kde are very similar to those of the univariate kde, only with increased notational complexity. Fortunately, the \(\mathrm{vec}\) operator, (3.5), and Theorem 3.1 allow for simplifying the expressions and yield a clear connection with, for example, the expressions for the asymptotic bias and variance obtained in Theorem 2.1. As before, the insights obtained from these expressions will be highly valuable for selecting \(\mathbf{H}\) optimally in practice by means of the derivation of the MISE and AMISE errors.

We need to make the following assumptions:

  • A1.60 The density \(f\) is square integrable, twice continuously differentiable, and all the second order partial derivatives are square integrable.
  • A2.61 The kernel \(K\) is a spherically symmetric62, bounded, and square-integrable pdf with a finite second moment.
  • A3.63 \(\mathbf{H}=\mathbf{H}_n\) is a deterministic sequence of positive definite symmetric matrices such that, when \(n\to\infty,\) \(\mathrm{vec}\,\mathbf{H}\to\mathbf{0}\) and \(n|\mathbf{H}|^{1/2}\to\infty;\) see the example right after this list.

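For instance (the specific rate below is only an illustrative choice, not something required by A3), A3 holds for \(\mathbf{H}=h_n^2\mathbf{I}_p\) with \(h_n=cn^{-1/(p+4)}\) and \(c>0,\) since

\[\mathrm{vec}\,\mathbf{H}=c^2n^{-2/(p+4)}\,\mathrm{vec}\,\mathbf{I}_p\to\mathbf{0}\quad\text{and}\quad n|\mathbf{H}|^{1/2}=nh_n^p=c^pn^{4/(p+4)}\to\infty.\]
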
The convolution between two functions \(f,g:\mathbb{R}^p\longrightarrow\mathbb{R}\) is defined analogously to the univariate case as \((f*g)(\mathbf{x}):=\int f(\mathbf{x}-\mathbf{y})g(\mathbf{y})\,\mathrm{d}\mathbf{y}.\) Thus, we readily obtain that

\[\begin{align} \mathbb{E}[\hat{f}(\mathbf{x};\mathbf{H})]&=\int K_\mathbf{H}(\mathbf{x}-\mathbf{y})f(\mathbf{y})\,\mathrm{d}\mathbf{y}\nonumber\\ &=(K_\mathbf{H} * f)(\mathbf{x}),\tag{3.11}\\ \mathbb{V}\mathrm{ar}[\hat{f}(\mathbf{x};\mathbf{H})]&=\frac{1}{n}((K_\mathbf{H}^2*f)(\mathbf{x})-(K_\mathbf{H}*f)^2(\mathbf{x})).\nonumber \end{align}\]

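As a quick numerical sanity check of (3.11), the following sketch (in R; the Gaussian kernel, the bivariate standard normal target, and the mvtnorm package are convenience assumptions made here, since then \((K_\mathbf{H}*f)(\mathbf{x})\) is the \(\mathcal{N}_2(\mathbf{0},\mathbf{I}_2+\mathbf{H})\) density and has a closed form) approximates \(\mathbb{E}[\hat{f}(\mathbf{x};\mathbf{H})]\) by Monte Carlo and compares it with the exact convolution.

```r
# Monte Carlo check of (3.11) for a Gaussian kernel and f the N(0, I_2) density.
# In this Gaussian case, (K_H * f)(x) is the N(0, I_2 + H) density at x.
library(mvtnorm)

# Gaussian-kernel kde evaluated at a single point x
kde_at <- function(x, data, H) mean(dmvnorm(sweep(data, 2, x), sigma = H))

set.seed(42)
x <- c(0, 1); H <- diag(0.5^2, 2); n <- 100

# Average the kde at x over many independent samples of size n
mc <- replicate(1e4, kde_at(x, rmvnorm(n, sigma = diag(2)), H))
mean(mc)                          # Monte Carlo approximation to E[f_hat(x; H)]
dmvnorm(x, sigma = diag(2) + H)   # (K_H * f)(x), exact in this Gaussian case
```

Both quantities should agree up to Monte Carlo error, illustrating that (3.11) is exact for any \(n\) and \(\mathbf{H}.\)
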
Again, although these two expressions are exact, they are hard to interpret. The only immediate insight they give is that, by equation (3.11), the kde is biased. Neither expression separates the effects of the kernel, the bandwidth, and the density, which is why asymptotic expressions are preferred. In what follows, we denote \(R(g)=\int g(\mathbf{z})^2\,\mathrm{d}\mathbf{z}\) for any function \(g:\mathbb{R}^p\longrightarrow\mathbb{R}.\)
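
For instance, if \(K\) is the standard normal density in \(\mathbb{R}^p\) (an illustrative choice), a direct computation gives

\[R(K)=\int K(\mathbf{z})^2\,\mathrm{d}\mathbf{z}=(2\pi)^{-p}\int e^{-\|\mathbf{z}\|^2}\,\mathrm{d}\mathbf{z}=(2\pi)^{-p}\pi^{p/2}=(4\pi)^{-p/2}.\]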

Theorem 3.2 Under A1–A3, the bias and variance of the kde at \(\mathbf{x}\) are

\[\begin{align} \mathrm{Bias}[\hat{f}(\mathbf{x};\mathbf{H})]&=\frac{1}{2}\mu_2(K)(\mathrm{D}^{\otimes2}f(\mathbf{x}))'\mathrm{vec}\,\mathbf{H}+o(\|\mathrm{vec}\,\mathbf{H}\|),\tag{3.12}\\ \mathbb{V}\mathrm{ar}[\hat{f}(\mathbf{x};\mathbf{H})]&=\frac{R(K)}{n|\mathbf{H}|^{1/2}}f(\mathbf{x})+o((n|\mathbf{H}|^{1/2})^{-1}).\tag{3.13} \end{align}\]

Proof. The proof follows the lines of the proof of Theorem 2.1 and we only provide a sketch. First, consider the change of variables \(\mathbf{z}=\mathbf{H}^{-1/2}(\mathbf{x}-\mathbf{y}),\) that is, \(\mathbf{y}=\mathbf{x}-\mathbf{H}^{1/2}\mathbf{z},\) with \(\mathrm{d}\mathbf{y}=|\mathbf{H}|^{1/2}\,\mathrm{d}\mathbf{z}\) (the Jacobian in absolute value). Then we have

\[\begin{align*} \mathbb{E}[\hat{f}(\mathbf{x};\mathbf{H})]&=\int K_\mathbf{H}(\mathbf{x}-\mathbf{y})f(\mathbf{y})\,\mathrm{d}\mathbf{y}\\ &=\int K(\mathbf{z})f(\mathbf{x}-\mathbf{H}^{1/2}\mathbf{z})\,\mathrm{d}\mathbf{z}. \end{align*}\]

Since \(\mathbf{H}\to\mathbf{0},\) we can apply (3.6) to \(f(\mathbf{x}-\mathbf{H}^{1/2}\mathbf{z})\) and then use the properties of the kernel to arrive at (3.12). Then, (3.13) is obtained by adapting the steps of the bias and replicating the arguments in the proof of Theorem 2.1.

Exercise 3.7 Detail, elaborate, and conclude the proof above.

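Before extracting insights from them, (3.12) can be illustrated numerically. The sketch below (in R; the bivariate standard normal density, the Gaussian kernel, the evaluation point, and the bandwidths are illustrative assumptions) compares the exact bias, available in closed form in this Gaussian case, with the leading term of (3.12).

```r
# Check of the asymptotic bias (3.12) for f the N(0, I_2) density and a Gaussian
# kernel, for which E[f_hat(x; H)] is the N(0, I_2 + H) density at x
library(mvtnorm)

x <- c(0.5, 1)                  # evaluation point
H <- function(h) diag(h^2, 2)   # simple bandwidths H = h^2 * I_2

# Exact bias: E[f_hat(x; H)] - f(x)
exact_bias <- function(h) dmvnorm(x, sigma = diag(2) + H(h)) - dmvnorm(x)

# Leading term of (3.12): 0.5 * mu_2(K) * tr(Hf(x) %*% H), with mu_2(K) = 1 and
# Hessian Hf(x) = f(x) * (x x' - I_2) for the standard bivariate normal
asymp_bias <- function(h)
  0.5 * dmvnorm(x) * sum(diag((tcrossprod(x) - diag(2)) %*% H(h)))

h <- c(0.5, 0.25, 0.1, 0.05)
cbind(h = h, exact = sapply(h, exact_bias), asymptotic = sapply(h, asymp_bias))
# The two columns agree more closely (their ratio tends to one) as h decreases
```
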
The bias and variance expressions (3.12) and (3.13) give important insights:

  • The bias decreases with \(\mathbf{H}.\)64 Observing that \((\mathrm{D}^{\otimes2}f(\mathbf{x}))'\mathrm{vec}\,\mathbf{H}=\mathrm{tr}((\mathrm{H}f(\mathbf{x}))'\mathbf{H}),\)65 several interesting interpretations follow:

    • The bias is negative whenever \(\mathrm{H}f(\mathbf{x})\) is negative definite.66 These regions correspond to the modes (or local maxima) of \(f,\) where the kde underestimates \(f\) (it tends to be below \(f\)).
    • Conversely, the bias is positive whenever \(\mathrm{H}f(\mathbf{x})\) is positive definite, which happens at the antimodes (or local minima) of \(f,\) where the kde overestimates \(f\) (it tends to be above \(f\)); see the numerical illustration right after this list.
    • The wilder the curvature \(\mathrm{D}^{\otimes 2}f,\) the harder it is to estimate \(f.\) Flat density regions are easier to estimate than wiggly regions with high curvature (e.g., those with several modes).
  • The variance depends directly on \(f(\mathbf{x}).\) The higher the density, the more variable the kde is. The variance decreases at the rate \((n|\mathbf{H}|^{1/2})^{-1},\) a consequence of \(n|\mathbf{H}|^{1/2}\) playing the role of the effective sample size for estimating \(f(\mathbf{x}).\)

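The sign of the bias at modes and antimodes can also be visualized numerically. The sketch below (in R; the four-component Gaussian mixture, which has an antimode at the origin, and the bandwidth are illustrative assumptions) exploits that, for a Gaussian kernel, \(\mathbb{E}[\hat{f}(\mathbf{x};\mathbf{H})]=(K_\mathbf{H}*f)(\mathbf{x})\) is the same mixture with covariances inflated by \(\mathbf{H},\) so the exact bias is directly computable.

```r
# Sign of the bias at modes vs. antimodes for an equally-weighted mixture of four
# N(mu_j, I_2) components with means (+-2, 0) and (0, +-2). For a Gaussian kernel,
# E[f_hat(x; H)] = (K_H * f)(x) is the same mixture with covariances I_2 + H.
library(mvtnorm)

mu <- rbind(c(2, 0), c(-2, 0), c(0, 2), c(0, -2))   # component means
mixt <- function(x, S)                              # mixture density at x
  mean(apply(mu, 1, function(m) dmvnorm(x, mean = m, sigma = S)))

H <- diag(0.5^2, 2)
bias <- function(x) mixt(x, S = diag(2) + H) - mixt(x, S = diag(2))

bias(c(2, 0))   # < 0: near a mode, the kde underestimates f on average
bias(c(0, 0))   # > 0: at the antimode at the origin, it overestimates f
```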

  60. This assumption requires certain smoothness of \(f,\) thus allowing Theorem 3.1 to be applied.

  61. Mild assumption that makes the first term of the Taylor expansion of \(f\) negligible and the second bounded.

  62. This is the extension of the symmetry requirement for a univariate kernel to \(\mathbb{R}^p.\) The spherical symmetry of \(K\) implies that \(\int\mathbf{z}K(\mathbf{z})\,\mathrm{d}\mathbf{z}=\mathbf{0}\) and that \(\int\mathbf{z}\mathbf{z}'K(\mathbf{z})\,\mathrm{d}\mathbf{z}=\mu_2(K)\mathbf{I}_p\) (the covariances are zero), where \(\mu_2(K):=\int z_j^2K(\mathbf{z})\,\mathrm{d}\mathbf{z}=\int z_k^2K(\mathbf{z})\,\mathrm{d}\mathbf{z}\) for all \(j,k=1,\ldots,p.\) Equivalently, \(\int\mathbf{z}^{\otimes2}K(\mathbf{z})\,\mathrm{d}\mathbf{z}=\mu_2(K)\mathrm{vec}\,\mathbf{I}_p.\)

  63. The key assumption for reducing the bias and variance of \(\hat{f}(\cdot;\mathbf{H})\) simultaneously.

  64. If \(\mathbf{H}=\mathrm{diag}(h_1^2,\ldots,h_p^2),\) the reduction is clearly seen to be quadratic in the marginal bandwidths.

  65. By (3.4), \(\mathrm{D}^{\otimes2}f(\mathbf{x})=\mathrm{vec}(\mathrm{H}f(\mathbf{x})).\) On the other hand, \((\mathrm{vec}\,\mathbf{A})'\mathrm{vec}\,\mathbf{B}=\mathrm{tr}(\mathbf{A}'\mathbf{B})\) for any matrices \(\mathbf{A}\) and \(\mathbf{B}.\)

  66. For any two symmetric matrices \(\mathbf{A}\) and \(\mathbf{B}\) of size \(p\times p\) with sorted eigenvalues \(\alpha_1,\ldots,\alpha_p\) and \(\beta_1,\ldots,\beta_p,\) respectively, it holds that \(\sum_{i=1}^p\alpha_i\beta_{p-i+1}\leq\mathrm{tr}(\mathbf{A}\mathbf{B})\leq\sum_{i=1}^p\alpha_i\beta_i.\) Taking \(\mathbf{B}=\mathbf{H},\) we know that its eigenvalues are positive because of the positive definiteness of \(\mathbf{H}.\) If \(\mathbf{A}=(\mathrm{H}f(\mathbf{x}))'\) is negative definite, then all its eigenvalues are negative and, as a consequence, \(\mathrm{tr}(\mathbf{A}\mathbf{B})<0.\)