6.1 Inference

Since \(Y_i = f(\mathbf{x}_i, \theta) + \epsilon_i\), where \(\epsilon_i \sim \text{iid}(0, \sigma^2)\), we can estimate parameters (\(\hat{\theta}\)) by minimizing the sum of squared errors:

\[ \sum_{i=1}^{n} \big(Y_i - f(\mathbf{x}_i, \theta)\big)^2 \]

Let \(\hat{\theta}\) denote the minimizer. The error variance is then estimated by the residual mean square:

\[ s^2 = \hat{\sigma}^2_{\epsilon} = \frac{\sum_{i=1}^{n} \big(Y_i - f(\mathbf{x}_i, \hat{\theta})\big)^2}{n - p} \]

where \(p\) is the number of parameters in \(\mathbf{\theta}\), and \(n\) is the number of observations.
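
To make the estimation step concrete, here is a minimal Python sketch (the exponential-decay mean function \(f(x, \theta) = \theta_1 e^{-\theta_2 x}\), the simulated data, and the use of scipy.optimize.least_squares are illustrative assumptions, not taken from the text):

import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(0)
theta_true = np.array([5.0, 0.8])                 # "true" (theta_1, theta_2) for the simulation
x = np.linspace(0.1, 5.0, 30)
y = theta_true[0] * np.exp(-theta_true[1] * x) + rng.normal(0.0, 0.2, x.size)

def f(x_vals, theta):
    # mean function f(x, theta) = theta_1 * exp(-theta_2 * x)  (illustrative choice)
    return theta[0] * np.exp(-theta[1] * x_vals)

def resid(theta):
    # residuals Y_i - f(x_i, theta); least_squares minimizes their sum of squares
    return y - f(x, theta)

fit = least_squares(resid, x0=np.array([1.0, 1.0]))
theta_hat = fit.x
n, p = x.size, theta_hat.size
s2 = np.sum(resid(theta_hat) ** 2) / (n - p)      # s^2 = SSE / (n - p)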


Asymptotic Distribution of \(\hat{\theta}\)

Under regularity conditions—most notably that \(\epsilon_i \sim N(0, \sigma^2)\) or that \(n\) is sufficiently large for a central-limit-type argument—the parameter estimates \(\hat{\theta}\) have the following asymptotic normal distribution:

\[ \hat{\theta} \sim AN(\mathbf{\theta}, \sigma^2[\mathbf{F}(\theta)'\mathbf{F}(\theta)]^{-1}) \]

where

  • \(AN\) stands for “asymptotic normality.”
  • \(\mathbf{F}(\theta)\) is the \(n \times p\) Jacobian matrix of partial derivatives of \(f(\mathbf{x}_i, \boldsymbol{\theta})\) with respect to \(\boldsymbol{\theta}\); in practice it is evaluated at \(\hat{\theta}\) when estimating the covariance. Specifically,

\[ \mathbf{F}(\theta) = \begin{pmatrix} \frac{\partial f(\mathbf{x}_1, \boldsymbol{\theta})}{\partial \theta_1} & \cdots & \frac{\partial f(\mathbf{x}_1, \boldsymbol{\theta})}{\partial \theta_p} \\ \vdots & \ddots & \vdots \\ \frac{\partial f(\mathbf{x}_n, \boldsymbol{\theta})}{\partial \theta_1} & \cdots & \frac{\partial f(\mathbf{x}_n, \boldsymbol{\theta})}{\partial \theta_p} \end{pmatrix} \]

Asymptotic normality means that as the sample size \(n\) becomes large, the sampling distribution of \(\hat{\theta}\) approaches a normal distribution, which enables inference on the parameters.
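
Continuing the sketch above, the Jacobian \(\mathbf{F}(\hat{\theta})\) can be approximated by finite differences (an illustrative choice; analytic derivatives work equally well), giving the estimated asymptotic covariance \(s^2[\mathbf{F}(\hat{\theta})'\mathbf{F}(\hat{\theta})]^{-1}\):

def jacobian(theta, x_vals, eps=1e-6):
    # n x p matrix of partial derivatives d f(x_i, theta) / d theta_j (forward differences)
    F = np.empty((x_vals.size, theta.size))
    base = f(x_vals, theta)
    for j in range(theta.size):
        step = np.zeros(theta.size)
        step[j] = eps
        F[:, j] = (f(x_vals, theta + step) - base) / eps
    return F

F_hat = jacobian(theta_hat, x)
cov_hat = s2 * np.linalg.inv(F_hat.T @ F_hat)     # estimated Var(theta_hat)
se_hat = np.sqrt(np.diag(cov_hat))                # standard errors of theta_hat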

6.1.1 Linear Functions of the Parameters

A “linear function of the parameters” refers to a quantity that can be written as \(\mathbf{a}'\boldsymbol{\theta}\), where \(\mathbf{a}\) is some (constant) contrast vector. Common examples include:

  • A single parameter \(\theta_j\) (using a vector \(\mathbf{a}\) with 1 in the \(j\)-th position and 0 elsewhere).

  • Differences, sums, or other contrasts, e.g. \(\theta_1 - \theta_2\).

Suppose we are interested in a linear combination of the parameters, such as \(\theta_1 - \theta_2\) in a model with three parameters. Define the contrast vector \(\mathbf{a}\) as:

\[ \mathbf{a} = (1, -1, 0)' \]

We then consider inference for \(\mathbf{a}'\boldsymbol{\theta}\), where \(\mathbf{a}\) is any fixed \(p\)-dimensional vector. Using the rules for the expectation and variance of a linear combination of a random vector \(\mathbf{Z}\):

\[ \begin{aligned} E(\mathbf{a'Z}) &= \mathbf{a'}E(\mathbf{Z}) \\ \text{Var}(\mathbf{a'Z}) &= \mathbf{a'} \text{Var}(\mathbf{Z}) \mathbf{a} \end{aligned} \]

We have

\[ \begin{aligned} E(\mathbf{a'\hat{\theta}}) &= \mathbf{a'}E(\hat{\theta}) \approx \mathbf{a}' \theta \\ \text{Var}(\mathbf{a'} \hat{\theta}) &= \mathbf{a'} \text{Var}(\hat{\theta}) \mathbf{a} \approx \sigma^2 \mathbf{a'[\mathbf{F}(\theta)'\mathbf{F}(\theta)]^{-1}a} \end{aligned} \]

Hence,

\[ \mathbf{a'\hat{\theta}} \sim AN\big(\mathbf{a'\theta}, \sigma^2 \mathbf{a'[\mathbf{F}(\theta)'\mathbf{F}(\theta)]^{-1}a}\big) \]


Confidence Intervals for Linear Contrasts

Since \(\mathbf{a}'\hat{\boldsymbol{\theta}}\) is asymptotically independent of \(s^2\) (to order \(O(1/n)\)), a two-sided \(100(1-\alpha)\%\) confidence interval for \(\mathbf{a}'\boldsymbol{\theta}\) is given by:

\[ \mathbf{a}'\hat{\boldsymbol{\theta}} \pm t_{(1-\alpha/2, n-p)} s \sqrt{\mathbf{a}'[\mathbf{F}(\hat{\theta})'\mathbf{F}(\hat{\theta})]^{-1}\mathbf{a}} \]

where

  • \(t_{(1-\alpha/2, n-p)}\) is the \(1-\alpha/2\) quantile of the \(t\)-distribution with \(n - p\) degrees of freedom.
  • \(s = \sqrt{\hat{\sigma}^2_\epsilon}\) is the estimated standard deviation of the residuals.
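
As a sketch, continuing the Python example above (the contrast \(\mathbf{a} = (1, -1)'\) for \(\theta_1 - \theta_2\) in that two-parameter model is an illustrative choice):

from scipy.stats import t

a = np.array([1.0, -1.0])                         # contrast for theta_1 - theta_2
est = a @ theta_hat
se = np.sqrt(s2 * a @ np.linalg.inv(F_hat.T @ F_hat) @ a)
tcrit = t.ppf(1 - 0.05 / 2, df=n - p)             # t critical value for alpha = 0.05
ci_contrast = (est - tcrit * se, est + tcrit * se)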

Special Case: A Single Parameter \(\theta_j\)

If we focus on a single parameter \(\theta_j\), let \(\mathbf{a'} = (0, \dots, 1, \dots, 0)\) (with 1 at the \(j\)-th position). Then, the confidence interval for \(\theta_j\) becomes:

\[ \hat{\theta}_j \pm t_{(1-\alpha/2, n-p)} s \sqrt{\hat{c}^j} \]

where \(\hat{c}^j\) is the \(j\)-th diagonal element of \([\mathbf{F}(\hat{\theta})'\mathbf{F}(\hat{\theta})]^{-1}\).
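
In the running example, the single-parameter interval only needs a diagonal element of \([\mathbf{F}(\hat{\theta})'\mathbf{F}(\hat{\theta})]^{-1}\):

j = 1                                             # 0-based index, i.e. theta_2
c_jj = np.linalg.inv(F_hat.T @ F_hat)[j, j]       # j-th diagonal element
half_width = tcrit * np.sqrt(s2 * c_jj)
ci_theta_j = (theta_hat[j] - half_width, theta_hat[j] + half_width)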


6.1.2 Nonlinear Functions of Parameters

In many cases, we are interested in nonlinear functions of \(\boldsymbol{\theta}\). Let \(h(\boldsymbol{\theta})\) be such a function (e.g., a ratio of parameters, a difference of exponentials, etc.).

When \(h(\theta)\) is a nonlinear function of the parameters, we can use a Taylor series expansion about \(\theta\) to approximate \(h(\hat{\theta})\):

\[ h(\hat{\theta}) \approx h(\theta) + \mathbf{h}' [\hat{\theta} - \theta] \]

where

  • \(\mathbf{h} = \left( \frac{\partial h}{\partial \theta_1}, \frac{\partial h}{\partial \theta_2}, \dots, \frac{\partial h}{\partial \theta_p} \right)'\) is the gradient vector of partial derivatives.

Key Approximations:

  1. Expectation and Variance of \(\hat{\theta}\) (using the asymptotic normality of \(\hat{\theta}\)): \[ \begin{aligned} E(\hat{\theta}) &\approx \theta, \\ \text{Var}(\hat{\theta}) &\approx \sigma^2 [\mathbf{F}(\theta)' \mathbf{F}(\theta)]^{-1}. \end{aligned} \]

  2. Expectation and Variance of \(h(\hat{\theta})\) (applying the expectation and variance rules to the approximately linear transformation above): \[ \begin{aligned} E(h(\hat{\theta})) &\approx h(\theta), \\ \text{Var}(h(\hat{\theta})) &\approx \sigma^2 \mathbf{h}'[\mathbf{F}(\theta)' \mathbf{F}(\theta)]^{-1} \mathbf{h}. \end{aligned}\]

Combining these results, we find:

\[ h(\hat{\theta}) \sim AN(h(\theta), \sigma^2 \mathbf{h}' [\mathbf{F}(\theta)' \mathbf{F}(\theta)]^{-1} \mathbf{h}), \]

where \(AN\) represents asymptotic normality.

Confidence Interval for \(h(\theta)\):

An approximate \(100(1-\alpha)\%\) confidence interval for \(h(\theta)\) is:

\[ h(\hat{\theta}) \pm t_{(1-\alpha/2, n-p)} s \sqrt{\mathbf{h}'[\mathbf{F}(\theta)' \mathbf{F}(\theta)]^{-1} \mathbf{h}}, \]

where \(\mathbf{h}\) and \(\mathbf{F}(\theta)\) are evaluated at \(\hat{\theta}\).
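
A short delta-method sketch, continuing the running example (the ratio \(h(\theta) = \theta_1/\theta_2\) and its hand-coded gradient are illustrative choices):

h_hat = theta_hat[0] / theta_hat[1]               # h(theta_hat) = theta_1 / theta_2
grad_h = np.array([1.0 / theta_hat[1],
                   -theta_hat[0] / theta_hat[1] ** 2])   # gradient of h, evaluated at theta_hat
se_h = np.sqrt(s2 * grad_h @ np.linalg.inv(F_hat.T @ F_hat) @ grad_h)
ci_h = (h_hat - tcrit * se_h, h_hat + tcrit * se_h)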


Prediction Interval for a New Observation

To compute a prediction interval for a new observation \(Y_0\) at \(x = x_0\):

  1. Model Definition: \[ Y_0 = f(x_0; \theta) + \epsilon_0, \quad \epsilon_0 \sim N(0, \sigma^2), \] with the predicted value: \[ \hat{Y}_0 = f(x_0, \hat{\theta}). \]

  2. Approximation for \(\hat{Y}_0\): For large \(n\), \(\hat{\theta}\) is close to \(\theta\), so a first-order Taylor expansion about \(\theta\) gives: \[ f(x_0, \hat{\theta}) \approx f(x_0, \theta) + \mathbf{f}_0(\theta)' [\hat{\theta} - \theta], \] where: \[ \mathbf{f}_0(\theta) = \left( \frac{\partial f(x_0, \theta)}{\partial \theta_1}, \dots, \frac{\partial f(x_0, \theta)}{\partial \theta_p} \right)'. \]

  3. Error Approximation: \[ \begin{aligned} Y_0 - \hat{Y}_0 &\approx Y_0 - f(x_0,\theta) - \mathbf{f}_0(\theta)'[\hat{\theta}-\theta] \\ &= \epsilon_0 - \mathbf{f}_0(\theta)'[\hat{\theta}-\theta] \end{aligned} \]

  4. Variance of \(Y_0 - \hat{Y}_0\): \[ \begin{aligned} \text{Var}(Y_0 - \hat{Y}_0) &\approx \text{Var}(\epsilon_0 - \mathbf{f}_0(\theta)' [\hat{\theta} - \theta]) \\ &= \sigma^2 + \sigma^2 \mathbf{f}_0(\theta)' [\mathbf{F}(\theta)' \mathbf{F}(\theta)]^{-1} \mathbf{f}_0(\theta) \\ &= \sigma^2 \big(1 + \mathbf{f}_0(\theta)' [\mathbf{F}(\theta)' \mathbf{F}(\theta)]^{-1} \mathbf{f}_0(\theta)\big). \end{aligned} \]

Hence, the prediction error \(Y_0 - \hat{Y}_0\) follows an asymptotic normal distribution:

\[ Y_0 - \hat{Y}_0 \sim AN\big(0, \sigma^2 \big(1 + \mathbf{f}_0(\theta)' [\mathbf{F}(\theta)' \mathbf{F}(\theta)]^{-1} \mathbf{f}_0(\theta)\big)\big). \]

A \(100(1-\alpha)\%\) prediction interval for \(Y_0\) is:

\[ \hat{Y}_0 \pm t_{(1-\alpha/2, n-p)} s \sqrt{1 + \mathbf{f}_0(\hat{\theta})' [\mathbf{F}(\hat{\theta})' \mathbf{F}(\hat{\theta})]^{-1} \mathbf{f}_0(\hat{\theta})}. \]

where \(\hat{\theta}\) is substituted into both \(\mathbf{f}_0\) and \(\mathbf{F}\). Recall that \(s\) is the estimate of \(\sigma\).
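
Continuing the running example, a sketch of this prediction interval at a new point (the value \(x_0 = 2.5\) is arbitrary):

x0 = np.array([2.5])                              # new design point (arbitrary choice)
y0_hat = f(x0, theta_hat)[0]                      # predicted value f(x0, theta_hat)
f0 = jacobian(theta_hat, x0).ravel()              # gradient f_0(theta_hat)
quad = f0 @ np.linalg.inv(F_hat.T @ F_hat) @ f0
pi = (y0_hat - tcrit * np.sqrt(s2 * (1 + quad)),
      y0_hat + tcrit * np.sqrt(s2 * (1 + quad)))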


Confidence Interval for the Mean Response

Sometimes we want a confidence interval for \(E(Y_0)\) (i.e., the mean response at \(x_0\)) rather than a prediction interval for an individual future observation. In that case, the variance contribution of the new-observation error \(\epsilon_0\) is not included, so the formula is the same but without the “+1”:

\[ E(Y_0) = f(x_0; \theta), \]

and the confidence interval is:

\[ f(x_0, \hat{\theta}) \pm t_{(1-\alpha/2, n-p)} s \sqrt{\mathbf{f}_0(\hat{\theta})' [\mathbf{F}(\hat{\theta})' \mathbf{F}(\hat{\theta})]^{-1} \mathbf{f}_0(\hat{\theta})}. \]
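
In the running example, the only change from the prediction-interval code is dropping the “+1” term:

ci_mean = (y0_hat - tcrit * np.sqrt(s2 * quad),   # same quantities as in the prediction interval,
           y0_hat + tcrit * np.sqrt(s2 * quad))   # without the "+1" for the new error epsilon_0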

Summary

  • Linear Functions of the Parameters: A function \(f(x, \theta)\) is linear in \(\theta\) if it can be written in the form \[f(x, \theta) = \theta_1 g_1(x) + \theta_2 g_2(x) + \dots + \theta_p g_p(x)\] where \(g_j(x)\) do not depend on \(\theta\). In this case, the Jacobian \(\mathbf{F}(\theta)\) does not depend on \(\theta\) itself (only on \(x_i\)), and exact formulas often match what is familiar from linear regression.

  • Nonlinear Functions of Parameters: If \(f(x, \theta)\) depends on \(\theta\) in a nonlinear way (e.g., \(\theta_1 e^{\theta_2 x}\), \(\theta_1/\theta_2\), or more complicated expressions), \(\mathbf{F}(\theta)\) depends on \(\theta\). Estimation generally requires iterative numerical methods (e.g., Gauss–Newton, Levenberg–Marquardt), and the asymptotic results rely on evaluating the partial derivatives at \(\hat{\theta}\). Nevertheless, the final inference formulas (confidence intervals, prediction intervals) have a similar form, thanks to the asymptotic normality arguments.