6.1 Inference

Since Y_i = f(x_i, \boldsymbol{\theta}) + \epsilon_i, where \epsilon_i \overset{iid}{\sim} (0, \sigma^2), we can estimate the parameters \hat{\theta} by minimizing the sum of squared errors:

\sum_{i=1}^{n} \big(Y_i - f(x_i, \boldsymbol{\theta})\big)^2

Let \hat{\theta} be the minimizer. The variance of the residuals is then estimated as:

s^2 = \hat{\sigma}^2_\epsilon = \frac{\sum_{i=1}^{n} \big(Y_i - f(x_i, \hat{\theta})\big)^2}{n - p}

where p is the number of parameters in \boldsymbol{\theta} and n is the number of observations.
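
As a minimal sketch of these two steps, assuming a hypothetical exponential-decay model f(x, \theta) = \theta_1 e^{-\theta_2 x} and simulated data, \hat{\theta} and s^2 can be obtained with scipy.optimize.least_squares:

```python
# Minimal sketch: nonlinear least squares for a hypothetical model
# f(x, theta) = theta_1 * exp(-theta_2 * x), fitted to simulated data.
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(1)
theta_true = np.array([5.0, 0.8])
x = np.linspace(0.1, 5.0, 30)
y = theta_true[0] * np.exp(-theta_true[1] * x) + rng.normal(0.0, 0.2, x.size)

def f(x, theta):
    return theta[0] * np.exp(-theta[1] * x)

def residuals(theta):
    return y - f(x, theta)

fit = least_squares(residuals, x0=[1.0, 1.0])     # minimizes the sum of squared errors
theta_hat = fit.x
n, p = x.size, theta_hat.size
s2 = np.sum(residuals(theta_hat) ** 2) / (n - p)  # residual variance estimate
s = np.sqrt(s2)
```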


Asymptotic Distribution of \hat{\theta}

Under regularity conditions (most notably that \epsilon_i \sim N(0, \sigma^2), or that n is sufficiently large for a central-limit-type argument), the parameter estimates \hat{\theta} have the following asymptotic normal distribution:

\hat{\theta} \sim AN\big(\boldsymbol{\theta}, \sigma^2 [\mathbf{F}(\theta)'\mathbf{F}(\theta)]^{-1}\big)

where

  • AN stands for “asymptotic normality.”
  • \mathbf{F}(\theta) is the n \times p Jacobian matrix of partial derivatives of f(\mathbf{x}_i, \boldsymbol{\theta}) with respect to \boldsymbol{\theta}, evaluated at \hat{\theta}. Specifically,

\mathbf{F}(\theta) = \begin{pmatrix} \frac{\partial f(\mathbf{x}_1, \boldsymbol{\theta})}{\partial \theta_1} & \cdots & \frac{\partial f(\mathbf{x}_1, \boldsymbol{\theta})}{\partial \theta_p} \\ \vdots & \ddots & \vdots \\ \frac{\partial f(\mathbf{x}_n, \boldsymbol{\theta})}{\partial \theta_1} & \cdots & \frac{\partial f(\mathbf{x}_n, \boldsymbol{\theta})}{\partial \theta_p} \end{pmatrix}

Asymptotic normality means that as the sample size n becomes large, the sampling distribution of \hat{\theta} approaches a normal distribution, which enables inference on the parameters.
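
Continuing the sketch above, the partial derivatives of the assumed model f(x, \theta) = \theta_1 e^{-\theta_2 x} can be written out analytically to form \mathbf{F}(\hat{\theta}) and estimate the asymptotic covariance matrix:

```python
# Continuing the sketch: n x p Jacobian F evaluated at theta_hat, and the
# estimated asymptotic covariance s^2 [F'F]^{-1} of theta_hat.
F = np.column_stack([
    np.exp(-theta_hat[1] * x),                      # df/dtheta_1
    -theta_hat[0] * x * np.exp(-theta_hat[1] * x),  # df/dtheta_2
])
FtF_inv = np.linalg.inv(F.T @ F)
cov_theta = s2 * FtF_inv                            # estimated Var(theta_hat)
se_theta = np.sqrt(np.diag(cov_theta))              # standard errors of each theta_j
```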

6.1.1 Linear Functions of the Parameters

A “linear function of the parameters” refers to a quantity that can be written as \mathbf{a}'\boldsymbol{\theta}, where \mathbf{a} is some (constant) contrast vector. Common examples include:

  • A single parameter \theta_j (using a vector \mathbf{a} with 1 in the j-th position and 0 elsewhere).

  • Differences, sums, or other contrasts, e.g. \theta_1 - \theta_2.

Suppose we are interested in a linear combination of the parameters, such as \theta_1 - \theta_2. Define the contrast vector \mathbf{a} as:

\mathbf{a} = (1, -1, 0, \dots, 0)'

We then consider inference for \mathbf{a'\theta} (\mathbf{a} can be any constant p-dimensional vector). Using rules for the expectation and variance of a linear combination of a random vector \mathbf{Z}:

\begin{aligned} E(\mathbf{a'Z}) &= \mathbf{a'}E(\mathbf{Z}) \\ \text{Var}(\mathbf{a'Z}) &= \mathbf{a'} \text{Var}(\mathbf{Z}) \mathbf{a} \end{aligned}

We have

\begin{aligned} E(\mathbf{a'\hat{\theta}}) &= \mathbf{a'}E(\hat{\theta}) \approx \mathbf{a}' \theta \\ \text{Var}(\mathbf{a'} \hat{\theta}) &= \mathbf{a'} \text{Var}(\hat{\theta}) \mathbf{a} \approx \sigma^2 \mathbf{a'[\mathbf{F}(\theta)'\mathbf{F}(\theta)]^{-1}a} \end{aligned}

Hence,

\mathbf{a'\hat{\theta}} \sim AN\big(\mathbf{a'\theta}, \sigma^2 \mathbf{a'[\mathbf{F}(\theta)'\mathbf{F}(\theta)]^{-1}a}\big)
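
For example, with the contrast \mathbf{a} = (1, -1, 0, \dots, 0)' introduced above, this specializes to

\begin{aligned} \mathbf{a'\hat{\theta}} &= \hat{\theta}_1 - \hat{\theta}_2, \\ \text{Var}(\mathbf{a'\hat{\theta}}) &\approx \sigma^2 \big(c^{11} + c^{22} - 2\, c^{12}\big), \end{aligned}

where c^{jk} denotes the (j, k) element of [\mathbf{F}(\theta)'\mathbf{F}(\theta)]^{-1}.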


Confidence Intervals for Linear Contrasts

Since \mathbf{a'\hat{\theta}} is asymptotically independent of s^2 (to order O(1/n)), a two-sided 100(1-\alpha)\% confidence interval for \mathbf{a'\theta} is given by:

\mathbf{a'\hat{\theta}} \pm t_{(1-\alpha/2, n-p)} s \sqrt{\mathbf{a'[\mathbf{F}(\hat{\theta})'\mathbf{F}(\hat{\theta})]^{-1}a}}

where

  • t_{(1-\alpha/2, n-p)} is the critical value of the t-distribution with n - p degrees of freedom.
  • s = \sqrt{\hat{\sigma}^2_\epsilon} is the estimated standard deviation of the residuals.
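
A sketch of this interval, continuing the example above (the example model has p = 2 parameters, so a hypothetical contrast for \theta_1 - \theta_2 is \mathbf{a} = (1, -1)'):

```python
# Continuing the sketch: CI for the linear contrast a'theta = theta_1 - theta_2.
from scipy.stats import t

a = np.array([1.0, -1.0])                  # contrast vector
est = a @ theta_hat
se = s * np.sqrt(a @ FtF_inv @ a)
alpha = 0.05
tcrit = t.ppf(1 - alpha / 2, df=n - p)     # t critical value with n - p df
ci_contrast = (est - tcrit * se, est + tcrit * se)
```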

Special Case: A Single Parameter \theta_j

If we focus on a single parameter \theta_j, let \mathbf{a'} = (0, \dots, 1, \dots, 0) (with 1 at the j-th position). Then, the confidence interval for \theta_j becomes:

\hat{\theta}_j \pm t_{(1-\alpha/2, n-p)} s \sqrt{\hat{c}^j}

where \hat{c}^j is the j-th diagonal element of [\mathbf{F}(\hat{\theta})'\mathbf{F}(\hat{\theta})]^{-1}.
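
In the running sketch, this is just the j-th diagonal element of [\mathbf{F}(\hat{\theta})'\mathbf{F}(\hat{\theta})]^{-1}:

```python
# Continuing the sketch: CI for a single parameter, here theta_1 (index 0).
j = 0
se_j = s * np.sqrt(FtF_inv[j, j])
ci_theta1 = (theta_hat[j] - tcrit * se_j, theta_hat[j] + tcrit * se_j)
```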


6.1.2 Nonlinear Functions of Parameters

In many cases, we are interested in nonlinear functions of \boldsymbol{\theta}. Let h(\boldsymbol{\theta}) be such a function (e.g., a ratio of parameters, a difference of exponentials, etc.).

When h(\theta) is a nonlinear function of the parameters, we can use a Taylor series expansion about \theta to approximate h(\hat{\theta}):

h(\hat{\theta}) \approx h(\theta) + \mathbf{h}' [\hat{\theta} - \theta]

where

  • \mathbf{h} = \left( \frac{\partial h}{\partial \theta_1}, \frac{\partial h}{\partial \theta_2}, \dots, \frac{\partial h}{\partial \theta_p} \right)' is the gradient vector of partial derivatives.

Key Approximations:

  1. Expectation and Variance of \hat{\theta} (using the asymptotic normality of \hat{\theta}): \begin{aligned} E(\hat{\theta}) &\approx \theta, \\ \text{Var}(\hat{\theta}) &\approx \sigma^2 [\mathbf{F}(\theta)' \mathbf{F}(\theta)]^{-1}. \end{aligned}

  2. Expectation and Variance of h(\hat{\theta}) (properties of expectation and variance of (approximately) linear transformations): \begin{aligned} E(h(\hat{\theta})) &\approx h(\theta), \\ \text{Var}(h(\hat{\theta})) &\approx \sigma^2 \mathbf{h}'[\mathbf{F}(\theta)' \mathbf{F}(\theta)]^{-1} \mathbf{h}. \end{aligned}

Combining these results, we find:

h(\hat{\theta}) \sim AN(h(\theta), \sigma^2 \mathbf{h}' [\mathbf{F}(\theta)' \mathbf{F}(\theta)]^{-1} \mathbf{h}),

where AN represents asymptotic normality.
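
As a concrete illustration, consider the ratio h(\boldsymbol{\theta}) = \theta_1 / \theta_2 mentioned above. Its gradient vector is

\mathbf{h} = \left( \frac{1}{\theta_2}, \; -\frac{\theta_1}{\theta_2^{2}}, \; 0, \dots, 0 \right)'

and plugging \mathbf{h} (evaluated at \hat{\theta}) into the variance expression above gives the approximate variance of \hat{\theta}_1 / \hat{\theta}_2.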

Confidence Interval for h(\theta):

An approximate 100(1-\alpha)\% confidence interval for h(\theta) is:

h(\hat{\theta}) \pm t_{(1-\alpha/2, n-p)} s \sqrt{\mathbf{h}'[\mathbf{F}(\theta)' \mathbf{F}(\theta)]^{-1} \mathbf{h}},

where \mathbf{h} and \mathbf{F}(\theta) are evaluated at \hat{\theta}.
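
Continuing the sketch, a delta-method interval for the ratio h(\theta) = \theta_1 / \theta_2, with the gradient written out as above, might look like:

```python
# Continuing the sketch: delta-method CI for h(theta) = theta_1 / theta_2.
h_hat = theta_hat[0] / theta_hat[1]
grad_h = np.array([1.0 / theta_hat[1],
                   -theta_hat[0] / theta_hat[1] ** 2])  # gradient of h at theta_hat
se_h = s * np.sqrt(grad_h @ FtF_inv @ grad_h)
ci_h = (h_hat - tcrit * se_h, h_hat + tcrit * se_h)
```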


Prediction Intervals

To compute a prediction interval for a new observation Y_0 at x = x_0:

  1. Model Definition: Y_0 = f(x_0; \theta) + \epsilon_0, \quad \epsilon_0 \sim N(0, \sigma^2), with the predicted value: \hat{Y}_0 = f(x_0, \hat{\theta}).

  2. Approximation for \hat{Y}_0: As n \to \infty, \hat{\theta} \to \theta, so we have: f(x_0, \hat{\theta}) \approx f(x_0, \theta) + \mathbf{f}_0(\theta)' [\hat{\theta} - \theta], where: \mathbf{f}_0(\theta) = \left( \frac{\partial f(x_0, \theta)}{\partial \theta_1}, \dots, \frac{\partial f(x_0, \theta)}{\partial \theta_p} \right)'.

  3. Error Approximation: \begin{aligned}Y_0 - \hat{Y}_0 &\approx Y_0 - f(x_0,\theta) - \mathbf{f}_0(\theta)'[\hat{\theta}-\theta] \\&= \epsilon_0 - \mathbf{f}_0(\theta)'[\hat{\theta}-\theta]\end{aligned}

  4. Variance of Y_0 - \hat{Y}_0: \begin{aligned} \text{Var}(Y_0 - \hat{Y}_0) &\approx \text{Var}(\epsilon_0 - \mathbf{f}_0(\theta)' [\hat{\theta} - \theta]) \\ &= \sigma^2 + \sigma^2 \mathbf{f}_0(\theta)' [\mathbf{F}(\theta)' \mathbf{F}(\theta)]^{-1} \mathbf{f}_0(\theta) \\ &= \sigma^2 \big(1 + \mathbf{f}_0(\theta)' [\mathbf{F}(\theta)' \mathbf{F}(\theta)]^{-1} \mathbf{f}_0(\theta)\big). \end{aligned}

Hence, the prediction error Y_0 - \hat{Y}_0 follows an asymptotic normal distribution:

Y_0 - \hat{Y}_0 \sim AN\big(0, \sigma^2 \big(1 + \mathbf{f}_0(\theta)' [\mathbf{F}(\theta)' \mathbf{F}(\theta)]^{-1} \mathbf{f}_0(\theta)\big)\big).

A 100(1-\alpha)\% prediction interval for Y_0 is:

\hat{Y}_0 \pm t_{(1-\alpha/2, n-p)} s \sqrt{1 + \mathbf{f}_0(\hat{\theta})' [\mathbf{F}(\hat{\theta})' \mathbf{F}(\hat{\theta})]^{-1} \mathbf{f}_0(\hat{\theta})}.

where we substitute \hat{\theta} into \mathbf{f}_0 and \mathbf{F}. Recall that s is the estimate of \sigma.
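
In the running sketch, a prediction interval at a hypothetical new point x_0 = 2.5 (with the partial derivatives of f(x_0, \theta) = \theta_1 e^{-\theta_2 x_0} written out analytically) looks like:

```python
# Continuing the sketch: prediction interval for a new observation at x0.
x0 = 2.5
y0_hat = theta_hat[0] * np.exp(-theta_hat[1] * x0)
f0 = np.array([np.exp(-theta_hat[1] * x0),                        # df/dtheta_1 at x0
               -theta_hat[0] * x0 * np.exp(-theta_hat[1] * x0)])  # df/dtheta_2 at x0
se_pred = s * np.sqrt(1.0 + f0 @ FtF_inv @ f0)
pi_y0 = (y0_hat - tcrit * se_pred, y0_hat + tcrit * se_pred)
```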


Confidence Interval for the Mean Response

Sometimes we want a confidence interval for E(Y_0) (i.e., the mean response at x_0), rather than a prediction interval for an individual future observation. In that case, the variance term for the random error \epsilon_0 is not included. Hence, the formula is the same but without the “+1”:

E(Y_0) = f(x_0, \theta),

and the confidence interval is:

f(x_0, \hat{\theta}) \pm t_{(1-\alpha/2, n-p)} s \sqrt{\mathbf{f}_0(\hat{\theta})' [\mathbf{F}(\hat{\theta})' \mathbf{F}(\hat{\theta})]^{-1} \mathbf{f}_0(\hat{\theta})}.
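
In the running sketch, the only change from the prediction interval is dropping the “+1” for \epsilon_0:

```python
# Continuing the sketch: CI for the mean response E(Y_0) at the same x0.
se_mean = s * np.sqrt(f0 @ FtF_inv @ f0)
ci_mean = (y0_hat - tcrit * se_mean, y0_hat + tcrit * se_mean)
```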

Summary

  • Linear Functions of the Parameters: A function f(x, \theta) is linear in \theta if it can be written in the form f(x, \theta) = \theta_1 g_1(x) + \theta_2 g_2(x) + \dots + \theta_p g_p(x) where g_j(x) do not depend on \theta. In this case, the Jacobian \mathbf{F}(\theta) does not depend on \theta itself (only on x_i), and exact formulas often match what is familiar from linear regression.

  • Nonlinear Functions of Parameters: If f(x, \theta) depends on \theta in a nonlinear way (e.g., \theta_1 e^{\theta_2 x}, \theta_1/\theta_2, or more complicated expressions), \mathbf{F}(\theta) depends on \theta. Estimation generally requires iterative numerical methods (e.g., Gauss–Newton, Levenberg–Marquardt), and the asymptotic results rely on evaluating the partial derivatives at \hat{\theta}. Nevertheless, the final inference formulas (confidence intervals, prediction intervals) have a similar form, thanks to the asymptotic normality arguments.