6.1 Inference
Since Y_i = f(\mathbf{x}_i, \boldsymbol{\theta}) + \epsilon_i, where \epsilon_i \sim iid(0, \sigma^2), we can estimate the parameters (\hat{\theta}) by minimizing the sum of squared errors:
\sum_{i=1}^{n} \big( Y_i - f(\mathbf{x}_i, \boldsymbol{\theta}) \big)^2
Let \hat{\theta} be the minimizer. The variance of the residuals is then estimated as:
s^2 = \hat{\sigma}^2_\epsilon = \frac{\sum_{i=1}^{n} \big( Y_i - f(\mathbf{x}_i, \hat{\theta}) \big)^2}{n - p}
where p is the number of parameters in \boldsymbol{\theta} and n is the number of observations.
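To make this concrete, the sketch below shows the estimation step in Python for a hypothetical exponential-decay mean function f(x, \theta) = \theta_1 e^{-\theta_2 x} with simulated data; the model, the data, and the variable names are illustrative choices, not part of the derivation.

```python
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(1)

def f(x, theta):
    # Illustrative mean function: theta_1 * exp(-theta_2 * x)
    return theta[0] * np.exp(-theta[1] * x)

# Simulated data (for illustration only)
n, p = 50, 2
x = np.linspace(0, 5, n)
theta_true = np.array([2.0, 0.7])
y = f(x, theta_true) + rng.normal(scale=0.1, size=n)

# Least squares: minimize sum_i (y_i - f(x_i, theta))^2 over theta
fit = least_squares(lambda th: y - f(x, th), x0=np.array([1.0, 1.0]))
theta_hat = fit.x

# Residual variance estimate: s^2 = SSE / (n - p)
sse = np.sum((y - f(x, theta_hat)) ** 2)
s2 = sse / (n - p)
s = np.sqrt(s2)
```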
Asymptotic Distribution of \hat{\theta}
Under regularity conditions (most notably that \epsilon_i \sim N(0, \sigma^2), or that n is sufficiently large for a central-limit-type argument), the parameter estimates \hat{\theta} have the following asymptotic normal distribution:
\hat{\theta} \sim AN\big(\theta, \sigma^2 [\mathbf{F}(\theta)'\mathbf{F}(\theta)]^{-1}\big)
where
- AN stands for “asymptotic normality.”
- \mathbf{F}(\theta) is the n \times p Jacobian matrix of partial derivatives of f(\mathbf{x}_i, \boldsymbol{\theta}) with respect to \boldsymbol{\theta}, evaluated at \hat{\theta}. Specifically,
\mathbf{F}(\theta) = \begin{pmatrix} \frac{\partial f(\mathbf{x}_1, \boldsymbol{\theta})}{\partial \theta_1} & \cdots & \frac{\partial f(\mathbf{x}_1, \boldsymbol{\theta})}{\partial \theta_p} \\ \vdots & \ddots & \vdots \\ \frac{\partial f(\mathbf{x}_n, \boldsymbol{\theta})}{\partial \theta_1} & \cdots & \frac{\partial f(\mathbf{x}_n, \boldsymbol{\theta})}{\partial \theta_p} \end{pmatrix}
Asymptotic normality means that as the sample size n becomes large, the sampling distribution of \hat{\theta} approaches a normal distribution, which enables inference on the parameters.
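Continuing the illustrative sketch above (it reuses f, x, theta_hat, s2, n, and p defined there), one way to obtain \mathbf{F}(\hat{\theta}) is by finite differences; the estimated asymptotic covariance of \hat{\theta} is then s^2 [\mathbf{F}(\hat{\theta})'\mathbf{F}(\hat{\theta})]^{-1}.

```python
import numpy as np

def jacobian(f, x, theta, eps=1e-6):
    # n x p matrix of partial derivatives of f(x_i, theta) w.r.t. theta_j,
    # approximated by forward differences
    n, p = len(x), len(theta)
    F = np.empty((n, p))
    base = f(x, theta)
    for j in range(p):
        step = np.zeros(p)
        step[j] = eps
        F[:, j] = (f(x, theta + step) - base) / eps
    return F

F_hat = jacobian(f, x, theta_hat)                 # F(theta_hat), n x p
cov_theta = s2 * np.linalg.inv(F_hat.T @ F_hat)   # s^2 [F'F]^{-1}
se_theta = np.sqrt(np.diag(cov_theta))            # standard errors of theta_hat
```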
6.1.1 Linear Functions of the Parameters
A “linear function of the parameters” refers to a quantity that can be written as \mathbf{a}'\boldsymbol{\theta}, where \mathbf{a} is some (constant) contrast vector. Common examples include:
- A single parameter \theta_j (using a vector \mathbf{a} with 1 in the j-th position and 0 elsewhere).
- Differences, sums, or other contrasts, e.g., \theta_1 - \theta_2.
Suppose we are interested in a linear combination of the parameters, such as \theta_1 - \theta_2 in a model with p = 3 parameters. Define the contrast vector \mathbf{a} as:
\mathbf{a} = (1, -1, 0)'
We then consider inference for \mathbf{a'\theta} (\mathbf{a} can be any constant p-dimensional vector). Using rules for the expectation and variance of a linear combination of a random vector \mathbf{Z}:
\begin{aligned} E(\mathbf{a'Z}) &= \mathbf{a'}E(\mathbf{Z}) \\ \text{Var}(\mathbf{a'Z}) &= \mathbf{a'} \text{Var}(\mathbf{Z}) \mathbf{a} \end{aligned}
We have
\begin{aligned} E(\mathbf{a'\hat{\theta}}) &= \mathbf{a'}E(\hat{\theta}) \approx \mathbf{a}' \theta \\ \text{Var}(\mathbf{a'} \hat{\theta}) &= \mathbf{a'} \text{Var}(\hat{\theta}) \mathbf{a} \approx \sigma^2 \mathbf{a'[\mathbf{F}(\theta)'\mathbf{F}(\theta)]^{-1}a} \end{aligned}
Hence,
\mathbf{a'\hat{\theta}} \sim AN\big(\mathbf{a'\theta}, \sigma^2 \mathbf{a'[\mathbf{F}(\theta)'\mathbf{F}(\theta)]^{-1}a}\big)
Confidence Intervals for Linear Contrasts
Since \mathbf{a'\hat{\theta}} is asymptotically independent of s^2 (up to order O(1/n)), a two-sided 100(1-\alpha)\% confidence interval for \mathbf{a'\theta} is given by:
\mathbf{a'\hat{\theta}} \pm t_{(1-\alpha/2, n-p)} s \sqrt{\mathbf{a'[\mathbf{F}(\hat{\theta})'\mathbf{F}(\hat{\theta})]^{-1}a}}
where
- t_{(1-\alpha/2, n-p)} is the critical value of the t-distribution with n - p degrees of freedom.
- s = \sqrt{\hat{\sigma}^2_\epsilon} is the estimated standard deviation of the residuals.
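For example, in the two-parameter running sketch the contrast \theta_1 - \theta_2 corresponds to \mathbf{a} = (1, -1)', and the interval could be computed as follows (reusing theta_hat, F_hat, s, n, and p from the earlier sketches):

```python
import numpy as np
from scipy import stats

# Contrast a'theta = theta_1 - theta_2 in the two-parameter example
a = np.array([1.0, -1.0])

est = a @ theta_hat
se = s * np.sqrt(a @ np.linalg.inv(F_hat.T @ F_hat) @ a)

alpha = 0.05
tcrit = stats.t.ppf(1 - alpha / 2, df=n - p)
ci = (est - tcrit * se, est + tcrit * se)
```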
Special Case: A Single Parameter \theta_j
If we focus on a single parameter \theta_j, let \mathbf{a'} = (0, \dots, 1, \dots, 0) (with 1 at the j-th position). Then, the confidence interval for \theta_j becomes:
\hat{\theta}_j \pm t_{(1-\alpha/2, n-p)} s \sqrt{\hat{c}^j}
where \hat{c}^j is the j-th diagonal element of [\mathbf{F}(\hat{\theta})'\mathbf{F}(\hat{\theta})]^{-1}.
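In the running sketch, the interval for a single coefficient, say \theta_1, can be read off the corresponding diagonal element (again reusing objects from the earlier sketches, including tcrit):

```python
j = 0                                    # Python index of theta_1
FtF_inv = np.linalg.inv(F_hat.T @ F_hat)
c_jj = FtF_inv[j, j]                     # j-th diagonal element
se_j = s * np.sqrt(c_jj)
ci_j = (theta_hat[j] - tcrit * se_j, theta_hat[j] + tcrit * se_j)
```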
6.1.2 Nonlinear Functions of Parameters
In many cases, we are interested in nonlinear functions of \boldsymbol{\theta}. Let h(\boldsymbol{\theta}) be such a function (e.g., a ratio of parameters, a difference of exponentials, etc.).
When h(\theta) is a nonlinear function of the parameters, we can use a Taylor series expansion about \theta to approximate h(\hat{\theta}):
h(\hat{\theta}) \approx h(\theta) + \mathbf{h}' [\hat{\theta} - \theta]
where
- \mathbf{h} = \left( \frac{\partial h}{\partial \theta_1}, \frac{\partial h}{\partial \theta_2}, \dots, \frac{\partial h}{\partial \theta_p} \right)' is the gradient vector of partial derivatives.
Key Approximations:
Expectation and Variance of \hat{\theta} (using the asymptotic normality of \hat{\theta}): \begin{aligned} E(\hat{\theta}) &\approx \theta, \\ \text{Var}(\hat{\theta}) &\approx \sigma^2 [\mathbf{F}(\theta)' \mathbf{F}(\theta)]^{-1}. \end{aligned}
Expectation and Variance of h(\hat{\theta}) (using the expectation and variance of an approximately linear transformation): \begin{aligned} E(h(\hat{\theta})) &\approx h(\theta), \\ \text{Var}(h(\hat{\theta})) &\approx \sigma^2 \mathbf{h}'[\mathbf{F}(\theta)' \mathbf{F}(\theta)]^{-1} \mathbf{h}. \end{aligned}
Combining these results, we find:
h(\hat{\theta}) \sim AN(h(\theta), \sigma^2 \mathbf{h}' [\mathbf{F}(\theta)' \mathbf{F}(\theta)]^{-1} \mathbf{h}),
where AN represents asymptotic normality.
Confidence Interval for h(\theta):
An approximate 100(1-\alpha)\% confidence interval for h(\theta) is:
h(\hat{\theta}) \pm t_{(1-\alpha/2, n-p)} s \sqrt{\mathbf{h}'[\mathbf{F}(\theta)' \mathbf{F}(\theta)]^{-1} \mathbf{h}},
where \mathbf{h} and \mathbf{F}(\theta) are evaluated at \hat{\theta}.
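As an illustration, a delta-method interval for the (hypothetical) ratio h(\theta) = \theta_1 / \theta_2 in the running sketch, with the gradient evaluated analytically at \hat{\theta} (reusing theta_hat, F_hat, s, n, and p):

```python
import numpy as np
from scipy import stats

def h(theta):
    # Illustrative nonlinear function of the parameters
    return theta[0] / theta[1]

def grad_h(theta):
    # Gradient of h: (1/theta_2, -theta_1/theta_2^2)'
    return np.array([1.0 / theta[1], -theta[0] / theta[1] ** 2])

hv = grad_h(theta_hat)
est = h(theta_hat)
se = s * np.sqrt(hv @ np.linalg.inv(F_hat.T @ F_hat) @ hv)
tcrit = stats.t.ppf(0.975, df=n - p)
ci = (est - tcrit * se, est + tcrit * se)
```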
To compute a prediction interval for a new observation Y_0 at x = x_0:
Model Definition: Y_0 = f(x_0; \theta) + \epsilon_0, \quad \epsilon_0 \sim N(0, \sigma^2), with the predicted value: \hat{Y}_0 = f(x_0, \hat{\theta}).
Approximation for \hat{Y}_0: As n \to \infty, \hat{\theta} \to \theta, so we have: f(x_0, \hat{\theta}) \approx f(x_0, \theta) + \mathbf{f}_0(\theta)' [\hat{\theta} - \theta], where: \mathbf{f}_0(\theta) = \left( \frac{\partial f(x_0, \theta)}{\partial \theta_1}, \dots, \frac{\partial f(x_0, \theta)}{\partial \theta_p} \right)'.
Error Approximation: \begin{aligned}Y_0 - \hat{Y}_0 &\approx Y_0 - f(x_0,\theta) - \mathbf{f}_0(\theta)'[\hat{\theta}-\theta] \\&= \epsilon_0 - \mathbf{f}_0(\theta)'[\hat{\theta}-\theta]\end{aligned}
Variance of Y_0 - \hat{Y}_0: \begin{aligned} \text{Var}(Y_0 - \hat{Y}_0) &\approx \text{Var}(\epsilon_0 - \mathbf{f}_0(\theta)' [\hat{\theta} - \theta]) \\ &= \sigma^2 + \sigma^2 \mathbf{f}_0(\theta)' [\mathbf{F}(\theta)' \mathbf{F}(\theta)]^{-1} \mathbf{f}_0(\theta) \\ &= \sigma^2 \big(1 + \mathbf{f}_0(\theta)' [\mathbf{F}(\theta)' \mathbf{F}(\theta)]^{-1} \mathbf{f}_0(\theta)\big). \end{aligned}
Hence, the prediction error Y_0 - \hat{Y}_0 follows an asymptotic normal distribution:
Y_0 - \hat{Y}_0 \sim AN\big(0, \sigma^2 \big(1 + \mathbf{f}_0(\theta)' [\mathbf{F}(\theta)' \mathbf{F}(\theta)]^{-1} \mathbf{f}_0(\theta)\big)\big).
A 100(1-\alpha)\% prediction interval for Y_0 is:
\hat{Y}_0 \pm t_{(1-\alpha/2, n-p)} s \sqrt{1 + \mathbf{f}_0(\hat{\theta})' [\mathbf{F}(\hat{\theta})' \mathbf{F}(\hat{\theta})]^{-1} \mathbf{f}_0(\hat{\theta})}.
where we substitute \hat{\theta} into \mathbf{f}_0 and \mathbf{F}. Recall that s is the estimate of \sigma.
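A sketch of this computation at an arbitrary new point, say x_0 = 2.5, again reusing f, theta_hat, F_hat, s, n, and p from the running example; here \mathbf{f}_0(\hat{\theta}) is approximated by finite differences:

```python
import numpy as np
from scipy import stats

x0 = np.array([2.5])                     # new design point (illustrative)

# Gradient f_0(theta_hat): partial derivatives of f(x_0, theta) w.r.t. theta
eps = 1e-6
f0 = np.array([
    (f(x0, theta_hat + eps * np.eye(p)[j]) - f(x0, theta_hat))[0] / eps
    for j in range(p)
])

y0_hat = f(x0, theta_hat)[0]
FtF_inv = np.linalg.inv(F_hat.T @ F_hat)
se_pred = s * np.sqrt(1.0 + f0 @ FtF_inv @ f0)

tcrit = stats.t.ppf(0.975, df=n - p)
pi = (y0_hat - tcrit * se_pred, y0_hat + tcrit * se_pred)
```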
Sometimes we want a confidence interval for E(Y_0) (i.e., the mean response at x_0), rather than a prediction interval for an individual future observation. In that case, the variance term for the random error \epsilon_0 is not included. Hence, the formula is the same but without the "+1":
E(Y_0) \approx f(x_0; \theta),
and the confidence interval is:
f(x_0, \hat{\theta}) \pm t_{(1-\alpha/2, n-p)} s \sqrt{\mathbf{f}_0(\hat{\theta})' [\mathbf{F}(\hat{\theta})' \mathbf{F}(\hat{\theta})]^{-1} \mathbf{f}_0(\hat{\theta})}.
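The corresponding interval for the mean response at the same x_0 simply drops the "1 +" inside the square root (reusing y0_hat, f0, FtF_inv, s, and tcrit from the prediction-interval sketch):

```python
se_mean = s * np.sqrt(f0 @ FtF_inv @ f0)   # no "+1": error variance excluded
ci_mean = (y0_hat - tcrit * se_mean, y0_hat + tcrit * se_mean)
```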
Summary
Linear Functions of the Parameters: A function f(x, \theta) is linear in \theta if it can be written in the form f(x, \theta) = \theta_1 g_1(x) + \theta_2 g_2(x) + \dots + \theta_p g_p(x) where g_j(x) do not depend on \theta. In this case, the Jacobian \mathbf{F}(\theta) does not depend on \theta itself (only on x_i), and exact formulas often match what is familiar from linear regression.
Nonlinear Functions of Parameters: If f(x, \theta) depends on \theta in a nonlinear way (e.g., \theta_1 e^{\theta_2 x}, \theta_1/\theta_2, or more complicated expressions), \mathbf{F}(\theta) depends on \theta. Estimation generally requires iterative numerical methods (e.g., Gauss–Newton, Levenberg–Marquardt), and the asymptotic results rely on evaluating the partial derivatives at \hat{\theta}. Nevertheless, the final inference formulas (confidence intervals, prediction intervals) have a similar form, thanks to the asymptotic normality arguments.