Delta Method, Sufficiency Principle (Lecture on 01/14/2020)
Starting from the limiting distribution of a standardized random variable, we now consider the distribution of a function of a random variable. The Delta Method is a technique for approximating the mean and variance of such a function, based on a Taylor series approximation.
Definition 3.1 (Taylor Polynomial) If a function \(g(x)\) has derivatives of order r, that is \(g^{(r)}(x)=\frac{d^r}{dx^r}g(x)\) exists, then for any constant a, the Taylor polynomial of order r about a is
\[\begin{equation}
T_r(x)=\sum_{i=0}^{r}g^{(i)}(a)\frac{(x-a)^i}{i!}
\tag{3.1}
\end{equation}\]
Taylor’s major theorem states that the remainder from the approximation \(g(x)-T_r(x)\) always tends to 0 faster than the highest-order explicit term. That is
\[\begin{equation}
\lim_{x\to a}\frac{g(x)-T_r(x)}{(x-a)^r}=0
\tag{3.2}
\end{equation}\]
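As a quick numerical sanity check (not part of the original lecture; the choice \(g(x)=e^x\), \(a=0\), and \(r=3\) is arbitrary), the following Python sketch builds the Taylor polynomial in (3.1) and illustrates that the scaled remainder in (3.2) shrinks as \(x\) approaches \(a\).

```python
import math

def taylor_poly_exp(x, r, a=0.0):
    # Order-r Taylor polynomial of exp about a; every derivative of exp at a equals exp(a).
    return sum(math.exp(a) * (x - a) ** i / math.factorial(i) for i in range(r + 1))

r, a = 3, 0.0
for x in [1.0, 0.5, 0.1, 0.01]:
    remainder = math.exp(x) - taylor_poly_exp(x, r, a)
    print(x, remainder / (x - a) ** r)  # ratio tends to 0 as x -> a, illustrating (3.2)
```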
For the multivariate case, let \(T_1,\cdots,T_k\) be random variables with means \(\theta_1,\cdots,\theta_k\), and define \(\mathbf{T}=(T_1,\cdots,T_k)\) and \(\mathbf{\theta}=(\theta_1,\cdots,\theta_k)\). Suppose there is a differentiable function \(g(\mathbf{T})\) whose variance we want to approximate. Define
\[\begin{equation}
g'_i(\mathbf{\theta})=\frac{\partial}{\partial t_i}g(\mathbf{t})|_{t_1=\theta_1,\cdots,t_k=\theta_k}
\tag{3.3}
\end{equation}\]
The first-order Taylor series expansion of g about \(\mathbf{\theta}\) is
\[\begin{equation}
g(\mathbf{t})=g(\mathbf{\theta})+\sum_{i=1}^kg'_i(\mathbf{\theta})(t_i-\theta_i)+Remainder
\tag{3.4}
\end{equation}\]
For statistical approximation, the remainder can be ignored and we end up with
\[\begin{equation}
g(\mathbf{t})\approx g(\mathbf{\theta})+\sum_{i=1}^kg'_i(\mathbf{\theta})(t_i-\theta_i)
\tag{3.5}
\end{equation}\]
Now take expectation on both sides of (3.5)
\[\begin{equation}
E_{\mathbf{T}}g(\mathbf{T})\approx g(\mathbf{\theta})+\sum_{i=1}^kg'_i(\mathbf{\theta})E_{\mathbf{T}}(T_i-\theta_i)=g(\mathbf{\theta})
\tag{3.6}
\end{equation}\]
Now approximate the variance of \(g(\mathbf{T})\) by
\[\begin{equation}
\begin{split}
Var_{\mathbf{T}}(g(\mathbf{T}))&\approx E_{\mathbf{T}}([g(\mathbf{T})-g(\mathbf{\theta})]^2)\\
&\approx E_{\mathbf{T}}((\sum_{i=1}^kg'_i(\mathbf{\theta})(T_i-\theta_i))^2)\\
&=\sum_{i=1}^k[g'_i(\mathbf{\theta})]^2Var_{\mathbf{T}}(T_i)+2\sum_{i>j}g'_i(\mathbf{\theta})g'_j(\mathbf{\theta})Cov_{\mathbf{T}}(T_i,T_j)
\end{split}
\tag{3.7}
\end{equation}\]
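The approximations (3.6) and (3.7) translate directly into code. The sketch below is only an illustration: the function g, the means, and the covariance matrix are made-up values, and the gradient in (3.3) is computed numerically by central differences rather than analytically.

```python
import numpy as np

def delta_var(g, theta, cov, eps=1e-6):
    # First-order approximation of Var(g(T)) as in (3.7):
    # grad g(theta)^T  Cov(T)  grad g(theta), using a numerical gradient.
    theta = np.asarray(theta, dtype=float)
    cov = np.asarray(cov, dtype=float)
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = eps
        grad[i] = (g(theta + e) - g(theta - e)) / (2 * eps)  # central difference for (3.3)
    return grad @ cov @ grad

# hypothetical example: g(t1, t2) = t1 / t2 with assumed means and covariance
g = lambda t: t[0] / t[1]
print(delta_var(g, theta=[2.0, 5.0], cov=[[1.0, 0.3], [0.3, 0.5]]))
```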
Theorem 3.1 (Delta Method) Let \(Y_n\) be a sequence of random variables that satisfies \(\sqrt{n}(Y_n-\theta)\to N(0,\sigma^2)\) in distribution. For a given function g and a specific value of \(\theta\), suppose that \(g^{\prime}(\theta)\) exists and is not 0. Then
\[\begin{equation}
\sqrt{n}[g(Y_n)-g(\theta)]\to N(0,\sigma^2[g^{\prime}(\theta)]^2)
\tag{3.8}
\end{equation}\]
in distribution.
Proof. The Taylor expansion of
\(g(Y_n)\) around
\(Y_n=\theta\) is
\[\begin{equation}
g(Y_n)=g(\theta)+g^{\prime}(\theta)(Y_n-\theta)+Remainder
\tag{3.9}
\end{equation}\]
where the remainder
\(\to0\) as
\(Y_n\to\theta\). We first show that \(\sqrt{n}(Y_n-\theta)\to N(0,\sigma^2)\) in distribution implies \(Y_n\to\theta\) in probability. Since
\(P(|Y_n-\theta|<\epsilon)=P(|\sqrt{n}(Y_n-\theta)|<\sqrt{n}\epsilon)\), we have
\[\begin{equation}
\lim_{n\to\infty}P(|Y_n-\theta|<\epsilon)=\lim_{n\to\infty}P(|\sqrt{n}(Y_n-\theta)|<\sqrt{n}\epsilon)=P(|Z|<\infty)=1
\tag{3.10}
\end{equation}\]
where
\(Z\sim N(0,\sigma^2)\). Thus the remainder converges to 0 in probability, and applying part (b) of Slutsky's Theorem (Theorem 2.9) to \(\sqrt{n}[g(Y_n)-g(\theta)]=g^{\prime}(\theta)\sqrt{n}(Y_n-\theta)+\sqrt{n}\cdot Remainder\) gives the result.
Example 3.1 (Approximate mean and variance) Suppose
\(X\) is a random variable with
\(EX=\mu\neq0\). If we want to estimate a function
\(g(\mu)\), a first-order approximation would give us
\(g(X)\approx g(\mu)+g^{\prime}(\mu)(X-\mu)\). If we use
\(g(X)\) as an estimator of
\(g(\mu)\), then approximately we have
\[\begin{align}
&Eg(X)\approx g(\mu) \tag{3.11}\\
&Var(g(X))\approx [g^{\prime}(\mu)]^2Var(X) \tag{3.12}\\
\end{align}\]
For a specific example, take
\(g(\mu)=1/\mu\); then we estimate
\(1/\mu\) with
\(1/X\) which approximately has mean and variance
\[\begin{align}
&E\frac{1}{X}\approx \frac{1}{\mu} \tag{3.13}\\
&Var(\frac{1}{X})\approx [\frac{1}{\mu^4}]Var(X) \tag{3.14}\\
\end{align}\]
Suppose now we have the mean of a random sample
\(\bar{X}\). For
\(\mu\neq0\), by Theorem
3.1 \(\sqrt{n}(\frac{1}{\bar{X}}-\frac{1}{\mu})\to N(0,\frac{1}{\mu^4}Var(X_1))\) in distribution. Usually
\(\mu\) and
\(Var(X_1)\) are both unknown, which can be estimated by
\(\bar{X}\) and
\(S^2\) respectively. Thus, we have the approximate variance
\(\hat{Var}(\frac{1}{\bar{X}})\approx (\frac{1}{\bar{X}^4})S^2\). And since
\(\bar{X}\) and
\(S^2\) are consistent estimators, we can apply Slutsky's Theorem again and conclude that for
\(\mu\neq 0\),
\[\begin{equation}
\frac{\sqrt{n}(\frac{1}{\bar{X}}-\frac{1}{\mu})}{(\frac{1}{\bar{X}})^2S}\to N(0,1)
\tag{3.15}
\end{equation}\]
in distribution.
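A small Monte Carlo experiment (a sketch, not from the lecture; the exponential distribution, sample size, and number of replications are arbitrary choices) can be used to check (3.15): the studentized quantity should look approximately standard normal for large \(n\).

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps, mu = 200, 5000, 2.0            # Exponential data with mean mu, so 1/mu = 0.5
z = np.empty(reps)
for r in range(reps):
    x = rng.exponential(scale=mu, size=n)
    xbar, s = x.mean(), x.std(ddof=1)
    z[r] = np.sqrt(n) * (1 / xbar - 1 / mu) / (s / xbar**2)  # the statistic in (3.15)
print(z.mean(), z.std())                # should be close to 0 and 1
```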
There are two extensions of the basic Delta Method. The first deals with the case \(g^{\prime}(\theta)=0\). If this happens, we take one more term in the Taylor expansion to get
\[\begin{equation}
g(Y_n)=g(\theta)+g^{\prime}(\theta)(Y_n-\theta)+\frac{g^{\prime\prime}(\theta)}{2}(Y_n-\theta)^2+Remainder
\tag{3.16}
\end{equation}\]
Rearranging terms we get
\[\begin{equation}
g(Y_n)-g(\theta)=\frac{g^{\prime\prime}(\theta)}{2}(Y_n-\theta)^2+Remainder
\tag{3.17}
\end{equation}\]
and notice that the square of a \(N(0,1)\) r.v. is a \(\chi_1^2\) r.v., which implies \(\frac{n(Y_n-\theta)^2}{\sigma^2}\to\chi_1^2\) in distribution. Therefore, we have the first extension of the Delta Method.
Theorem 3.2 (Second-order Delta Method) Let \(Y_n\) be a sequence of random variables that satisfies \(\sqrt{n}(Y_n-\theta)\to N(0,\sigma^2)\) in distribution. For a given function g and a specific value of \(\theta\), suppose that \(g^{\prime}(\theta)=0\), \(g^{\prime\prime}(\theta)\) exists and is not 0. Then
\[\begin{equation}
n[g(Y_n)-g(\theta)]\to \sigma^2\frac{g^{\prime\prime}(\theta)}{2}\chi_1^2
\tag{3.18}
\end{equation}\]
in distribution.
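To see Theorem 3.2 in action, one can take i.i.d. mean-zero data, \(Y_n=\bar{X}\), and \(g(y)=y^2\), so that \(g^{\prime}(0)=0\) and \(g^{\prime\prime}(0)=2\); the limit in (3.18) is then \(\sigma^2\chi_1^2\). The sketch below uses standard normal data (an assumption made purely for illustration), so the limiting mean and variance are 1 and 2.

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 500, 5000
stat = np.empty(reps)
for r in range(reps):
    x = rng.normal(0.0, 1.0, size=n)     # theta = 0, sigma^2 = 1
    stat[r] = n * (x.mean() ** 2 - 0.0)  # n[g(Y_n) - g(theta)] with g(y) = y^2
# limit is sigma^2 * (g''(0)/2) * chi^2_1 = chi^2_1, which has mean 1 and variance 2
print(stat.mean(), stat.var())
```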
The second extension deals with the multivariate case. First consider the following example concerning the moments of a ratio estimator.
Example 3.2 Suppose X and Y are random variables with nonzero means \(\mu_X\) and \(\mu_Y\), respectively. The parametric function to be estimated is \(g(\mu_X,\mu_Y)=\frac{\mu_X}{\mu_Y}\). It is straightforward to calculate \(\frac{\partial g}{\partial\mu_X}=\frac{1}{\mu_Y}\) and \(\frac{\partial g}{\partial\mu_Y}=\frac{-\mu_X}{\mu^2_Y}\). Then by (3.6) and (3.7) we have
\[\begin{equation}
E(\frac{X}{Y})\approx \frac{\mu_X}{\mu_Y}
\tag{3.19}
\end{equation}\]
and
\[\begin{equation}
\begin{split}
Var(\frac{X}{Y})&\approx \frac{1}{\mu_Y^2}Var(X)+\frac{\mu_X^2}{\mu_Y^4}Var(Y)-2\frac{\mu_X}{\mu_Y^3}Cov(X,Y)\\
&=(\frac{\mu_X}{\mu_Y})^2(\frac{Var(X)}{\mu_X^2}+\frac{Var(Y)}{\mu_Y^2}-2\frac{Cov(X,Y)}{\mu_X\mu_Y})
\end{split}
\tag{3.20}
\end{equation}\]
Thus, we obtain an approximation that uses only the means, variances, and covariance of \(X\) and \(Y\), while the exact calculation is hopeless.
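The accuracy of (3.19) and (3.20) can be checked by simulation. The sketch below assumes a bivariate normal distribution for \((X,Y)\) with made-up parameters whose means are well away from zero, so the simulated moments of the ratio are numerically stable.

```python
import numpy as np

rng = np.random.default_rng(2)
mu = np.array([3.0, 5.0])                     # (mu_X, mu_Y), assumed values
cov = np.array([[1.0, 0.4], [0.4, 0.8]])      # [[Var(X), Cov(X, Y)], [Cov(X, Y), Var(Y)]]
xy = rng.multivariate_normal(mu, cov, size=200000)
ratio = xy[:, 0] / xy[:, 1]

approx_mean = mu[0] / mu[1]                   # (3.19)
approx_var = (cov[0, 0] / mu[1] ** 2 + mu[0] ** 2 * cov[1, 1] / mu[1] ** 4
              - 2 * mu[0] * cov[0, 1] / mu[1] ** 3)  # (3.20)
print(ratio.mean(), approx_mean)
print(ratio.var(), approx_var)
```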
Motivated by the need to estimate a function of several parameters using several variables, we have the following multivariate version of the Delta Method.
Theorem 3.3 (Multivariate Delta Method)
Let \(\mathbf{X}_1,\cdots,\mathbf{X}_n\) be a random sample with \(E(X_{ij})=\mu_i\) and \(Cov(X_{ik},X_{jk})=\sigma_{ij}\). For a given function g with continuous first partial derivatives and a specific value of \(\mathbf{\mu}=(\mu_1,\cdots,\mu_p)\) for which \(\tau^2=\sum\sum\sigma_{ij}\frac{\partial g(\mu)}{\partial\mu_i}\cdot\frac{\partial g(\mu)}{\partial\mu_j}>0\),
\[\begin{equation}
\sqrt{n}[g(\bar{X}_1,\cdots,\bar{X}_p)-g(\mu_1,\cdots,\mu_p)]\to N(0,\tau^2)
\tag{3.21}
\end{equation}\]
in distribution.
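Applied to \(g(\mu_X,\mu_Y)=\mu_X/\mu_Y\), Theorem 3.3 says \(\sqrt{n}(\bar{X}/\bar{Y}-\mu_X/\mu_Y)\to N(0,\tau^2)\), with \(\tau^2\) given by the same algebra as (3.20). The sketch below (same made-up bivariate normal parameters as in the previous sketch) standardizes by the theoretical \(\tau\) and checks that the result looks standard normal.

```python
import numpy as np

rng = np.random.default_rng(3)
mu = np.array([3.0, 5.0])                      # assumed (mu_X, mu_Y)
cov = np.array([[1.0, 0.4], [0.4, 0.8]])       # assumed covariance matrix
tau2 = (cov[0, 0] / mu[1] ** 2 + mu[0] ** 2 * cov[1, 1] / mu[1] ** 4
        - 2 * mu[0] * cov[0, 1] / mu[1] ** 3)  # tau^2 for g(a, b) = a / b

n, reps = 300, 4000
z = np.empty(reps)
for r in range(reps):
    xy = rng.multivariate_normal(mu, cov, size=n)
    z[r] = np.sqrt(n) * (xy[:, 0].mean() / xy[:, 1].mean() - mu[0] / mu[1]) / np.sqrt(tau2)
print(z.mean(), z.std())                       # should be close to 0 and 1
```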
A sufficient statistic for a parameter \(\theta\) is a statistic that, in a certain sense, captures all the information about \(\theta\) contained in the sample. Any additional information in the sample, besides the value of the sufficient statistic, does not contain any more information about \(\theta\).
Definition 3.2 (SUFFICIENCY PRINCIPLE) If \(T(\mathbf{X})\) is a sufficient statistic for \(\theta\), then any inference about \(\theta\) should depend on the sample \(\mathbf{X}\) only through the value \(T(\mathbf{X})\). That is, if \(\mathbf{x}\) and \(\mathbf{y}\) are two sample points such that \(T(\mathbf{x})=T(\mathbf{y})\), then the inference about \(\theta\) should be the same whether \(\mathbf{X}=\mathbf{x}\) or \(\mathbf{X}=\mathbf{y}\) is observed.
Definition 3.3 (Sufficient Statistic) A statistic \(T(\mathbf{X})\) is a sufficient statistic for \(\theta\) if the conditional distribution of the sample \(\mathbf{X}\) given the value of \(T(\mathbf{X})\) does not depend on \(\theta\).
To use Definition 3.3 to verify that a statistic \(T(\mathbf{X})\) is a sufficient statistic for \(\theta\), we must verify that for any fixed \(\mathbf{x}\) and \(t\), the conditional probability \(P_{\theta}(\mathbf{X}=\mathbf{x}|T(\mathbf{X})=t)\) is the same for all values of \(\theta\). (Here the subscript in \(P_{\theta}\) indicates that \(\theta\) is a parameter of the distribution.) Since this probability is 0 regardless of \(\theta\) whenever \(T(\mathbf{x})\neq t\), we only need to verify that \(P_{\theta}(\mathbf{X}=\mathbf{x}|T(\mathbf{X})=T(\mathbf{x}))\) does not depend on \(\theta\). Since \(S_1:=\{s\in S: \mathbf{X}(s)=\mathbf{x}\}\) is a subset of \(S_2:=\{s\in S: T(\mathbf{X}(s))=T(\mathbf{x})\}\),
\[\begin{equation}
\begin{split}
P_{\theta}(\mathbf{X}=\mathbf{x}|T(\mathbf{X})=T(\mathbf{x}))&=\frac{P_{\theta}
(\mathbf{X}=\mathbf{x}\text{ and } T(\mathbf{X})=T(\mathbf{x}))}{P_{\theta}(T(\mathbf{X})=T(\mathbf{x}))}\\
&=\frac{P_{\theta}(\mathbf{X}=\mathbf{x})}{P_{\theta}(T(\mathbf{X})=T(\mathbf{x}))}\\
&=\frac{p(\mathbf{x}|\theta)}{q(T(\mathbf{x})|\theta)}
\end{split}
\tag{3.22}
\end{equation}\]
Here \(p(\mathbf{x}|\theta)\) is the joint pmf of the sample \(\mathbf{X}\) and \(q(t|\theta)\) is the pmf of \(T(\mathbf{X})\). Thus, \(T(\mathbf{X})\) is a sufficient statistic for \(\theta\) iff \(\forall \mathbf{x}\), the above ratio of pmfs is constant as a function of \(\theta\).
Theorem 3.4 If \(p(\mathbf{x}|\theta)\) is the joint pdf or pmf of the sample \(\mathbf{X}\) and \(q(t|\theta)\) is the pdf or pmf of \(T(\mathbf{X})\), then \(T(\mathbf{X})\) is a sufficient statistic for \(\theta\) if, for every \(\mathbf{x}\) in the sample space, the ratio \(\frac{p(\mathbf{x}|\theta)}{q(T(\mathbf{x})|\theta)}\) is constant as a function of \(\theta\).
Example 3.3 (Normal Sufficient Statistic) Let
\(X_1,\cdots,X_n\) be i.i.d.
\(N(\mu,\sigma^2)\) where
\(\sigma^2\) is known. We wish to show that the sample mean
\(T(\mathbf{X})=\bar{X}\) is a sufficient statistic for
\(\mu\). The joint pdf of the sample
\(\mathbf{X}\) is
\[\begin{equation}
\begin{split}
f(\mathbf{x}|\mu)&=\frac{1}{(2\pi\sigma^2)^{n/2}}exp(-\sum_{i=1}^n(x_i-\mu)^2/(2\sigma^2))\\
&=\frac{1}{(2\pi\sigma^2)^{n/2}}exp(-\sum_{i=1}^n(x_i-\bar{x}+\bar{x}-\mu)^2/(2\sigma^2))\\
&=\frac{1}{(2\pi\sigma^2)^{n/2}}exp(-(\sum_{i=1}^n(x_i-\bar{x})^2+n(\bar{x}-\mu)^2)/(2\sigma^2))
\end{split}
\tag{3.23}
\end{equation}\]
Since
\(\bar{X}\sim N(\mu,\sigma^2/n)\), the ratio of pdfs is
\[\begin{equation}
\begin{split}
\frac{p(\mathbf{x}|\theta)}{q(T(\mathbf{x})|\theta)}&=\frac{(2\pi\sigma^2)^{-n/2}exp(-(\sum_{i=1}^n(x_i-\bar{x})^2+n(\bar{x}-\mu)^2)/(2\sigma^2))}{(2\pi\sigma^2/n)^{-1/2}exp(-n(\bar{x}-\mu)^2/(2\sigma^2))}\\
&=n^{-1/2}(2\pi\sigma^2)^{-(n-1)/2}exp(-\sum_{i=1}^n(x_i-\bar{x})^2/(2\sigma^2))
\end{split}
\tag{3.24}
\end{equation}\]
which does not depend on
\(\mu\). Hence by Theorem
3.4, the sample mean is a sufficient statistic for the population mean
\(\mu\).
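The conclusion of Example 3.3 can also be sanity-checked numerically: for a fixed sample, the (log of the) ratio in (3.24) evaluated over a grid of \(\mu\) values should be constant. A minimal sketch, with a made-up sample and \(\sigma^2=1\):

```python
import numpy as np

sigma2 = 1.0                                  # known variance, as in Example 3.3
x = np.array([0.3, -1.2, 0.8, 2.1, 0.5])      # a fixed, made-up sample
n, xbar = x.size, x.mean()

def log_ratio(mu):
    # log of p(x | mu) / q(xbar | mu) for the normal model of Example 3.3
    log_joint = -n / 2 * np.log(2 * np.pi * sigma2) - np.sum((x - mu) ** 2) / (2 * sigma2)
    log_marg = -0.5 * np.log(2 * np.pi * sigma2 / n) - n * (xbar - mu) ** 2 / (2 * sigma2)
    return log_joint - log_marg

# identical values across mu, confirming the ratio is free of mu
print([round(float(log_ratio(mu)), 6) for mu in (-2.0, 0.0, 1.0, 3.0)])
```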