Chapter 5 Sufficiency Principle, Minimal Sufficient Statistics (Lecture on 01/16/2020)
In practice, we usually find a sufficient statistic by simple inspection of the pdf or pmf of the sample; the Factorization Theorem (Theorem 5.1) is what justifies this inspection.
Proof. Suppose \(T(\mathbf{X})\) is a sufficient statistic. Choose \(g(t|\theta)=P_{\theta}(T(\mathbf{X})=t)\) and \(h(\mathbf{x})=P(\mathbf{X}=\mathbf{x}|T(\mathbf{X})=T(\mathbf{x}))\). Because \(T(\mathbf{X})\) is sufficient, the conditional probability defining \(h(\mathbf{x})\) does not depend on \(\theta\). Thus, the choice of \(h(\mathbf{x})\) and \(g(t|\theta)\) is legitimate. For this choice, \[\begin{equation} \begin{split} f(\mathbf{x}|\theta)&=P_{\theta}(\mathbf{X}=\mathbf{x})\\ &=P_{\theta}(\mathbf{X}=\mathbf{x}\text{ and }T(\mathbf{X})=T(\mathbf{x}))\\ &=P_{\theta}(T(\mathbf{X})=T(\mathbf{x}))P(\mathbf{X}=\mathbf{x}|T(\mathbf{X})=T(\mathbf{x}))\\ &=g(T(\mathbf{x})|\theta)h(\mathbf{x}) \end{split} \tag{5.2} \end{equation}\] The second equality holds because the event \(\{\mathbf{X}=\mathbf{x}\}\) is contained in \(\{T(\mathbf{X})=T(\mathbf{x})\}\). So the factorization (5.1) has been exhibited. Also, \(P_{\theta}(T(\mathbf{X})=T(\mathbf{x}))=g(T(\mathbf{x})|\theta)\), so \(g(t|\theta)\) is the pmf of \(T(\mathbf{X})\).
Now assume the factorization (5.1) exists. Let \(q(t|\theta)\) be the pmf of \(T(\mathbf{X})\). To show that \(T(\mathbf{X})\) is sufficient, we examine the ratio \(f(\mathbf{x}|\theta)/q(T(\mathbf{x})|\theta)\). Define \(A_{T(\mathbf{x})}=\{\mathbf{y}:T(\mathbf{y})=T(\mathbf{x})\}\). Then \[\begin{equation} \begin{split} \frac{f(\mathbf{x}|\theta)}{q(T(\mathbf{x})|\theta)}&=\frac{g(T(\mathbf{x})|\theta)h(\mathbf{x})}{q(T(\mathbf{x})|\theta)}\\ &=\frac{g(T(\mathbf{x})|\theta)h(\mathbf{x})}{\sum_{\mathbf{y}\in A_{T(\mathbf{x})}}g(T(\mathbf{y})|\theta)h(\mathbf{y})}\\ &=\frac{g(T(\mathbf{x})|\theta)h(\mathbf{x})}{g(T(\mathbf{x})|\theta)\sum_{\mathbf{y}\in A_{T(\mathbf{x})}}h(\mathbf{y})}\\ &=\frac{h(\mathbf{x})}{\sum_{\mathbf{y}\in A_{T(\mathbf{x})}}h(\mathbf{y})} \end{split} \tag{5.3} \end{equation}\] Since the ratio does not depend on \(\theta\), by Theorem 3.4, \(T(\mathbf{X})\) is a sufficient statistic for \(\theta\).

In words, the Factorization Theorem says we factor the joint pdf or pmf into two parts. One part, the function \(h(\mathbf{x})\), does not depend on \(\theta\). The other part depends on \(\theta\), and it depends on the sample \(\mathbf{x}\) only through some function \(T(\mathbf{x})\); this \(T(\mathbf{X})\) is a sufficient statistic for \(\theta\).
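Sufficiency in the discrete case is easy to check numerically. The following sketch (my own, not from the lecture) takes i.i.d. Bernoulli(\(\theta\)) observations with \(T(\mathbf{x})=\sum_ix_i\) and computes the conditional probability \(P_{\theta}(\mathbf{X}=\mathbf{x}|T(\mathbf{X})=T(\mathbf{x}))\) used to define \(h(\mathbf{x})\) in the proof, confirming that it is free of \(\theta\):

```python
from itertools import product

def joint_pmf(x, theta):
    """P_theta(X = x) for i.i.d. Bernoulli(theta) observations."""
    s = sum(x)
    return theta ** s * (1 - theta) ** (len(x) - s)

def cond_prob_given_t(x, theta):
    """P_theta(X = x | T(X) = T(x)) with T(x) = sum(x)."""
    t = sum(x)
    # P_theta(T(X) = t): sum the joint pmf over all y with T(y) = t
    p_t = sum(joint_pmf(y, theta)
              for y in product([0, 1], repeat=len(x))
              if sum(y) == t)
    return joint_pmf(x, theta) / p_t

x = (1, 0, 1, 1, 0)
for theta in (0.2, 0.5, 0.9):
    # prints 0.1 = 1/C(5,3) for every theta
    print(theta, cond_prob_given_t(x, theta))
```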
Example 5.1 (Normal Sufficient Statistic for the Mean) For the normal model \(N(\mu,\sigma^2)\) with \(\sigma^2\) known, the identity \(\sum_{i=1}^n(x_i-\mu)^2=\sum_{i=1}^n(x_i-\bar{x})^2+n(\bar{x}-\mu)^2\) lets us factor the joint pdf as \[\begin{equation} f(\mathbf{x}|\mu)=(2\pi\sigma^2)^{-n/2}\exp\left(-\sum_{i=1}^n(x_i-\bar{x})^2/(2\sigma^2)\right)\exp\left(-n(\bar{x}-\mu)^2/(2\sigma^2)\right) \tag{5.4} \end{equation}\]
Therefore, we can define \(h(\mathbf{x})=(2\pi\sigma^2)^{-n/2}\exp\left(-\sum_{i=1}^n(x_i-\bar{x})^2/(2\sigma^2)\right)\), which does not depend on the unknown parameter \(\mu\). By (5.4), the factor that does depend on \(\mu\) depends on the sample \(\mathbf{x}\) only through the function \(T(\mathbf{x})=\bar{x}\). Therefore, \(T(\mathbf{X})=\bar{X}\) is a sufficient statistic for \(\mu\).
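As a quick numerical sanity check of (5.4) (my own sketch, with an arbitrary sample and a known \(\sigma^2\) chosen for illustration), the joint pdf computed directly agrees with the product \(g(\bar{x}|\mu)h(\mathbf{x})\) for every \(\mu\):

```python
import math

def joint_pdf(x, mu, sigma2):
    """Joint pdf of i.i.d. N(mu, sigma2) observations, computed directly."""
    n = len(x)
    return (2 * math.pi * sigma2) ** (-n / 2) * math.exp(
        -sum((xi - mu) ** 2 for xi in x) / (2 * sigma2))

def factored_pdf(x, mu, sigma2):
    """The factorization (5.4): h(x) * g(xbar | mu)."""
    n = len(x)
    xbar = sum(x) / n
    h = (2 * math.pi * sigma2) ** (-n / 2) * math.exp(
        -sum((xi - xbar) ** 2 for xi in x) / (2 * sigma2))
    g = math.exp(-n * (xbar - mu) ** 2 / (2 * sigma2))
    return g * h

x, sigma2 = [1.3, -0.4, 2.1, 0.7], 1.5
for mu in (-1.0, 0.0, 2.5):
    print(joint_pdf(x, mu, sigma2), factored_pdf(x, mu, sigma2))  # equal
```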
Example 5.2 (Uniform Sufficient Statistic) Let \(X_1,\cdots,X_n\) be i.i.d. observations from the discrete uniform distribution on \(1,\cdots,\theta\), where the parameter \(\theta\) is a positive integer. The pmf of \(X_i\) is \[\begin{equation} f(x|\theta)=\left\{ \begin{aligned} &\frac{1}{\theta} &\quad x=1,2,\cdots,\theta \\ & 0 & \quad \text{otherwise} \end{aligned} \right. \tag{5.5} \end{equation}\]

Thus, the joint pmf of \(X_1,\cdots,X_n\) is \[\begin{equation} f(\mathbf{x}|\theta)=\left\{ \begin{aligned} &\theta^{-n} &\quad x_i\in\{1,\cdots,\theta\}\text{ for }i=1,\cdots,n \\ & 0 & \quad \text{otherwise} \end{aligned} \right. \tag{5.6} \end{equation}\]
Let \(\mathbb{N}=\{1,2,\cdots\}\) be the set of positive integers and let \(\mathbb{N}_{\theta}=\{1,2,\cdots,\theta\}\). Then the joint pmf of \(X_1,\cdots,X_n\) is \[\begin{equation} f(\mathbf{x}|\theta)=\prod_{i=1}^n\theta^{-1}I_{\mathbb{N}_{\theta}}(x_i)=\theta^{-n}\prod_{i=1}^nI_{\mathbb{N}_{\theta}}(x_i) \tag{5.7} \end{equation}\]
Define \(T(\mathbf{x})=\max_{i}x_i\). Since the \(x_i\) are positive integers, \(x_i\le\theta\) for all \(i\) if and only if \(\max_ix_i\le\theta\), so \[\begin{equation} \prod_{i=1}^nI_{\mathbb{N}_{\theta}}(x_i)=\left(\prod_{i=1}^nI_{\mathbb{N}}(x_i)\right)I_{\mathbb{N}_{\theta}}(T(\mathbf{x})) \tag{5.8} \end{equation}\]
Thus, we have the factorization \[\begin{equation} f(\mathbf{x}|\theta)=\theta^{-n}I_{\mathbb{N}_{\theta}}(T(\mathbf{x}))\left(\prod_{i=1}^nI_{\mathbb{N}}(x_i)\right) \tag{5.9} \end{equation}\] The first factor depends on \(\mathbf{x}\) only through \(T(\mathbf{x})\), and the second factor does not depend on \(\theta\). Thus, \(T(\mathbf{X})=\max_{i}X_i\) is a sufficient statistic for \(\theta\).
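The factorization (5.9) can be verified by brute force as well. A minimal sketch (my own, with hand-picked samples and values of \(\theta\)) comparing the joint pmf (5.6) with the factored form:

```python
def joint_pmf(x, theta):
    """Joint pmf (5.6): theta^(-n) if every x_i lies in {1, ..., theta}."""
    if all(xi in range(1, theta + 1) for xi in x):
        return theta ** (-len(x))
    return 0.0

def factored_pmf(x, theta):
    """Factorization (5.9) with T(x) = max(x)."""
    in_N = all(isinstance(xi, int) and xi >= 1 for xi in x)  # indicators I_N(x_i)
    return theta ** (-len(x)) * (max(x) <= theta) * in_N

for theta in (3, 5, 10):
    for x in [(1, 2, 3), (4, 4, 1), (2, 2, 2)]:
        assert abs(joint_pmf(x, theta) - factored_pmf(x, theta)) < 1e-12
print("factorization (5.9) verified")
```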
Example 5.3 (Normal Sufficient Statistic for Both Parameters) Assume \(X_1,\cdots,X_n\) are i.i.d. \(N(\mu,\sigma^2)\) with both parameters unknown. When using Theorem 5.1, any part of the joint pdf that depends on either \(\mu\) or \(\sigma^2\) must be included in the \(g\) function. Define \(T_1(\mathbf{x})=\bar{x}\), \(T_2(\mathbf{x})=s^2=\frac{1}{n-1}\sum_{i=1}^n(x_i-\bar{x})^2\), and \(h(\mathbf{x})=1\). Then \[\begin{equation} f(\mathbf{x}|\mu,\sigma^2)=g(T_1(\mathbf{x}),T_2(\mathbf{x})|\mu,\sigma^2)h(\mathbf{x}) \tag{5.10} \end{equation}\] where \(g(t_1,t_2|\mu,\sigma^2)=(2\pi\sigma^2)^{-n/2}\exp\left(-(n(t_1-\mu)^2+(n-1)t_2)/(2\sigma^2)\right)\).

By the Factorization Theorem, \(T(\mathbf{X})=(\bar{X},S^2)\) is a sufficient statistic for \((\mu,\sigma^2)\) in this normal model.
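Because \(h(\mathbf{x})=1\) here, (5.10) asserts that the joint pdf equals \(g(\bar{x},s^2|\mu,\sigma^2)\) exactly; the underlying algebraic identity is \(\sum_{i=1}^n(x_i-\mu)^2=n(\bar{x}-\mu)^2+(n-1)s^2\). A short check with an arbitrary sample (my own sketch):

```python
import math

def joint_pdf(x, mu, sigma2):
    """Joint pdf of i.i.d. N(mu, sigma2) observations, computed directly."""
    n = len(x)
    return (2 * math.pi * sigma2) ** (-n / 2) * math.exp(
        -sum((xi - mu) ** 2 for xi in x) / (2 * sigma2))

def g(t1, t2, mu, sigma2, n):
    """g(t1, t2 | mu, sigma^2) from Example 5.3, with t1 = xbar, t2 = s^2."""
    return (2 * math.pi * sigma2) ** (-n / 2) * math.exp(
        -(n * (t1 - mu) ** 2 + (n - 1) * t2) / (2 * sigma2))

x = [0.5, 1.9, -0.3, 2.2, 1.1]
n = len(x)
xbar = sum(x) / n
s2 = sum((xi - xbar) ** 2 for xi in x) / (n - 1)
for mu, sigma2 in [(0.0, 1.0), (1.0, 2.5), (-2.0, 0.3)]:
    print(joint_pdf(x, mu, sigma2), g(xbar, s2, mu, sigma2, n))  # equal
```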
Theorem 5.2 Let \(X_1,\cdots,X_n\) be i.i.d. observations from a pdf or pmf \(f(x|\boldsymbol{\theta})\) that belongs to an exponential family given by \[\begin{equation} f(x|\boldsymbol{\theta})=h(x)c(\boldsymbol{\theta})\exp\left(\sum_{i=1}^k\omega_i(\boldsymbol{\theta})t_i(x)\right) \tag{5.11} \end{equation}\] where \(\boldsymbol{\theta}=(\theta_1,\cdots,\theta_d)\), \(d\leq k\). Then \[\begin{equation} T(\mathbf{X})=\left(\sum_{j=1}^nt_1(X_j),\cdots,\sum_{j=1}^nt_k(X_j)\right) \tag{5.12} \end{equation}\] is a sufficient statistic for \(\boldsymbol{\theta}\). (This is Exercise 6.4 from Casella and Berger (2002).)
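For the \(N(\mu,\sigma^2)\) family, (5.11) holds with \(t_1(x)=x\) and \(t_2(x)=x^2\), so Theorem 5.2 yields \(T(\mathbf{X})=(\sum_jX_j,\sum_jX_j^2)\). The sketch below (my own) computes this statistic and maps it one-to-one onto the \((\bar{x},s^2)\) of Example 5.3, anticipating the remark on one-to-one functions that follows:

```python
def exp_family_T(x):
    """Sufficient statistic (5.12) for the normal family:
    t1(x) = x, t2(x) = x^2."""
    return sum(x), sum(xi ** 2 for xi in x)

def to_xbar_s2(t1, t2, n):
    """One-to-one map from (sum x_j, sum x_j^2) to (xbar, s^2)."""
    xbar = t1 / n
    s2 = (t2 - n * xbar ** 2) / (n - 1)
    return xbar, s2

x = [0.5, 1.9, -0.3, 2.2, 1.1]
t1, t2 = exp_family_T(x)
print(to_xbar_s2(t1, t2, len(x)))  # matches (xbar, s^2) computed directly
```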
A sufficient statistic always exists, and it is not unique. The complete sample \(\mathbf{X}\) is a sufficient statistic: factorize as \(f(\mathbf{x}|\theta)=g(T(\mathbf{x})|\theta)h(\mathbf{x})\), where \(T(\mathbf{x})=\mathbf{x}\), \(g(\mathbf{t}|\theta)=f(\mathbf{t}|\theta)\), and \(h(\mathbf{x})=1\).

Moreover, any one-to-one function of a sufficient statistic is a sufficient statistic. Suppose \(T(\mathbf{X})\) is a sufficient statistic, \(r\) is a bijection with inverse \(r^{-1}\), and \(T^*(\mathbf{x})=r(T(\mathbf{x}))\). Then by the Factorization Theorem, \[\begin{equation} f(\mathbf{x}|\theta)=g(T(\mathbf{x})|\theta)h(\mathbf{x})=g(r^{-1}(T^*(\mathbf{x}))|\theta)h(\mathbf{x}) \tag{5.14} \end{equation}\] Defining \(g^*(t|\theta)=g(r^{-1}(t)|\theta)\), we obtain \(f(\mathbf{x}|\theta)=g^*(T^*(\mathbf{x})|\theta)h(\mathbf{x})\), so \(T^*(\mathbf{X})\) is sufficient.

Since sufficient statistics are plentiful, we are interested in finding one that achieves the most data reduction while still retaining all the information about the parameter \(\theta\); such a statistic is called a minimal sufficient statistic.
A sufficient statistic can be thought of as inducing a partition of the sample space \(\mathcal{X}\). Let \(\mathcal{T}=\{t:t=T(\mathbf{x}),\mathbf{x}\in\mathcal{X}\}\); then \(T(\mathbf{x})\) partitions the sample space into sets \(A_t\), \(t\in\mathcal{T}\), defined by \(A_t:=\{\mathbf{x}\in\mathcal{X}:T(\mathbf{x})=t\}\). If \(T(\mathbf{X})\) is minimal sufficient, \(\{B_{t^{\prime}}:t^{\prime}\in\mathcal{T}^{\prime}\}\) are the partition sets for another sufficient statistic \(T^{\prime}(\mathbf{x})\), and \(\{A_t:t\in\mathcal{T}\}\) are the partition sets for \(T(\mathbf{x})\), then Definition 5.1 implies that every \(B_{t^{\prime}}\) is a subset of some \(A_t\). Thus, the partition associated with a minimal sufficient statistic is the coarsest possible partition for a sufficient statistic.
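A sketch (my own) makes the partitions concrete for \(n=3\) Bernoulli trials: \(T(\mathbf{x})=\sum_ix_i\) splits the 8 sample points into 4 sets \(A_t\); a one-to-one function of \(T\) induces exactly the same partition, consistent with (5.14); the complete sample induces the finest partition of 8 singletons.

```python
from itertools import product

def partition(T, sample_space):
    """Group the sample space into the sets A_t = {x : T(x) = t}."""
    sets = {}
    for x in sample_space:
        sets.setdefault(T(x), []).append(x)
    return sets

space = list(product([0, 1], repeat=3))
A = partition(sum, space)                       # T(x) = sum(x): 4 sets
B = partition(lambda x: 2 * sum(x) + 1, space)  # bijection of T: same sets
C = partition(lambda x: x, space)               # complete sample: 8 singletons
print(len(A), len(B), len(C))                   # 4 4 8
print(sorted(A.values()) == sorted(B.values())) # True: identical partitions
```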
Theorem 5.3 Let \(f(\mathbf{x}|\theta)\) be the pmf or pdf of a sample \(\mathbf{X}\). Suppose there exists a function \(T(\mathbf{x})\) such that, for every two sample points \(\mathbf{x}\) and \(\mathbf{y}\), the ratio \(f(\mathbf{x}|\theta)/f(\mathbf{y}|\theta)\) is constant as a function of \(\theta\) if and only if \(T(\mathbf{x})=T(\mathbf{y})\). Then \(T(\mathbf{X})\) is a minimal sufficient statistic for \(\theta\).

Proof. For simplicity, assume \(f(\mathbf{x}|\theta)>0\) for all \(\mathbf{x}\in\mathcal{X}\) and all \(\theta\).
First, we show \(T(\mathbf{X})\) is a sufficient statistic. Let \(\mathcal{T}=\{t:t=T(\mathbf{x}),\mathbf{x}\in\mathcal{X}\}\) and define the partition sets induced by \(T(\mathbf{x})\) as \(A_t:=\{\mathbf{x}\in\mathcal{X}:T(\mathbf{x})=t\}\). For each \(A_t\), choose and fix one element \(\mathbf{x}_t\in A_t\). For any \(\mathbf{x}\in\mathcal{X}\), \(\mathbf{x}_{T(\mathbf{x})}\) is the fixed element that is in the same set \(A_{T(\mathbf{x})}\) as \(\mathbf{x}\). This implies \(T(\mathbf{x})=T(\mathbf{x}_{T(\mathbf{x})})\), and hence, by the assumption of the theorem, \(f(\mathbf{x}|\theta)/f(\mathbf{x}_{T(\mathbf{x})}|\theta)\) is constant as a function of \(\theta\). Thus, we may define a function on \(\mathcal{X}\) by \(h(\mathbf{x})=f(\mathbf{x}|\theta)/f(\mathbf{x}_{T(\mathbf{x})}|\theta)\), and \(h\) does not depend on \(\theta\). Define a function on \(\mathcal{T}\) by \(g(t|\theta)=f(\mathbf{x}_t|\theta)\). Then \[\begin{equation} f(\mathbf{x}|\theta)=\frac{f(\mathbf{x}_{T(\mathbf{x})}|\theta)f(\mathbf{x}|\theta)}{f(\mathbf{x}_{T(\mathbf{x})}|\theta)}=g(T(\mathbf{x})|\theta)h(\mathbf{x}) \tag{5.15} \end{equation}\] and by Theorem 5.1, \(T(\mathbf{X})\) is a sufficient statistic for \(\theta\).
Now, to show that \(T(\mathbf{X})\) is minimal, let \(T^{\prime}(\mathbf{X})\) be any other sufficient statistic. By Theorem 5.1, there exist functions \(g^{\prime}\) and \(h^{\prime}\) such that \(f(\mathbf{x}|\theta)=g^{\prime}(T^{\prime}(\mathbf{x})|\theta)h^{\prime}(\mathbf{x})\). Let \(\mathbf{x}\) and \(\mathbf{y}\) be any two sample points with \(T^{\prime}(\mathbf{x})=T^{\prime}(\mathbf{y})\). Then \[\begin{equation} \frac{f(\mathbf{x}|\theta)}{f(\mathbf{y}|\theta)}=\frac{g^{\prime}(T^{\prime}(\mathbf{x})|\theta)h^{\prime}(\mathbf{x})}{g^{\prime}(T^{\prime}(\mathbf{y})|\theta)h^{\prime}(\mathbf{y})}=\frac{h^{\prime}(\mathbf{x})}{h^{\prime}(\mathbf{y})} \tag{5.16} \end{equation}\] Since this ratio does not depend on \(\theta\), the assumption of the theorem implies that \(T(\mathbf{x})=T(\mathbf{y})\). Hence \(T(\mathbf{x})\) is a function of \(T^{\prime}(\mathbf{x})\), and by Definition 5.1 we conclude that \(T(\mathbf{X})\) is minimal.
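The ratio criterion is convenient in practice. For \(X_1,\cdots,X_n\) i.i.d. \(N(\theta,1)\), \(\log\left(f(\mathbf{x}|\theta)/f(\mathbf{y}|\theta)\right)=(\sum_iy_i^2-\sum_ix_i^2)/2+n\theta(\bar{x}-\bar{y})\), which is constant in \(\theta\) exactly when \(\bar{x}=\bar{y}\); hence \(\bar{X}\) is minimal sufficient. A numerical sketch (my own, with hand-picked samples):

```python
def log_ratio(x, y, theta):
    """log of f(x|theta) / f(y|theta) for i.i.d. N(theta, 1) samples."""
    return (sum(-(xi - theta) ** 2 / 2 for xi in x)
            - sum(-(yi - theta) ** 2 / 2 for yi in y))

x = [0.0, 1.0, 2.0]   # xbar = 1
y = [1.5, 1.5, 0.0]   # ybar = 1: same mean, ratio constant in theta
z = [1.0, 1.0, 2.0]   # zbar = 4/3: ratio varies with theta
for theta in (-1.0, 0.0, 3.0):
    print(theta, log_ratio(x, y, theta), log_ratio(x, z, theta))
```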
References

Casella, George, and Roger Berger. 2002. Statistical Inference. 2nd ed. Belmont, CA: Duxbury Resource Center.