Chapter 5 Sufficiency Principle, Minimal Sufficient Statistics (Lecture on 01/16/2020)
We usually find a sufficient statistic by simple inspection of the pdf or pmf of the sample.
Theorem 5.1 (Factorization Theorem) Let f(\mathbf{x}|\theta) denote the joint pdf or pmf of a sample \mathbf{X}. A statistic T(\mathbf{X}) is a sufficient statistic for \theta if and only if there exist functions g(t|\theta) and h(\mathbf{x}) such that, for all sample points \mathbf{x} and all parameter points \theta,
\begin{equation} f(\mathbf{x}|\theta)=g(T(\mathbf{x})|\theta)h(\mathbf{x}) \tag{5.1} \end{equation}
Proof. (We give the proof for the discrete case.) Suppose T(\mathbf{X}) is a sufficient statistic. Choose g(t|\theta)=P_{\theta}(T(\mathbf{X})=t) and h(\mathbf{x})=P(\mathbf{X}=\mathbf{x}|T(\mathbf{X})=T(\mathbf{x})). Because T(\mathbf{X}) is sufficient, the conditional probability defining h(\mathbf{x}) does not depend on \theta. Thus, this choice of h(\mathbf{x}) and g(t|\theta) is legitimate. For this choice,
\begin{equation} \begin{split} f(\mathbf{x}|\theta)=P_{\theta}(\mathbf{X}=\mathbf{x})&=P_{\theta}(\mathbf{X}=\mathbf{x}\text{ and }T(\mathbf{X})=T(\mathbf{x}))\\ &=P_{\theta}(T(\mathbf{X})=T(\mathbf{x}))P(\mathbf{X}=\mathbf{x}|T(\mathbf{X})=T(\mathbf{x}))\\ &=g(T(\mathbf{x})|\theta)h(\mathbf{x}) \end{split} \tag{5.2} \end{equation}
So the factorization (5.1) has been exhibited. Also, P_{\theta}(T(\mathbf{X})=T(\mathbf{x}))=g(T(\mathbf{x})|\theta), so g(T(\mathbf{x})|\theta) is the pmf of T(\mathbf{X}).
Now assume the factorization (5.1) exists. Let q(t|\theta) be the pmf of T(\mathbf{X}). To show that T(\mathbf{X}) is sufficient we examine the ratio f(\mathbf{x}|\theta)/q(T(\mathbf{x})|\theta). Define A_{T(\mathbf{x})}=\{\mathbf{y}:T(\mathbf{y})=T(\mathbf{x})\}. Then
\begin{equation} \begin{split} \frac{f(\mathbf{x}|\theta)}{q(T(\mathbf{x})|\theta)}&=\frac{g(T(\mathbf{x})|\theta)h(\mathbf{x})}{q(T(\mathbf{x})|\theta)}=\frac{g(T(\mathbf{x})|\theta)h(\mathbf{x})}{\sum_{\mathbf{y}\in A_{T(\mathbf{x})}}g(T(\mathbf{y})|\theta)h(\mathbf{y})}\\ &=\frac{g(T(\mathbf{x})|\theta)h(\mathbf{x})}{g(T(\mathbf{x})|\theta)\sum_{\mathbf{y}\in A_{T(\mathbf{x})}}h(\mathbf{y})}=\frac{h(\mathbf{x})}{\sum_{\mathbf{y}\in A_{T(\mathbf{x})}}h(\mathbf{y})} \end{split} \tag{5.3} \end{equation}
Since this ratio does not depend on \theta, by Theorem 3.4, T(\mathbf{X}) is a sufficient statistic for \theta.

To use the Factorization Theorem, we factor the joint pdf into two parts: one part, h(\mathbf{x}), that does not depend on \theta, and another part that depends on \theta and depends on the sample \mathbf{x} only through some function T(\mathbf{x}). This function T(\mathbf{X}) is a sufficient statistic for \theta.
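As a concrete illustration of the conditional probability used in the proof, the following Python sketch (the Bernoulli model and the sample size n=4 are my own illustrative choices, not part of the lecture) checks numerically that, for i.i.d. Bernoulli(\theta) trials, the conditional pmf of the sample given T(\mathbf{X})=\sum_iX_i equals 1/\binom{n}{t} for every \theta, so it does not depend on \theta and T is sufficient.

```python
import numpy as np
from itertools import product
from math import comb

n = 4  # hypothetical sample size

def joint_pmf(x, theta):
    """P_theta(X = x) for a binary sample x of length n."""
    s = sum(x)
    return theta ** s * (1 - theta) ** (n - s)

for theta in [0.2, 0.5, 0.9]:
    for t in range(n + 1):
        points = [x for x in product([0, 1], repeat=n) if sum(x) == t]
        p_t = sum(joint_pmf(x, theta) for x in points)   # P_theta(T(X) = t)
        for x in points:
            cond = joint_pmf(x, theta) / p_t             # P(X = x | T(X) = t)
            assert np.isclose(cond, 1 / comb(n, t))      # free of theta
```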
Example 5.1 (Normal Sufficient Statistic for the Mean) For the normal model with \sigma^2 known, the joint pdf can be factored, using the identity \sum_{i=1}^n(x_i-\mu)^2=\sum_{i=1}^n(x_i-\bar{x})^2+n(\bar{x}-\mu)^2, as
\begin{equation} f(\mathbf{x}|\mu)=(2\pi\sigma^2)^{-n/2}\exp\left(-\sum_{i=1}^n(x_i-\bar{x})^2/(2\sigma^2)\right)\exp\left(-n(\bar{x}-\mu)^2/(2\sigma^2)\right) \tag{5.4} \end{equation}
Therefore, we can define h(\mathbf{x})=(2\pi\sigma^2)^{-n/2}\exp(-\sum_{i=1}^n(x_i-\bar{x})^2/(2\sigma^2)), which does not depend on the unknown parameter \mu. By (5.4), the remaining factor depends on \mu and depends on the sample \mathbf{x} only through the function T(\mathbf{x})=\bar{x}. Therefore, T(\mathbf{X})=\bar{X} is a sufficient statistic for \mu.
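The factorization (5.4) is easy to check numerically. The following sketch (the simulated sample, \sigma, and the grid of \mu values are arbitrary illustrative choices) verifies on the log scale that the directly computed joint pdf agrees with h(\mathbf{x})g(\bar{x}|\mu).

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2 = 1.5 ** 2                        # known variance (illustrative value)
x = rng.normal(loc=2.0, scale=np.sqrt(sigma2), size=10)
n, xbar = len(x), x.mean()

def log_joint_pdf(x, mu):
    """Log of the joint N(mu, sigma2) pdf, computed directly."""
    return -0.5 * len(x) * np.log(2 * np.pi * sigma2) - np.sum((x - mu) ** 2) / (2 * sigma2)

# log h(x): does not involve mu
log_h = -0.5 * n * np.log(2 * np.pi * sigma2) - np.sum((x - xbar) ** 2) / (2 * sigma2)

for mu in [-1.0, 0.0, 2.0, 3.7]:
    log_g = -n * (xbar - mu) ** 2 / (2 * sigma2)   # depends on x only through xbar
    assert np.isclose(log_joint_pdf(x, mu), log_h + log_g)
```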
Example 5.2 (Uniform Sufficient Statistic) Let X_1,\cdots,X_n be i.i.d. observations from the discrete uniform distribution on \{1,\cdots,\theta\}. The pmf of X_i is
\begin{equation} f(x|\theta)=\begin{cases}\frac{1}{\theta} & x=1,2,\cdots,\theta\\ 0 & \text{otherwise}\end{cases} \tag{5.5} \end{equation}
Thus, the joint pmf of X_1,\cdots,X_n is
\begin{equation} f(\mathbf{x}|\theta)=\begin{cases}\theta^{-n} & x_i\in\{1,\cdots,\theta\}\text{ for }i=1,\cdots,n\\ 0 & \text{otherwise}\end{cases} \tag{5.6} \end{equation}
Let \mathbb{N}=\{1,2,\cdots\} be the set of positive integers and let \mathbb{N}_{\theta}=\{1,2,\cdots,\theta\}. Then the joint pmf of X_1,\cdots,X_n is
\begin{equation} f(\mathbf{x}|\theta)=\prod_{i=1}^n\theta^{-1}I_{\mathbb{N}_{\theta}}(x_i)=\theta^{-n}\prod_{i=1}^nI_{\mathbb{N}_{\theta}}(x_i) \tag{5.7} \end{equation}
Defining T(\mathbf{x})=\max_ix_i, we have
\begin{equation} \prod_{i=1}^nI_{\mathbb{N}_{\theta}}(x_i)=(\prod_{i=1}^nI_{\mathbb{N}}(x_i))I_{\mathbb{N}_{\theta}}(T(\mathbf{x})) \tag{5.8} \end{equation}
since both sides equal 1 if and only if every x_i is a positive integer and \max_ix_i\leq\theta.
Thus, we have the factorization
\begin{equation} f(\mathbf{x}|\theta)=\theta^{-n}I_{\mathbb{N}_{\theta}}(T(\mathbf{x}))(\prod_{i=1}^nI_{\mathbb{N}}(x_i)) \tag{5.9} \end{equation}
The first factor depends on the sample only through T(\mathbf{x})=\max_ix_i, and the second factor does not depend on \theta. Thus, T(\mathbf{X})=\max_iX_i is a sufficient statistic for \theta.
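As a quick numerical illustration (a Python sketch with a hypothetical integer-valued sample and a \theta grid of my own choosing), one can check that the directly computed joint pmf agrees with the factorization (5.9), in which \theta enters only through \max_ix_i.

```python
import numpy as np

x = np.array([3, 1, 4, 1, 5])            # hypothetical integer-valued sample
n = len(x)
t = x.max()                               # T(x) = max_i x_i

def joint_pmf(x, theta):
    """theta^{-n} if every x_i lies in {1, ..., theta}, and 0 otherwise."""
    if np.all((x >= 1) & (x <= theta)):
        return theta ** (-float(len(x)))
    return 0.0

for theta in range(1, 11):
    g = theta ** (-float(n)) * (t <= theta)   # depends on x only through max x_i
    h = float(np.all(x >= 1))                 # I(all x_i in N); integrality holds by construction
    assert np.isclose(joint_pmf(x, theta), g * h)
```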
Example 5.3 (Normal Sufficient Statistic for Both Parameters) Assume X_1,\cdots,X_n are i.i.d. N(\mu,\sigma^2) with both parameters unknown. When using Theorem 5.1, any part of the joint pdf that depends on either \mu or \sigma^2 must be included in the g function. Define T_1(\mathbf{x})=\bar{x}, T_2(\mathbf{x})=s^2=\frac{1}{n-1}\sum_{i=1}^n(x_i-\bar{x})^2, and h(\mathbf{x})=1. Then
\begin{equation} f(\mathbf{x}|\mu,\sigma^2)=g(T_1(\mathbf{x}),T_2(\mathbf{x})|\mu,\sigma^2)h(\mathbf{x}) \tag{5.10} \end{equation}
where g(t_1,t_2|\mu,\sigma^2)=(2\pi\sigma^2)^{-n/2}\exp(-(n(t_1-\mu)^2+(n-1)t_2)/(2\sigma^2)). By the Factorization Theorem, T(\mathbf{X})=(\bar{X},S^2) is a sufficient statistic for (\mu,\sigma^2) in this normal model.

Theorem 5.2 Let X_1,\cdots,X_n be i.i.d. observations from a pdf or pmf f(x|\boldsymbol{\theta}) that belongs to an exponential family given by
\begin{equation} f(x|\boldsymbol{\theta})=h(x)c(\boldsymbol{\theta})\exp(\sum_{i=1}^k\omega_i(\boldsymbol{\theta})t_i(x)) \tag{5.11} \end{equation}
where \boldsymbol{\theta}=(\theta_1,\cdots,\theta_d), d\leq k. Then
\begin{equation} T(\mathbf{X})=(\sum_{j=1}^nt_1(X_j),\cdots,\sum_{j=1}^nt_k(X_j)) \tag{5.12} \end{equation}
is a sufficient statistic for \boldsymbol{\theta}. (This is Exercise 6.4 from Casella and Berger (2002).)
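For instance (a Python sketch with an arbitrary sample of my own choosing), writing the N(\mu,\sigma^2) density in the form (5.11) with t_1(x)=x and t_2(x)=x^2 gives T(\mathbf{X})=(\sum_jX_j,\sum_jX_j^2) by Theorem 5.2, and this statistic is a one-to-one function of the sufficient statistic (\bar{X},S^2) found in Example 5.3.

```python
import numpy as np

x = np.array([1.2, -0.7, 3.1, 0.4, 2.2])   # arbitrary illustrative sample
n = len(x)

t1, t2 = x.sum(), np.sum(x ** 2)            # T(X) from Theorem 5.2

# (xbar, s^2) from Example 5.3 is recovered from (t1, t2), and vice versa
xbar = t1 / n
s2 = (t2 - t1 ** 2 / n) / (n - 1)

assert np.isclose(xbar, x.mean())
assert np.isclose(s2, x.var(ddof=1))
```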
A sufficient statistic always exists, and it is not unique. For example, the complete sample \mathbf{X} is itself a sufficient statistic: factorize as
\begin{equation} f(\mathbf{x}|\theta)=f(T(\mathbf{x})|\theta)h(\mathbf{x}) \tag{5.13} \end{equation}
where T(\mathbf{x})=\mathbf{x} and h(\mathbf{x})=1.
Any one-to-one function of a sufficient statistic is also a sufficient statistic. Suppose T(\mathbf{X}) is a sufficient statistic, r is a bijection with inverse r^{-1}, and T^*(\mathbf{x})=r(T(\mathbf{x})). Then by the Factorization Theorem,
\begin{equation} f(\mathbf{x}|\theta)=g(T(\mathbf{x})|\theta)h(\mathbf{x})=g(r^{-1}(T^*(\mathbf{x}))|\theta)h(\mathbf{x}) \tag{5.14} \end{equation}
Defining g^*(t|\theta)=g(r^{-1}(t)|\theta), we obtain f(\mathbf{x}|\theta)=g^*(T^*(\mathbf{x})|\theta)h(\mathbf{x}), so T^*(\mathbf{X}) is a sufficient statistic.
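The construction of g^* in (5.14) can be made concrete with a small sketch (a Bernoulli example of my own, not from the lecture): T(\mathbf{x})=\sum_ix_i is sufficient for \theta, and T^*(\mathbf{x})=r(T(\mathbf{x}))=T(\mathbf{x})/n is sufficient as well, with g^*(t^*|\theta)=g(r^{-1}(t^*)|\theta)=g(nt^*|\theta).

```python
import numpy as np
from math import comb

x = np.array([1, 0, 1, 1, 0, 1])         # hypothetical Bernoulli sample
n = len(x)
t = int(x.sum())                          # T(x) = sum of the x_i
t_star = t / n                            # T*(x) = r(T(x)) with r(t) = t/n

def f(x, theta):
    """Joint Bernoulli(theta) pmf of the sample."""
    s = int(np.sum(x))
    return theta ** s * (1 - theta) ** (len(x) - s)

def g(t, theta):
    """pmf of T(X) = sum X_i, i.e. Binomial(n, theta)."""
    return comb(n, t) * theta ** t * (1 - theta) ** (n - t)

h = 1 / comb(n, t)                        # h(x), free of theta

for theta in [0.1, 0.5, 0.8]:
    g_star = g(int(round(n * t_star)), theta)    # g*(t*|theta) = g(r^{-1}(t*)|theta)
    assert np.isclose(f(x, theta), g_star * h)   # factorization with T* instead of T
```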
We are interested in finding a statistic that achieves the most data reduction while still retaining all the information about the parameter \theta.

A sufficient statistic can be thought of as inducing a partition of the sample space \mathcal{X}. Let \mathcal{T}=\{t:t=T(\mathbf{x}),\mathbf{x}\in\mathcal{X}\}; then T(\mathbf{x}) partitions the sample space into sets A_t, t\in\mathcal{T}, defined by A_t:=\{\mathbf{x}\in\mathcal{X}:T(\mathbf{x})=t\}. If \{B_{t^{\prime}}:t^{\prime}\in\mathcal{T}^{\prime}\} are the partition sets for another sufficient statistic T^{\prime}(\mathbf{x}) and \{A_t:t\in\mathcal{T}\} are the partition sets for a minimal sufficient statistic T(\mathbf{x}), then Definition 5.1 states that every B_{t^{\prime}} is a subset of some A_t. Thus, the partition associated with a minimal sufficient statistic is the coarsest possible partition for a sufficient statistic.
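The partition point of view is easy to visualize on a small discrete sample space. The sketch below (three Bernoulli trials, my own illustration) lists the sets A_t induced by T(\mathbf{x})=x_1+x_2+x_3; the identity statistic T^{\prime}(\mathbf{x})=\mathbf{x} would instead put every sample point in its own set, the finest possible partition.

```python
from itertools import product
from collections import defaultdict

# Partition of the 8-point sample space {0,1}^3 induced by T(x) = x_1 + x_2 + x_3.
partition = defaultdict(list)
for x in product([0, 1], repeat=3):
    partition[sum(x)].append(x)

for t, A_t in sorted(partition.items()):
    print(f"A_{t} = {A_t}")
# A_0 = [(0, 0, 0)], A_1 and A_2 each contain three points, A_3 = [(1, 1, 1)].
```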
The following criterion (from Casella and Berger (2002)) identifies a minimal sufficient statistic: suppose there exists a function T(\mathbf{x}) such that, for every two sample points \mathbf{x} and \mathbf{y}, the ratio f(\mathbf{x}|\theta)/f(\mathbf{y}|\theta) is constant as a function of \theta if and only if T(\mathbf{x})=T(\mathbf{y}); then T(\mathbf{X}) is a minimal sufficient statistic for \theta.

Proof. Assume f(\mathbf{x}|\theta)>0 for all \mathbf{x}\in\mathcal{X} and all \theta.
First, we show that T(\mathbf{X}) is a sufficient statistic. Let \mathcal{T}=\{t:t=T(\mathbf{x}),\mathbf{x}\in\mathcal{X}\} and define the partition sets induced by T(\mathbf{x}) as A_t:=\{\mathbf{x}\in\mathcal{X}:T(\mathbf{x})=t\}. For each A_t, choose and fix one element \mathbf{x}_t\in A_t. For any \mathbf{x}\in\mathcal{X}, \mathbf{x}_{T(\mathbf{x})} is the fixed element that is in the same set A_{T(\mathbf{x})} as \mathbf{x}, which implies T(\mathbf{x})=T(\mathbf{x}_{T(\mathbf{x})}) and hence, by the assumption of the theorem, f(\mathbf{x}|\theta)/f(\mathbf{x}_{T(\mathbf{x})}|\theta) is constant as a function of \theta. Thus, we can define a function on \mathcal{X} by h(\mathbf{x})=f(\mathbf{x}|\theta)/f(\mathbf{x}_{T(\mathbf{x})}|\theta), and h does not depend on \theta. Define a function on \mathcal{T} by g(t|\theta)=f(\mathbf{x}_t|\theta). Then
\begin{equation} f(\mathbf{x}|\theta)=\frac{f(\mathbf{x}_{T(\mathbf{x})}|\theta)f(\mathbf{x}|\theta)}{f(\mathbf{x}_{T(\mathbf{x})}|\theta)}=g(T(\mathbf{x})|\theta)h(\mathbf{x}) \tag{5.15} \end{equation}
and by Theorem 5.1, T(\mathbf{X}) is a sufficient statistic for \theta.
Now, to show that T(\mathbf{X}) is minimal, let T^{\prime}(\mathbf{X}) be any other sufficient statistic. By Theorem 5.1, there exist functions g^{\prime} and h^{\prime} such that f(\mathbf{x}|\theta)=g^{\prime}(T^{\prime}(\mathbf{x})|\theta)h^{\prime}(\mathbf{x}). Let \mathbf{x} and \mathbf{y} be any two sample points with T^{\prime}(\mathbf{x})=T^{\prime}(\mathbf{y}). Then
\begin{equation} \frac{f(\mathbf{x}|\theta)}{f(\mathbf{y}|\theta)}=\frac{g^{\prime}(T^{\prime}(\mathbf{x})|\theta)h^{\prime}(\mathbf{x})}{g^{\prime}(T^{\prime}(\mathbf{y})|\theta)h^{\prime}(\mathbf{y})}=\frac{h^{\prime}(\mathbf{x})}{h^{\prime}(\mathbf{y})} \tag{5.16} \end{equation}
Since this ratio does not depend on \theta, the assumption of the theorem implies that T(\mathbf{x})=T(\mathbf{y}). Thus, T(\mathbf{x}) is a function of T^{\prime}(\mathbf{x}), and by definition T(\mathbf{X}) is minimal.
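To see the ratio criterion in action, the following numerical sketch (the N(\mu,1) model and the particular samples are my own illustrative choices) checks that the log-likelihood ratio of two samples is constant in \mu exactly when their sample means agree, so \bar{X} is minimal sufficient for \mu in that model.

```python
import numpy as np

def log_joint(x, mu):
    """Log joint N(mu, 1) density of the sample."""
    return -0.5 * len(x) * np.log(2 * np.pi) - 0.5 * np.sum((x - mu) ** 2)

x = np.array([0.3, 1.9, -0.2, 2.0])     # mean 1.0
y = np.array([2.5, -1.5, 1.0, 2.0])     # mean 1.0: same xbar, different points
z = np.array([0.0, 0.5, 1.0, 1.5])      # mean 0.75: a different xbar

mus = np.linspace(-3, 3, 7)
ratio_xy = [log_joint(x, m) - log_joint(y, m) for m in mus]   # constant in mu
ratio_xz = [log_joint(x, m) - log_joint(z, m) for m in mus]   # varies with mu

assert np.allclose(ratio_xy, ratio_xy[0])
assert not np.allclose(ratio_xz, ratio_xz[0])
```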
References

Casella, George, and Roger Berger. 2002. Statistical Inference. 2nd ed. Belmont, CA: Duxbury Resource Center.