Chapter 5 Sufficiency Principle, Minimal Sufficient Statistics (Lecture on 01/16/2020)
In practice, we usually find a sufficient statistic by simple inspection of the pdf or pmf of the sample; the Factorization Theorem (Theorem 5.1) is what justifies this inspection.
Proof. Suppose \(T(\mathbf{X})\) is a sufficient statistic. Choose \(g(t|\theta)=P_{\theta}(T(\mathbf{X})=t)\) and \(h(\mathbf{x})=P(\mathbf{X}=\mathbf{x}|T(\mathbf{X})=T(\mathbf{x}))\). Because \(T(\mathbf{X})\) is sufficient, the conditional probability defining \(h(\mathbf{x})\) does not depend on \(\theta\). Thus, the choice of \(h(\mathbf{x})\) and \(g(t|\theta)\) is legitimate. For this choice, \[\begin{equation} \begin{split} f(\mathbf{x}|\theta)&=P_{\theta}(\mathbf{X}=\mathbf{x})\\ &=P_{\theta}(\mathbf{X}=\mathbf{x}\text{ and }T(\mathbf{X})=T(\mathbf{x}))\\ &=P_{\theta}(T(\mathbf{X})=T(\mathbf{x}))P(\mathbf{X}=\mathbf{x}|T(\mathbf{X})=T(\mathbf{x}))\\ &=g(T(\mathbf{x})|\theta)h(\mathbf{x}) \end{split} \tag{5.2} \end{equation}\] The second equality holds because the event \(\{\mathbf{X}=\mathbf{x}\}\) is contained in \(\{T(\mathbf{X})=T(\mathbf{x})\}\). So the factorization (5.1) has been exhibited. Also, \(P_{\theta}(T(\mathbf{X})=T(\mathbf{x}))=g(T(\mathbf{x})|\theta)\), so \(g(t|\theta)\) is the pmf of \(T(\mathbf{X})\).
Now assume the factorization (5.1) exists. Let \(q(t|\theta)\) be the pmf of \(T(\mathbf{X})\). To show that \(T(\mathbf{X})\) is sufficient, we examine the ratio \(f(\mathbf{x}|\theta)/q(T(\mathbf{x})|\theta)\). Define \(A_{T(\mathbf{x})}=\{\mathbf{y}:T(\mathbf{y})=T(\mathbf{x})\}\). Then \[\begin{equation} \begin{split} \frac{f(\mathbf{x}|\theta)}{q(T(\mathbf{x})|\theta)}&=\frac{g(T(\mathbf{x})|\theta)h(\mathbf{x})}{q(T(\mathbf{x})|\theta)}\\ &=\frac{g(T(\mathbf{x})|\theta)h(\mathbf{x})}{\sum_{\mathbf{y}\in A_{T(\mathbf{x})}}g(T(\mathbf{y})|\theta)h(\mathbf{y})}\\ &=\frac{g(T(\mathbf{x})|\theta)h(\mathbf{x})}{g(T(\mathbf{x})|\theta)\sum_{\mathbf{y}\in A_{T(\mathbf{x})}}h(\mathbf{y})}\\ &=\frac{h(\mathbf{x})}{\sum_{\mathbf{y}\in A_{T(\mathbf{x})}}h(\mathbf{y})} \end{split} \tag{5.3} \end{equation}\] Since the ratio does not depend on \(\theta\), by Theorem 3.4, \(T(\mathbf{X})\) is a sufficient statistic for \(\theta\).

In words, the Factorization Theorem says we factor the joint pdf or pmf into two parts. One part, the function \(h(\mathbf{x})\), does not depend on \(\theta\). The other part depends on \(\theta\), and it depends on the sample \(\mathbf{x}\) only through some function \(T(\mathbf{x})\); this \(T(\mathbf{X})\) is a sufficient statistic for \(\theta\).
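Sufficiency in the discrete case is easy to check numerically. The following sketch (my own, not from the lecture) takes i.i.d. Bernoulli(\(\theta\)) observations with \(T(\mathbf{x})=\sum_ix_i\) and computes the conditional probability \(P_{\theta}(\mathbf{X}=\mathbf{x}|T(\mathbf{X})=T(\mathbf{x}))\) used to define \(h(\mathbf{x})\) in the proof, confirming that it is free of \(\theta\):

```python
from itertools import product

def joint_pmf(x, theta):
    """P_theta(X = x) for i.i.d. Bernoulli(theta) observations."""
    s = sum(x)
    return theta ** s * (1 - theta) ** (len(x) - s)

def cond_prob_given_t(x, theta):
    """P_theta(X = x | T(X) = T(x)) with T(x) = sum(x)."""
    t = sum(x)
    # P_theta(T(X) = t): sum the joint pmf over all y with T(y) = t
    p_t = sum(joint_pmf(y, theta)
              for y in product([0, 1], repeat=len(x))
              if sum(y) == t)
    return joint_pmf(x, theta) / p_t

x = (1, 0, 1, 1, 0)
for theta in (0.2, 0.5, 0.9):
    # prints 0.1 = 1/C(5,3) for every theta
    print(theta, cond_prob_given_t(x, theta))
```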
Example 5.1 (Normal Sufficient Statistic for the Mean) For the normal model \(N(\mu,\sigma^2)\) with \(\sigma^2\) known, the identity \(\sum_{i=1}^n(x_i-\mu)^2=\sum_{i=1}^n(x_i-\bar{x})^2+n(\bar{x}-\mu)^2\) lets us factor the joint pdf as \[\begin{equation} f(\mathbf{x}|\mu)=(2\pi\sigma^2)^{-n/2}\exp\left(-\sum_{i=1}^n(x_i-\bar{x})^2/(2\sigma^2)\right)\exp\left(-n(\bar{x}-\mu)^2/(2\sigma^2)\right) \tag{5.4} \end{equation}\]
Therefore, we can define \(h(\mathbf{x})=(2\pi\sigma^2)^{-n/2}\exp\left(-\sum_{i=1}^n(x_i-\bar{x})^2/(2\sigma^2)\right)\), which does not depend on the unknown parameter \(\mu\). By (5.4), the factor that does depend on \(\mu\) depends on the sample \(\mathbf{x}\) only through the function \(T(\mathbf{x})=\bar{x}\). Therefore, \(T(\mathbf{X})=\bar{X}\) is a sufficient statistic for \(\mu\).
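As a quick numerical sanity check of (5.4) (my own sketch, with an arbitrary sample and a known \(\sigma^2\) chosen for illustration), the joint pdf computed directly agrees with the product \(g(\bar{x}|\mu)h(\mathbf{x})\) for every \(\mu\):

```python
import math

def joint_pdf(x, mu, sigma2):
    """Joint pdf of i.i.d. N(mu, sigma2) observations, computed directly."""
    n = len(x)
    return (2 * math.pi * sigma2) ** (-n / 2) * math.exp(
        -sum((xi - mu) ** 2 for xi in x) / (2 * sigma2))

def factored_pdf(x, mu, sigma2):
    """The factorization (5.4): h(x) * g(xbar | mu)."""
    n = len(x)
    xbar = sum(x) / n
    h = (2 * math.pi * sigma2) ** (-n / 2) * math.exp(
        -sum((xi - xbar) ** 2 for xi in x) / (2 * sigma2))
    g = math.exp(-n * (xbar - mu) ** 2 / (2 * sigma2))
    return g * h

x, sigma2 = [1.3, -0.4, 2.1, 0.7], 1.5
for mu in (-1.0, 0.0, 2.5):
    print(joint_pdf(x, mu, sigma2), factored_pdf(x, mu, sigma2))  # equal
```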
Example 5.2 (Uniform Sufficient Statistic) Let \(X_1,\cdots,X_n\) be i.i.d. observations from the discrete uniform distribution on \(1,\cdots,\theta\), where the parameter \(\theta\) is a positive integer. The pmf of \(X_i\) is \[\begin{equation} f(x|\theta)=\left\{ \begin{aligned} &\frac{1}{\theta} &\quad x=1,2,\cdots,\theta \\ & 0 & \quad \text{otherwise} \end{aligned} \right. \tag{5.5} \end{equation}\]

Thus, the joint pmf of \(X_1,\cdots,X_n\) is \[\begin{equation} f(\mathbf{x}|\theta)=\left\{ \begin{aligned} &\theta^{-n} &\quad x_i\in\{1,\cdots,\theta\}\text{ for }i=1,\cdots,n \\ & 0 & \quad \text{otherwise} \end{aligned} \right. \tag{5.6} \end{equation}\]
Let \(\mathbb{N}=\{1,2,\cdots\}\) be the set of positive integers and let \(\mathbb{N}_{\theta}=\{1,2,\cdots,\theta\}\). Then the joint pmf of \(X_1,\cdots,X_n\) is \[\begin{equation} f(\mathbf{x}|\theta)=\prod_{i=1}^n\theta^{-1}I_{\mathbb{N}_{\theta}}(x_i)=\theta^{-n}\prod_{i=1}^nI_{\mathbb{N}_{\theta}}(x_i) \tag{5.7} \end{equation}\]
Define \(T(\mathbf{x})=\max_{i}x_i\). Since the \(x_i\) are positive integers, \(x_i\le\theta\) for all \(i\) if and only if \(\max_ix_i\le\theta\), so \[\begin{equation} \prod_{i=1}^nI_{\mathbb{N}_{\theta}}(x_i)=\left(\prod_{i=1}^nI_{\mathbb{N}}(x_i)\right)I_{\mathbb{N}_{\theta}}(T(\mathbf{x})) \tag{5.8} \end{equation}\]
Thus, we have the factorization \[\begin{equation} f(\mathbf{x}|\theta)=\theta^{-n}I_{\mathbb{N}_{\theta}}(T(\mathbf{x}))\left(\prod_{i=1}^nI_{\mathbb{N}}(x_i)\right) \tag{5.9} \end{equation}\] The first factor depends on \(\mathbf{x}\) only through \(T(\mathbf{x})\), and the second factor does not depend on \(\theta\). Thus, \(T(\mathbf{X})=\max_{i}X_i\) is a sufficient statistic for \(\theta\).
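The factorization (5.9) can be verified by brute force as well. A minimal sketch (my own, with hand-picked samples and values of \(\theta\)) comparing the joint pmf (5.6) with the factored form:

```python
def joint_pmf(x, theta):
    """Joint pmf (5.6): theta^(-n) if every x_i lies in {1, ..., theta}."""
    if all(xi in range(1, theta + 1) for xi in x):
        return theta ** (-len(x))
    return 0.0

def factored_pmf(x, theta):
    """Factorization (5.9) with T(x) = max(x)."""
    in_N = all(isinstance(xi, int) and xi >= 1 for xi in x)  # indicators I_N(x_i)
    return theta ** (-len(x)) * (max(x) <= theta) * in_N

for theta in (3, 5, 10):
    for x in [(1, 2, 3), (4, 4, 1), (2, 2, 2)]:
        assert abs(joint_pmf(x, theta) - factored_pmf(x, theta)) < 1e-12
print("factorization (5.9) verified")
```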
Example 5.3 (Normal Sufficient Statistic for Both Parameters) Assume \(X_1,\cdots,X_n\) are i.i.d. \(N(\mu,\sigma^2)\) with both parameters unknown. When using Theorem 5.1, any part of the joint pdf that depends on either \(\mu\) or \(\sigma^2\) must be included in the \(g\) function. Define \(T_1(\mathbf{x})=\bar{x}\), \(T_2(\mathbf{x})=s^2=\frac{1}{n-1}\sum_{i=1}^n(x_i-\bar{x})^2\), and \(h(\mathbf{x})=1\). Then \[\begin{equation} f(\mathbf{x}|\mu,\sigma^2)=g(T_1(\mathbf{x}),T_2(\mathbf{x})|\mu,\sigma^2)h(\mathbf{x}) \tag{5.10} \end{equation}\] where \(g(t_1,t_2|\mu,\sigma^2)=(2\pi\sigma^2)^{-n/2}\exp\left(-(n(t_1-\mu)^2+(n-1)t_2)/(2\sigma^2)\right)\).

By the Factorization Theorem, \(T(\mathbf{X})=(\bar{X},S^2)\) is a sufficient statistic for \((\mu,\sigma^2)\) in this normal model.
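Because \(h(\mathbf{x})=1\) here, (5.10) asserts that the joint pdf equals \(g(\bar{x},s^2|\mu,\sigma^2)\) exactly; the underlying algebraic identity is \(\sum_{i=1}^n(x_i-\mu)^2=n(\bar{x}-\mu)^2+(n-1)s^2\). A short check with an arbitrary sample (my own sketch):

```python
import math

def joint_pdf(x, mu, sigma2):
    """Joint pdf of i.i.d. N(mu, sigma2) observations, computed directly."""
    n = len(x)
    return (2 * math.pi * sigma2) ** (-n / 2) * math.exp(
        -sum((xi - mu) ** 2 for xi in x) / (2 * sigma2))

def g(t1, t2, mu, sigma2, n):
    """g(t1, t2 | mu, sigma^2) from Example 5.3, with t1 = xbar, t2 = s^2."""
    return (2 * math.pi * sigma2) ** (-n / 2) * math.exp(
        -(n * (t1 - mu) ** 2 + (n - 1) * t2) / (2 * sigma2))

x = [0.5, 1.9, -0.3, 2.2, 1.1]
n = len(x)
xbar = sum(x) / n
s2 = sum((xi - xbar) ** 2 for xi in x) / (n - 1)
for mu, sigma2 in [(0.0, 1.0), (1.0, 2.5), (-2.0, 0.3)]:
    print(joint_pdf(x, mu, sigma2), g(xbar, s2, mu, sigma2, n))  # equal
```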
Theorem 5.2 Let \(X_1,\cdots,X_n\) be i.i.d. observations from a pdf or pmf \(f(x|\boldsymbol{\theta})\) that belongs to an exponential family given by \[\begin{equation} f(x|\boldsymbol{\theta})=h(x)c(\boldsymbol{\theta})\exp\left(\sum_{i=1}^k\omega_i(\boldsymbol{\theta})t_i(x)\right) \tag{5.11} \end{equation}\] where \(\boldsymbol{\theta}=(\theta_1,\cdots,\theta_d)\), \(d\leq k\). Then \[\begin{equation} T(\mathbf{X})=\left(\sum_{j=1}^nt_1(X_j),\cdots,\sum_{j=1}^nt_k(X_j)\right) \tag{5.12} \end{equation}\] is a sufficient statistic for \(\boldsymbol{\theta}\). (This is Exercise 6.4 from Casella and Berger (2002).)
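For the \(N(\mu,\sigma^2)\) family, (5.11) holds with \(t_1(x)=x\) and \(t_2(x)=x^2\), so Theorem 5.2 yields \(T(\mathbf{X})=(\sum_jX_j,\sum_jX_j^2)\). The sketch below (my own) computes this statistic and maps it one-to-one onto the \((\bar{x},s^2)\) of Example 5.3, anticipating the remark on one-to-one functions that follows:

```python
def exp_family_T(x):
    """Sufficient statistic (5.12) for the normal family:
    t1(x) = x, t2(x) = x^2."""
    return sum(x), sum(xi ** 2 for xi in x)

def to_xbar_s2(t1, t2, n):
    """One-to-one map from (sum x_j, sum x_j^2) to (xbar, s^2)."""
    xbar = t1 / n
    s2 = (t2 - n * xbar ** 2) / (n - 1)
    return xbar, s2

x = [0.5, 1.9, -0.3, 2.2, 1.1]
t1, t2 = exp_family_T(x)
print(to_xbar_s2(t1, t2, len(x)))  # matches (xbar, s^2) computed directly
```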
A sufficient statistic always exists, and it is not unique. The complete sample \(\mathbf{X}\) is a sufficient statistic: factorize as \(f(\mathbf{x}|\theta)=g(T(\mathbf{x})|\theta)h(\mathbf{x})\), where \(T(\mathbf{x})=\mathbf{x}\), \(g(\mathbf{t}|\theta)=f(\mathbf{t}|\theta)\), and \(h(\mathbf{x})=1\).

Moreover, any one-to-one function of a sufficient statistic is a sufficient statistic. Suppose \(T(\mathbf{X})\) is a sufficient statistic, \(r\) is a bijection with inverse \(r^{-1}\), and \(T^*(\mathbf{x})=r(T(\mathbf{x}))\). Then by the Factorization Theorem, \[\begin{equation} f(\mathbf{x}|\theta)=g(T(\mathbf{x})|\theta)h(\mathbf{x})=g(r^{-1}(T^*(\mathbf{x}))|\theta)h(\mathbf{x}) \tag{5.14} \end{equation}\] Defining \(g^*(t|\theta)=g(r^{-1}(t)|\theta)\), we obtain \(f(\mathbf{x}|\theta)=g^*(T^*(\mathbf{x})|\theta)h(\mathbf{x})\), so \(T^*(\mathbf{X})\) is sufficient.

Since sufficient statistics are plentiful, we are interested in finding one that achieves the most data reduction while still retaining all the information about the parameter \(\theta\); such a statistic is called a minimal sufficient statistic.
A sufficient statistic can be thought of as inducing a partition of the sample space \(\mathcal{X}\). Let \(\mathcal{T}=\{t:t=T(\mathbf{x}),\mathbf{x}\in\mathcal{X}\}\); then \(T(\mathbf{x})\) partitions the sample space into sets \(A_t\), \(t\in\mathcal{T}\), defined by \(A_t:=\{\mathbf{x}\in\mathcal{X}:T(\mathbf{x})=t\}\). If \(T(\mathbf{X})\) is minimal sufficient, \(\{B_{t^{\prime}}:t^{\prime}\in\mathcal{T}^{\prime}\}\) are the partition sets for another sufficient statistic \(T^{\prime}(\mathbf{x})\), and \(\{A_t:t\in\mathcal{T}\}\) are the partition sets for \(T(\mathbf{x})\), then Definition 5.1 implies that every \(B_{t^{\prime}}\) is a subset of some \(A_t\). Thus, the partition associated with a minimal sufficient statistic is the coarsest possible partition for a sufficient statistic.
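A sketch (my own) makes the partitions concrete for \(n=3\) Bernoulli trials: \(T(\mathbf{x})=\sum_ix_i\) splits the 8 sample points into 4 sets \(A_t\); a one-to-one function of \(T\) induces exactly the same partition, consistent with (5.14); the complete sample induces the finest partition of 8 singletons.

```python
from itertools import product

def partition(T, sample_space):
    """Group the sample space into the sets A_t = {x : T(x) = t}."""
    sets = {}
    for x in sample_space:
        sets.setdefault(T(x), []).append(x)
    return sets

space = list(product([0, 1], repeat=3))
A = partition(sum, space)                       # T(x) = sum(x): 4 sets
B = partition(lambda x: 2 * sum(x) + 1, space)  # bijection of T: same sets
C = partition(lambda x: x, space)               # complete sample: 8 singletons
print(len(A), len(B), len(C))                   # 4 4 8
print(sorted(A.values()) == sorted(B.values())) # True: identical partitions
```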
Theorem 5.3 Let \(f(\mathbf{x}|\theta)\) be the pmf or pdf of a sample \(\mathbf{X}\). Suppose there exists a function \(T(\mathbf{x})\) such that, for every two sample points \(\mathbf{x}\) and \(\mathbf{y}\), the ratio \(f(\mathbf{x}|\theta)/f(\mathbf{y}|\theta)\) is constant as a function of \(\theta\) if and only if \(T(\mathbf{x})=T(\mathbf{y})\). Then \(T(\mathbf{X})\) is a minimal sufficient statistic for \(\theta\).

Proof. For simplicity, assume \(f(\mathbf{x}|\theta)>0\) for all \(\mathbf{x}\in\mathcal{X}\) and all \(\theta\).
First, we show \(T(\mathbf{X})\) is a sufficient statistic. Let \(\mathcal{T}=\{t:t=T(\mathbf{x}),\mathbf{x}\in\mathcal{X}\}\) and define the partition sets induced by \(T(\mathbf{x})\) as \(A_t:=\{\mathbf{x}\in\mathcal{X}:T(\mathbf{x})=t\}\). For each \(A_t\), choose and fix one element \(\mathbf{x}_t\in A_t\). For any \(\mathbf{x}\in\mathcal{X}\), \(\mathbf{x}_{T(\mathbf{x})}\) is the fixed element that is in the same set \(A_{T(\mathbf{x})}\) as \(\mathbf{x}\). This implies \(T(\mathbf{x})=T(\mathbf{x}_{T(\mathbf{x})})\), and hence, by the assumption of the theorem, \(f(\mathbf{x}|\theta)/f(\mathbf{x}_{T(\mathbf{x})}|\theta)\) is constant as a function of \(\theta\). Thus, we may define a function on \(\mathcal{X}\) by \(h(\mathbf{x})=f(\mathbf{x}|\theta)/f(\mathbf{x}_{T(\mathbf{x})}|\theta)\), and \(h\) does not depend on \(\theta\). Define a function on \(\mathcal{T}\) by \(g(t|\theta)=f(\mathbf{x}_t|\theta)\). Then \[\begin{equation} f(\mathbf{x}|\theta)=\frac{f(\mathbf{x}_{T(\mathbf{x})}|\theta)f(\mathbf{x}|\theta)}{f(\mathbf{x}_{T(\mathbf{x})}|\theta)}=g(T(\mathbf{x})|\theta)h(\mathbf{x}) \tag{5.15} \end{equation}\] and by Theorem 5.1, \(T(\mathbf{X})\) is a sufficient statistic for \(\theta\).
Now, to show that \(T(\mathbf{X})\) is minimal, let \(T^{\prime}(\mathbf{X})\) be any other sufficient statistic. By Theorem 5.1, there exist functions \(g^{\prime}\) and \(h^{\prime}\) such that \(f(\mathbf{x}|\theta)=g^{\prime}(T^{\prime}(\mathbf{x})|\theta)h^{\prime}(\mathbf{x})\). Let \(\mathbf{x}\) and \(\mathbf{y}\) be any two sample points with \(T^{\prime}(\mathbf{x})=T^{\prime}(\mathbf{y})\). Then \[\begin{equation} \frac{f(\mathbf{x}|\theta)}{f(\mathbf{y}|\theta)}=\frac{g^{\prime}(T^{\prime}(\mathbf{x})|\theta)h^{\prime}(\mathbf{x})}{g^{\prime}(T^{\prime}(\mathbf{y})|\theta)h^{\prime}(\mathbf{y})}=\frac{h^{\prime}(\mathbf{x})}{h^{\prime}(\mathbf{y})} \tag{5.16} \end{equation}\] Since this ratio does not depend on \(\theta\), the assumption of the theorem implies that \(T(\mathbf{x})=T(\mathbf{y})\). Hence \(T(\mathbf{x})\) is a function of \(T^{\prime}(\mathbf{x})\), and by Definition 5.1 we conclude that \(T(\mathbf{X})\) is minimal.
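The ratio criterion is convenient in practice. For \(X_1,\cdots,X_n\) i.i.d. \(N(\theta,1)\), \(\log\left(f(\mathbf{x}|\theta)/f(\mathbf{y}|\theta)\right)=(\sum_iy_i^2-\sum_ix_i^2)/2+n\theta(\bar{x}-\bar{y})\), which is constant in \(\theta\) exactly when \(\bar{x}=\bar{y}\); hence \(\bar{X}\) is minimal sufficient. A numerical sketch (my own, with hand-picked samples):

```python
def log_ratio(x, y, theta):
    """log of f(x|theta) / f(y|theta) for i.i.d. N(theta, 1) samples."""
    return (sum(-(xi - theta) ** 2 / 2 for xi in x)
            - sum(-(yi - theta) ** 2 / 2 for yi in y))

x = [0.0, 1.0, 2.0]   # xbar = 1
y = [1.5, 1.5, 0.0]   # ybar = 1: same mean, ratio constant in theta
z = [1.0, 1.0, 2.0]   # zbar = 4/3: ratio varies with theta
for theta in (-1.0, 0.0, 3.0):
    print(theta, log_ratio(x, y, theta), log_ratio(x, z, theta))
```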
References

Casella, George, and Roger Berger. 2002. Statistical Inference. 2nd ed. Belmont, CA: Duxbury Resource Center.