3.4 Sufficient statistics

As we will see below, a statistic is sufficient for a parameter \(\theta\) if it collects all the useful information of the sample about \(\theta.\) In other words: if for a given sample we know the value of a sufficient statistic, then the sample does not offer any additional information about \(\theta\) and we could safely destroy it.

The following example motivates the definition of sufficiency.

Example 3.19 Assume that an experiment with two possible results, success and failure, is repeated \(n\) times, in such a way that \((X_1,\ldots,X_n)\) is a srs of a \(\mathrm{Ber}(p).\) If we compute the value of the statistic \(Y=\sum_{i=1}^n X_i\sim\mathrm{Bin}(n,p),\) does the sample provide any information about \(p\) in addition to that contained in the observed value of \(Y\)?

We can answer this question by computing the conditional probability of the sample given the observed value of \(Y\):

\[\begin{align*} \mathbb{P}\left(X_1=x_1,\ldots, X_n=x_n|Y=y\right)= \begin{cases} \frac{\mathbb{P}(X_1=x_1,\ldots, X_n=x_n)}{\mathbb{P}(Y=y)} & \text{if} \ \sum_{i=1}^n x_i=y,\\ 0 & \text{if} \ \sum_{i=1}^n x_i\neq y. \end{cases} \end{align*}\]

For the case \(\sum_{i=1}^n x_i=y,\) the above probability is given by

\[\begin{align*} \mathbb{P}\left(X_1=x_1,\ldots, X_n=x_n|Y=y\right) = \frac{p^y(1-p)^{n-y}} {\binom{n}{y}p^y(1-p)^{n-y}}=\frac{1}{\binom{n}{y}}, \end{align*}\]

which does not depend on \(p.\) This means that, once the number of total successes \(Y\) is known, there is no useful information left in the sample about the probability of success \(p.\) In this case, the information that remains in the sample is only about the order of appearance of the successes, which is superfluous for estimating \(p\) because trials are independent.
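The conclusion of Example 3.19 can be checked numerically. The following is a minimal sketch, not part of the original example: the function name `conditional_prob` and the sample `x` are hypothetical choices made for illustration. It obtains \(\mathbb{P}(Y=y)\) by brute-force enumeration of all binary sequences and compares the conditional probability with \(1/\binom{n}{y}\) for several values of \(p.\)

```python
from itertools import product
from math import comb, isclose


def conditional_prob(x, p):
    """P(X_1=x_1,...,X_n=x_n | Y=y) for Bernoulli(p) trials, with y = sum(x)."""
    n, y = len(x), sum(x)
    # Joint probability of the observed sequence x.
    joint = p**y * (1 - p) ** (n - y)
    # P(Y=y) by brute force: add the joint probabilities of every binary
    # sequence of length n that has exactly y successes.
    prob_y = sum(
        p ** sum(z) * (1 - p) ** (n - sum(z))
        for z in product((0, 1), repeat=n)
        if sum(z) == y
    )
    return joint / prob_y


x = (1, 0, 1, 1, 0)  # hypothetical sample with n = 5 and y = 3
results = {p: conditional_prob(x, p) for p in (0.1, 0.5, 0.9)}
for p, value in results.items():
    print(f"p = {p}: conditional probability = {value:.6f}")
print(f"1 / C(5, 3) = {1 / comb(5, 3):.6f}")
```

Whatever the value of \(p,\) the conditional probability should coincide with \(1/\binom{5}{3},\) in agreement with the computation above.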

Definition 3.8 (Sufficient statistic) A statistic \(T=T(X_1,\ldots,X_n)\) is sufficient for \(\theta\) if the distribution of the sample given \(T,\) that is, the distribution of \((X_1,\ldots,X_n)|T,\) does not depend on \(\theta.\)

Remark. Observe that the previous definition implies that if \(T\) is a sufficient statistic for a parameter \(\theta,\) then \(T\) is also sufficient for any parameter \(g(\theta)\) that is a function of \(\theta.\)36

Given a realization of a srs \((X_1,\ldots,X_n)\) of a rv \(X\) with distribution \(F(\cdot;\theta)\) depending on an unknown parameter \(\theta,\) the likelihood of a value of \(\theta\) represents the credibility that the sample gives to that value of \(\theta.\) The likelihood is one of the most important concepts in statistical inference: it is the lifeblood of many inferential tools with excellent properties. The likelihood is defined through the joint pmf for discrete rv’s or through the joint pdf for continuous rv’s.

Definition 3.9 (Likelihood) Let \((X_1,\ldots,X_n)\) be a srs of a rv \(X\sim F(\cdot;\theta).\) Let \(x_1,\ldots,x_n\) be a realization of the srs. If the rv’s are discrete, the likelihood of \(\theta\) for \((x_1,\ldots,x_n)\) is defined as the joint pmf of the srs evaluated at \((x_1,\ldots,x_n)\):

\[\begin{align*} \mathcal{L}(\theta;x_1,\ldots,x_n):=\mathbb{P}(X_1=x_1,\ldots,X_n=x_n;\theta)=p_{(X_1,\ldots,X_n)}(x_1,\ldots,x_n;\theta). \end{align*}\]

If the rv’s are continuous, the likelihood is defined as the joint pdf of the srs evaluated at \((x_1,\ldots,x_n)\):

\[\begin{align*} \mathcal{L}(\theta;x_1,\ldots,x_n):=f_{(X_1,\ldots,X_n)}(x_1,\ldots,x_n;\theta). \end{align*}\]

\(\mathcal{L}(\theta;x_1,\ldots,x_n)\) is usually regarded as a function of the parameter \(\theta,\) since the realization of the sample \((x_1,\ldots,x_n)\) is fixed. In the situation in which the rv’s are independent, the likelihood is

\[\begin{align*} \mathcal{L}(\theta;x_1,\ldots,x_n)=\begin{cases} \prod_{i=1}^n p_{X_i}(x_i;\theta) & \text{if the rv's are discrete}, \\ \prod_{i=1}^n f_{X_i}(x_i;\theta) & \text{if the rv's are continuous}. \end{cases} \end{align*}\]
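To make the idea of the likelihood as a function of the parameter concrete, here is a small sketch for independent \(\mathrm{Ber}(p)\) observations. The sample, the grid of values of \(p,\) and the function names are assumptions made only for illustration.

```python
from math import isclose, prod


def bernoulli_pmf(x, p):
    """pmf of a Ber(p) rv evaluated at x in {0, 1}."""
    return p if x == 1 else 1 - p


def likelihood(p, sample):
    """Likelihood of p: product of the pmf's, since the rv's are independent."""
    return prod(bernoulli_pmf(x, p) for x in sample)


sample = (1, 0, 1, 1, 0)  # hypothetical fixed realization of the srs
grid = [0.2, 0.4, 0.6, 0.8]  # candidate values of p
values = [likelihood(p, sample) for p in grid]
for p, value in zip(grid, values):
    print(f"L({p}; sample) = {value:.5f}")
```

The sample stays fixed while \(p\) varies. On this grid, the largest likelihood is attained at \(p=0.6,\) which is the proportion of successes in the hypothetical sample.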

The following theorem gives a simple method for checking whether a statistic is sufficient in terms of the likelihood.

Theorem 3.5 (Factorization criterion) Let \((X_1,\ldots,X_n)\) be a srs of a rv \(X\sim F(\cdot;\theta).\) The statistic \(T=T(X_1,\ldots,X_n)\) is sufficient for \(\theta\) if and only if the likelihood can be factorized into two non-negative functions of the form

\[\begin{align*} \mathcal{L}(\theta;x_1,\ldots,x_n)=g(t,\theta)h(x_1,\ldots,x_n), \end{align*}\]

where \(g(t,\theta)\) only depends on the sample through \(t=T(x_1,\ldots,x_n)\) and \(h(x_1,\ldots,x_n)\) does not depend on \(\theta.\)

Proof (Proof of Theorem 3.5). We only prove the result for discrete rv’s.

Proof of “\(\Longrightarrow\)”. Let \(t=T(x_1,\ldots,x_n)\) be the observed value of the statistic for the sample \((x_1,\ldots,x_n).\) Since \(T\) is sufficient, \(\mathbb{P}(X_1=x_1,\ldots,X_n=x_n|T=t)\) does not depend on \(\theta\) and therefore the likelihood can be factorized as37

\[\begin{align*} \mathcal{L}(\theta;x_1,\ldots,x_n) &=\mathbb{P}(X_1=x_1,\ldots,X_n=x_n;\theta)\\ &=\mathbb{P}\left(\{X_1=x_1,\ldots,X_n=x_n\} \cap\{T=t\};\theta\right) \\ &=p(T=t;\theta)\mathbb{P}(X_1=x_1,\ldots,X_n=x_n|T=t), \end{align*}\]

which agrees with the desired factorization just by taking

\[\begin{align*} g(t,\theta)=p(T=t;\theta), \quad h(x_1,\ldots,x_n)=\mathbb{P}(X_1=x_1,\ldots,X_n=x_n|T=t). \end{align*}\]

Proof of “\(\Longleftarrow\)”. Assume now that the factorization

\[\begin{align*} \mathcal{L}(\theta;x_1,\ldots,x_n)=\mathbb{P}(X_1=x_1,\ldots,X_n=x_n;\theta)=g(t,\theta)h(x_1,\ldots,x_n) \end{align*}\]

holds. We define the set

\[\begin{align*} A_t=\left\{(x_1,\ldots,x_n)\in\mathbb{R}^n:T(x_1,\ldots,x_n)=t\right\}. \end{align*}\]

Then, summing the factorized likelihood over \(A_t,\)

\[\begin{align*} p(T=t;\theta)&=\sum_{(x_1,\ldots,x_n)\in A_t} \mathbb{P}(X_1=x_1,\ldots,X_n=x_n;\theta)\\ &=g(t,\theta)\sum_{(x_1,\ldots,x_n)\in A_t}h(x_1,\ldots,x_n), \end{align*}\]

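and therefore, whenever \(p(T=t;\theta)>0,\) the factor \(g(t,\theta)\) cancels out in the quotient \(\mathbb{P}(X_1=x_1,\ldots,X_n=x_n;\theta)/p(T=t;\theta)\):

\[\begin{align*} \mathbb{P}(X_1=x_1,\ldots,X_n=x_n|T=t;\theta)=\begin{cases} \frac{h(x_1,\ldots,x_n)}{\sum_{(x_1',\ldots,x_n')\in A_t}h(x_1',\ldots,x_n')} & \text{if} \ T(x_1,\ldots,x_n)=t,\\ 0 & \text{if} \ T(x_1,\ldots,x_n)\neq t. \end{cases} \end{align*}\]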

Since \(h(x_1,\ldots,x_n)\) does not depend on \(\theta,\) the conditional distribution of \((X_1,\ldots,X_n)\) given \(T\) does not depend on \(\theta\) either. Therefore, \(T\) is sufficient.

Example 3.20 In Example 3.19, prove that \(T=\sum_{i=1}^n X_i\) is sufficient for \(p\) using the factorization criterion.

The likelihood is

\[\begin{align*} \mathcal{L}(p;x_1,\ldots,x_n)=\prod_{i=1}^n p(x_i;p)=\prod_{i=1}^n p^{x_i}(1-p)^{1-x_i}=p^{\sum_{i=1}^n x_i} (1-p)^{n-\sum_{i=1}^n x_i}=g(t,p) \end{align*}\]

with \(t=\sum_{i=1}^n x_i\) and \(h(x_1,\ldots,x_n)=1.\) Therefore, \(T=\sum_{i=1}^n X_i\) is sufficient for \(p.\)

Example 3.21 Let \((X_1,\ldots,X_n)\) be a srs of a rv \(X\sim\mathrm{Exp}(1/\alpha).\) Let us see that \(\bar{X}\) is sufficient for \(\alpha.\)

The likelihood is

\[\begin{align*} \mathcal{L}(\alpha;x_1,\ldots,x_n)=\prod_{i=1}^n f(x_i;\alpha)=\prod_{i=1}^n \frac{e^{-x_i/\alpha}}{\alpha}=\frac{e^{-\sum_{i=1}^n x_i/\alpha}}{\alpha^n}=\frac{e^{-n\bar{x}/\alpha}}{\alpha^n}=g(t,\alpha) \end{align*}\]

with \(h(x_1,\ldots,x_n)=1.\) Then, \(g(t,\alpha)\) depends on the sample through \(t=\bar{x}.\) Therefore, \(T=\bar{X}\) is sufficient for \(\alpha.\)
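A quick numerical illustration of Example 3.21 is sketched next: two different samples with the same mean should give the same likelihood for every \(\alpha.\) The two samples and the function name `exp_likelihood` are hypothetical and chosen only for this check.

```python
from math import exp, isclose, prod


def exp_likelihood(alpha, sample):
    """Likelihood of alpha for a srs of Exp(1/alpha), i.e., pdf e^{-x/alpha}/alpha."""
    return prod(exp(-x / alpha) / alpha for x in sample)


# Two hypothetical samples of size 3 sharing the same sample mean (2.0).
sample_a = (0.5, 1.5, 4.0)
sample_b = (2.0, 2.0, 2.0)

alphas = (0.5, 1.0, 2.0, 5.0)
pairs = [(exp_likelihood(a, sample_a), exp_likelihood(a, sample_b)) for a in alphas]
for alpha, (la, lb) in zip(alphas, pairs):
    print(f"alpha = {alpha}: L_a = {la:.6e}, L_b = {lb:.6e}")
```

Since the likelihood only involves the sample through \(\bar{x},\) both columns should agree up to floating-point rounding.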

Example 3.22 Let \((X_1,\ldots,X_n)\) be a srs of a \(\mathcal{U}(\theta_1,\theta_2)\) rv with \(\theta_1<\theta_2.\) Let us find a sufficient statistic for \((\theta_1,\theta_2).\)

The likelihood is

\[\begin{align*} \mathcal{L}(\theta_1,\theta_2;x_1,\ldots,x_n)=\frac{1}{(\theta_2-\theta_1)^n}, \quad \theta_1<x_1,\ldots,x_n<\theta_2. \end{align*}\]

Rewriting the likelihood in terms of indicator functions, we get the factorization

\[\begin{align*} \mathcal{L}(\theta_1,\theta_2;x_1,\ldots,x_n)=\frac{1}{(\theta_2-\theta_1)^n}\, 1_{\{x_{(1)}>\theta_1\}}1_{\{x_{(n)}<\theta_2\}}=g(t,\theta_1,\theta_2) \end{align*}\]

by taking \(h(x_1,\ldots,x_n)=1.\) Since \(g(t,\theta_1,\theta_2)\) depends on the sample through \((x_{(1)},x_{(n)}),\) then the statistic

\[\begin{align*} T=(X_{(1)},X_{(n)}) \end{align*}\]

is sufficient for \((\theta_1,\theta_2).\)

However, if \(\theta_1\) were known, then the factorization would be

\[\begin{align*} g(t,\theta_2)=\frac{1}{(\theta_2-\theta_1)^n}\, 1_{\{x_{(n)}<\theta_2\}}, \ h(x_1,\ldots,x_n)=1_{\{x_{(1)}>\theta_1\}} \end{align*}\]

and in this case \(T(X_1,\ldots,X_n)=X_{(n)}\) is a sufficient statistic for \(\theta_2.\)
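The factorization of Example 3.22 can also be verified numerically. In the sketch below, the likelihood is computed twice: once as the product of the individual uniform pdf's, and once using only \(n,\) \(x_{(1)},\) and \(x_{(n)}.\) The sample, the parameter pairs, and the function names are hypothetical.

```python
from math import isclose, prod


def uniform_pdf(x, theta1, theta2):
    """pdf of a U(theta1, theta2) rv evaluated at x."""
    return 1.0 / (theta2 - theta1) if theta1 < x < theta2 else 0.0


def likelihood_full(theta1, theta2, sample):
    """Likelihood computed from every observation in the sample."""
    return prod(uniform_pdf(x, theta1, theta2) for x in sample)


def likelihood_from_t(theta1, theta2, n, x_min, x_max):
    """Likelihood computed only from n and T = (x_(1), x_(n))."""
    inside = theta1 < x_min and x_max < theta2
    return 1.0 / (theta2 - theta1) ** n if inside else 0.0


sample = (2.3, 4.1, 3.0, 2.9)  # hypothetical data
n, x_min, x_max = len(sample), min(sample), max(sample)

# Hypothetical parameter pairs; the last two exclude part of the sample.
params = [(0.0, 5.0), (2.0, 4.5), (2.5, 5.0), (1.0, 4.0)]
checks = [
    (likelihood_full(t1, t2, sample), likelihood_from_t(t1, t2, n, x_min, x_max))
    for t1, t2 in params
]
for (t1, t2), (full, reduced) in zip(params, checks):
    print(f"(theta1, theta2) = ({t1}, {t2}): {full:.6f} vs {reduced:.6f}")
```

Both computations should agree for every parameter pair, including those for which the likelihood is zero because some observation falls outside \((\theta_1,\theta_2).\)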

Example 3.23 Let \((X_1,\ldots,X_n)\) be a srs of a rv with \(\mathcal{N}(\mu,\sigma^2)\) distribution. Let us find sufficient statistics for:

  1. \(\sigma^2,\) if \(\mu\) is known.
  2. \(\mu,\) if \(\sigma^2\) is known.
  3. \(\mu\) and \(\sigma^2.\)

The likelihood of the sample is

\[\begin{align*} \mathcal{L}(\mu,\sigma^2;x_1,\ldots,x_n)=\frac{1}{(\sqrt{2\pi\sigma^2})^n} \exp\left\{-\frac{1}{2\sigma^2} \sum_{i=1}^n (x_i-\mu)^2\right\}. \end{align*}\]

We already have the required factorization for part 1 by taking \(h(x_1,\ldots,x_n)=1.\) Therefore, \(T=\sum_{i=1}^n (X_i-\mu)^2\) is sufficient for \(\sigma^2.\)

For part 2, adding and subtracting \(\bar{x}\) inside each squared term of the exponential, we obtain

\[\begin{align*} \sum_{i=1}^n (x_i-\mu)^2=\sum_{i=1}^n (x_i-\bar{x})^2+n(\bar{x}-\mu)^2, \end{align*}\]

since the cross term vanishes. Then, the likelihood can be factorized in the form of

\[\begin{align*} \mathcal{L}(\mu,\sigma^2;x_1,\ldots,x_n)=\frac{1}{(\sqrt{2\pi\sigma^2})^n} \exp\left\{-\frac{1}{2\sigma^2} \sum_{i=1}^n (x_i-\bar{x})^2\right\}\exp\left\{-\frac{n}{2\sigma^2} (\bar{x}-\mu)^2\right\}. \end{align*}\]

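Since \(\sigma^2\) is known, the second exponential plays the role of \(g(t,\mu)\) with \(t=\bar{x},\) and the remaining factor, which does not depend on \(\mu,\) plays the role of \(h(x_1,\ldots,x_n).\) Therefore, in this case \(T=\bar{X}\) is sufficient for \(\mu.\)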

Finally, a sufficient statistic for \((\mu,\sigma^2)\) follows from the same factorization:

\[\begin{align*} T=\left(\bar{X},\sum_{i=1}^n (X_i-\bar{X})^2\right), \end{align*}\]

or, equivalently, multiplying and dividing by \(n\) inside the first exponential so that \(S^2=\frac{1}{n}\sum_{i=1}^n (X_i-\bar{X})^2\) appears, it follows that

\[\begin{align*} T=(\bar{X},S^2) \end{align*}\]

is sufficient for \((\mu,\sigma^2).\)
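As a sanity check of Example 3.23, the sketch below evaluates the normal likelihood in two ways: from the whole sample, and from \(n\) and \(T=(\bar{x},s^2)\) only, using the decomposition \(\sum_{i=1}^n (x_i-\mu)^2=ns^2+n(\bar{x}-\mu)^2.\) It assumes that \(s^2\) is computed with divisor \(n,\) as the multiplication and division by \(n\) suggests; the data and the function names are hypothetical.

```python
from math import exp, isclose, pi, sqrt


def normal_likelihood_full(mu, sigma2, sample):
    """Likelihood computed from every observation in the sample."""
    n = len(sample)
    sum_sq = sum((x - mu) ** 2 for x in sample)
    return exp(-sum_sq / (2 * sigma2)) / sqrt(2 * pi * sigma2) ** n


def normal_likelihood_from_t(mu, sigma2, n, x_bar, s2):
    """Same likelihood computed only from n and T = (x_bar, s2)."""
    # sum (x_i - mu)^2 = n * s2 + n * (x_bar - mu)^2, with s2 using divisor n.
    sum_sq = n * s2 + n * (x_bar - mu) ** 2
    return exp(-sum_sq / (2 * sigma2)) / sqrt(2 * pi * sigma2) ** n


sample = (1.2, -0.4, 0.7, 2.1, 0.3)  # hypothetical data
n = len(sample)
x_bar = sum(sample) / n
s2 = sum((x - x_bar) ** 2 for x in sample) / n  # divisor n (assumption)

checks = [
    (
        normal_likelihood_full(mu, sigma2, sample),
        normal_likelihood_from_t(mu, sigma2, n, x_bar, s2),
    )
    for mu in (-1.0, 0.5, 2.0)
    for sigma2 in (0.5, 1.0, 4.0)
]
print(all(isclose(full, reduced) for full, reduced in checks))
```

If the two computations agree on the whole grid of \((\mu,\sigma^2)\) values, then the sample can be replaced by \((\bar{x},s^2)\) and \(n\) for the purpose of evaluating the likelihood.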

The previous example illustrates very well one of the main practical advantages of a sufficient statistic: it is only required to store \(T=(\bar{X},S^2),\) not the whole sample, to estimate \((\mu,\sigma^2).\) This is an immense advantage when dealing with big data: just two numbers, given by \(T=(\bar{X},S^2),\) can summarize all the relevant information of an arbitrarily large sample when the objective is to estimate the parameters of a normal distribution. Furthermore, \(T=(\bar{X},S^2)\) is easily computable in an online fashion that does not require reading the whole sample at once:

\[\begin{align*} \bar{X}_{n+1}&=\frac{1}{n+1}\sum_{i=1}^{n+1} X_i=\frac{1}{n+1}(n\bar{X}_n+X_{n+1}),\\ S^2_{n+1}&=\bar{X^2}_{n+1}-\bar{X}_{n+1}^2=\frac{1}{n+1}\left\{(n\bar{X^2}_n+X^2_{n+1})-\frac{1}{(n+1)}\left(n\bar{X}_n+X_{n+1}\right)^2\right\}. \end{align*}\]

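Hence, writing \(\bar{X^2}_n=\frac{1}{n}\sum_{i=1}^n X_i^2\) for the sample mean of the squares, \(\bar{X}_{n+1}\) and \(\bar{X^2}_{n+1}\) (and thus \(S^2_{n+1}\)) can be obtained from the new observation \(X_{n+1}\) together with the previously computed \(\bar{X}_n\) and \(\bar{X^2}_n.\) We do not need to store the whole sample!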
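The online updates above can be sketched in a few lines. The class name `RunningMoments` and the data stream are hypothetical; the sketch stores only \(n,\) the running mean, and the running mean of squares, and recovers \(S^2\) (with divisor \(n\)) as their combination.

```python
from math import isclose


class RunningMoments:
    """Keeps n, the sample mean, and the mean of squares; never the sample."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.mean_sq = 0.0

    def update(self, x):
        """Incorporate one new observation x using the recursive formulas."""
        self.mean = (self.n * self.mean + x) / (self.n + 1)
        self.mean_sq = (self.n * self.mean_sq + x * x) / (self.n + 1)
        self.n += 1

    @property
    def s2(self):
        """Sample variance with divisor n: mean of squares minus squared mean."""
        return self.mean_sq - self.mean**2


stream = (1.2, -0.4, 0.7, 2.1, 0.3)  # hypothetical observations arriving one by one
moments = RunningMoments()
for x in stream:
    moments.update(x)

# Batch versions, computed with the whole sample in memory, for comparison.
batch_mean = sum(stream) / len(stream)
batch_s2 = sum((x - batch_mean) ** 2 for x in stream) / len(stream)
print(moments.mean, moments.s2)
print(batch_mean, batch_s2)
```

One caveat on this design: subtracting the squared mean from the mean of squares can lose precision in floating-point arithmetic when the mean is large relative to the standard deviation. Numerically stabler updates exist (for example, Welford's algorithm), but the version above mirrors the formulas in the text.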

  36. If the conditional distribution of the sample given \(T\) does not depend on \(\theta,\) then neither does it depend on \(g(\theta).\)

  37. Note that \(\{X_1=x_1,\ldots,X_n=x_n\}=\{X_1=x_1,\ldots,X_n=x_n\} \cap\{T=t\}\) because \(t=T(x_1,\ldots,x_n).\) The intersection does not bring a restriction, but it is useful to make it explicit.