Chapter 7 Likelihood Principle (Lecture on 01/23/2020)

The main consideration in this chapter is that, if certain other principles are accepted, the likelihood function must be used as a data reduction device.

Definition 7.1 (likelihood function) Let \(f(\mathbf{x}|\theta)\) denote the joint pdf or pmf of the sample \(\mathbf{X}=(\mathbf{X}_1,\cdots,\mathbf{X}_n)\). Then given that \(\mathbf{X}=\mathbf{x}\) is observed, the function of \(\theta\) defined by \(L(\theta|\mathbf{x})=f(\mathbf{x}|\theta)\) is called the likelihood function.
Comparing the likelihood function at two parameter values gives an approximate comparison of the probability of the observed sample value \(\mathbf{x}\) under those two values of \(\theta\).
Definition 7.2 (Likelihood Principle) If \(\mathbf{x}\) and \(\mathbf{y}\) are two sample points such that \(L(\theta|\mathbf{x})\) is proportional to \(L(\theta|\mathbf{y})\), that is, there exists a constant \(C(\mathbf{x},\mathbf{y})\) such that \[\begin{equation} L(\theta|\mathbf{x})=L(\theta|\mathbf{y})C(\mathbf{x},\mathbf{y}),\quad\forall\theta \tag{7.1} \end{equation}\] then the conclusions drawn from \(\mathbf{x}\) and \(\mathbf{y}\) should be identical.

The rationale behind the Likelihood Principle is the following: the likelihood function is used to compare the plausibility of various parameter values, and if \(L(\theta_1|\mathbf{x})=2L(\theta_2|\mathbf{x})\), then in some sense \(\theta_1\) is twice as plausible as \(\theta_2\). If (7.1) also holds, then \(L(\theta_1|\mathbf{y})=2L(\theta_2|\mathbf{y})\). Thus, \(\mathbf{x}\) and \(\mathbf{y}\) provide the same information about \(\theta\).

As a function of \(\mathbf{x}\), \(f(\mathbf{x}|\theta)\) is a pdf (or pmf), but there is no guarantee that \(L(\theta|\mathbf{x})\) is a pdf as a function of \(\theta\).
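
A quick one-observation illustration of this point (added here; it is not in the original notes): for a single Bernoulli(\(\theta\)) observation with \(x=1\),
\[
L(\theta|x=1)=\theta \quad\text{for }\theta\in[0,1], \qquad \int_0^1 L(\theta|x=1)\,d\theta=\frac{1}{2}\neq 1,
\]
so the likelihood is a legitimate function of \(\theta\) but not a pdf in \(\theta\).
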
Example 7.1 (Likelihood Principle for Normal Distribution) Let \(X_1,\cdots,X_n\) and \(Y_1,\cdots,Y_n\) be random samples from \(N(\mu,\sigma^2)\) with known \(\sigma^2\), with sample means \(\bar{x}\) and \(\bar{y}\), respectively. If we choose \[\begin{equation} C(\mathbf{x},\mathbf{y})=\exp\left(-\sum_{i=1}^n(x_i-\bar{x})^2/(2\sigma^2)+\sum_{i=1}^n(y_i-\bar{y})^2/(2\sigma^2)\right) \tag{7.2} \end{equation}\] then the relation (7.1) is satisfied if and only if \(\bar{x}=\bar{y}\). Thus, the Likelihood Principle states that the same conclusion about \(\mu\) should be drawn from any two sample points satisfying \(\bar{x}=\bar{y}\).
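
As a quick numerical sanity check of this example, the following sketch (added here; the sample values, \(\sigma\), and the grid of \(\mu\) values are made up) verifies that (7.1) holds with the constant (7.2) whenever the two sample means agree.

```python
# A numerical sanity check of Example 7.1 (not part of the original notes).
# The sample values, sigma, and the grid of mu values are made up; the two
# samples are chosen to have the same mean.
import numpy as np

sigma = 2.0
x = np.array([1.0, 3.0, 5.0])   # sample mean 3
y = np.array([2.0, 3.0, 4.0])   # different sample, same mean 3

def log_lik(mu, sample, sigma):
    """Log-likelihood of an i.i.d. N(mu, sigma^2) sample, constants included."""
    n = len(sample)
    return (-n / 2) * np.log(2 * np.pi * sigma**2) \
        - np.sum((sample - mu) ** 2) / (2 * sigma**2)

# log of the constant C(x, y) in (7.2)
log_C = (-np.sum((x - x.mean()) ** 2) + np.sum((y - y.mean()) ** 2)) / (2 * sigma**2)

for mu in np.linspace(-2.0, 8.0, 5):
    diff = log_lik(mu, x, sigma) - log_lik(mu, y, sigma)
    # Because x.mean() == y.mean(), the difference equals log_C for every mu,
    # i.e. L(mu|x) = C(x, y) L(mu|y), as required by (7.1).
    print(round(mu, 2), bool(np.isclose(diff, log_C)))
```
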

Formally, we define an experiment \(E\) to be a triple \((\mathbf{X},\theta,\{f(\mathbf{x}|\theta)\})\), where \(\mathbf{X}\) is a random vector with pdf or pmf \(f(\mathbf{x}|\theta)\) for some \(\theta\) in the parameter space \(\Theta\). An experimenter, knowing what experiment \(E\) was performed and having observed a particular sample \(\mathbf{X}=\mathbf{x}\), will draw some conclusion about \(\theta\). This conclusion, denoted by \(Ev(E,\mathbf{x})\), stands for the evidence about \(\theta\) arising from \(E\) and \(\mathbf{x}\).

Example 7.2 (Evidence Function) Let \(E\) be the experiment consisting of observing \(X_1,\cdots,X_n\) i.i.d. from \(N(\mu,\sigma^2)\) with known \(\sigma^2\). We use the sample mean \(\bar{x}\) as an estimate of \(\mu\). To convey the accuracy of this estimate, it is common to report the standard deviation of \(\bar{X}\), which is \(\sigma/\sqrt{n}\). Thus, we could define \(Ev(E,\mathbf{x})=(\bar{x},\sigma/\sqrt{n})\). Here the \(\bar{x}\) coordinate depends on the observed sample \(\mathbf{x}\), while the \(\sigma/\sqrt{n}\) coordinate depends only on the knowledge of \(E\).
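
A minimal sketch of this evidence function (the data values and \(\sigma\) below are made up for illustration):

```python
# A minimal sketch of the evidence function Ev(E, x) = (xbar, sigma / sqrt(n))
# from Example 7.2; the data values and sigma below are made up.
import numpy as np

def evidence(x, sigma):
    """Return (sample mean, standard deviation of the sample mean)."""
    x = np.asarray(x, dtype=float)
    return x.mean(), sigma / np.sqrt(len(x))

print(evidence([4.1, 5.3, 3.8, 4.6], sigma=1.0))   # (4.45, 0.5)
```
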
Definition 7.3 (Formal Sufficiency Principle) Consider experiment \(E=(\mathbf{X},\theta,\{f(\mathbf{x}|\theta)\})\) and suppose \(T(\mathbf{X})\) is a sufficient statistic for \(\theta\). If \(\mathbf{x}\) and \(\mathbf{y}\) are sample points satisfying \(T(\mathbf{x})=T(\mathbf{y})\), then \(Ev(E,\mathbf{x})=Ev(E,\mathbf{y})\).
This formal version goes slightly further than the Sufficiency Principle in Definition 3.2: there, no mention was made of the experiment, whereas here we agree to equate evidence whenever the sufficient statistics match.
Definition 7.4 (Conditionality Principle) Suppose that \(E_1=(\mathbf{X}_1,\theta,\{f_1(\mathbf{x}_1|\theta)\})\) and \(E_2=(\mathbf{X}_2,\theta,\{f_2(\mathbf{x}_2|\theta)\})\) are two experiments, where only the unknown parameter \(\theta\) need be common. Consider the mixed experiment in which the random variable \(J\) is observed, where \(P(J=1)=P(J=2)=\frac{1}{2}\), and then experiment \(E_J\) is performed. Formally, the experiment performed is \(E^*=(\mathbf{X}^*,\theta,\{f^*(\mathbf{x}^*|\theta)\})\), where \(\mathbf{X}^*=(J,\mathbf{X}_J)\) and \(f^*(\mathbf{x}^*|\theta)=f^*((j,\mathbf{x}_j)|\theta)=\frac{1}{2}f_j(\mathbf{x}_j|\theta)\). Then \[\begin{equation} Ev(E^*,(j,\mathbf{x}_j))=Ev(E_j,\mathbf{x}_j) \tag{7.3} \end{equation}\]
This principle simply says that if one of two experiments is randomly chosen and the chosen experiment is performed, yielding data \(\mathbf{x}\), then the information about \(\theta\) depends only on the experiment actually performed. The fact that this experiment was performed, rather than the other, should not change our knowledge of \(\theta\).
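
The construction of \(E^*\) can be mimicked directly in code. The sketch below flips the fair coin \(J\), performs \(E_J\), and evaluates the mixture density \(f^*((j,\mathbf{x}_j)|\theta)=\frac{1}{2}f_j(\mathbf{x}_j|\theta)\); the specific component experiments \(f_1\) and \(f_2\) are chosen only for illustration and are not part of the original notes.

```python
# A sketch of the mixed experiment E* in Definition 7.4: flip a fair coin J,
# perform E_J, and evaluate f*((j, x_j)|theta) = (1/2) f_j(x_j|theta).
# The two component experiments (one N(theta, 1) draw and one Exponential
# draw with mean theta) are made up purely for illustration.
import numpy as np

rng = np.random.default_rng(0)

def f1(x, theta):
    """Density of E1: a single observation from N(theta, 1)."""
    return np.exp(-(x - theta) ** 2 / 2) / np.sqrt(2 * np.pi)

def f2(x, theta):
    """Density of E2: a single observation from Exponential with mean theta."""
    return np.exp(-x / theta) / theta if x >= 0 else 0.0

def run_mixed_experiment(theta):
    """Perform E*: observe J with P(J=1)=P(J=2)=1/2, then run E_J."""
    j = int(rng.integers(1, 3))
    x = rng.normal(theta, 1.0) if j == 1 else rng.exponential(theta)
    return j, x

def f_star(j, x, theta):
    """Density of the mixed experiment, f*((j, x_j)|theta) = (1/2) f_j(x_j|theta)."""
    return 0.5 * (f1(x, theta) if j == 1 else f2(x, theta))

j, x = run_mixed_experiment(theta=2.0)
print(j, round(x, 3), f_star(j, x, theta=2.0))
```
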
Definition 7.5 (Formal Likelihood Principle) Suppose that we have two experiments, \(E_1=(\mathbf{X}_1,\theta,\{f_1(\mathbf{x}_1|\theta)\})\) and \(E_2=(\mathbf{X}_2,\theta,\{f_2(\mathbf{x}_2|\theta)\})\), where the unknown parameter \(\theta\) is the same in both experiments. Suppose \(\mathbf{x}_1^*\) and \(\mathbf{x}_2^*\) are sample points from \(E_1\) and \(E_2\), respectively, such that \[\begin{equation} L(\theta|\mathbf{x}_2^*)=CL(\theta|\mathbf{x}_1^*) \tag{7.4} \end{equation}\] for all \(\theta\) and for some constant \(C\) that may depend on \(\mathbf{x}_1^*\) and \(\mathbf{x}_2^*\) but not \(\theta\). Then \[\begin{equation} Ev(E_1,\mathbf{x}_1^*)=Ev(E_2,\mathbf{x}_2^*) \tag{7.5} \end{equation}\]
This Formal Likelihood Principle goes slightly further than the Likelihood Principle in Definition 7.2, which concerns only a single experiment.
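
A classic concrete illustration of this principle, added here for reference (it is not part of the lecture text), compares a binomial experiment with a negative binomial experiment that yield proportional likelihoods:

```python
# A standard concrete illustration (not from the lecture itself): E1 observes
# X1 ~ Binomial(12, theta) with x1* = 3 successes; E2 samples until the third
# success and observes x2* = 9 failures (Negative Binomial). The two
# likelihoods are proportional, so (7.4) holds and the Formal Likelihood
# Principle asserts Ev(E1, 3) = Ev(E2, 9).
import math
import numpy as np

def lik_binomial(theta, n=12, x=3):
    """Likelihood from E1: C(n, x) theta^x (1 - theta)^(n - x)."""
    return math.comb(n, x) * theta**x * (1 - theta) ** (n - x)

def lik_negbinomial(theta, r=3, f=9):
    """Likelihood from E2: C(r + f - 1, f) theta^r (1 - theta)^f."""
    return math.comb(r + f - 1, f) * theta**r * (1 - theta) ** f

thetas = np.linspace(0.05, 0.95, 10)
ratio = lik_binomial(thetas) / lik_negbinomial(thetas)
# The ratio is constant in theta: C = C(12,3) / C(11,9) = 220 / 55 = 4.
print(np.round(ratio, 6))
```
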

Corollary 7.1 (Likelihood Principle Corollary) Assume the Formal Likelihood Principle. If \(E=(\mathbf{X},\theta,\{f(\mathbf{x}|\theta)\})\) is an experiment, then \(Ev(E,\mathbf{x})\) should depend on \(E\) and \(\mathbf{x}\) only through \(L(\theta|\mathbf{x})\).

(This is Exercise 6.32 in Casella and Berger (2002).)
Proof. For \(\mathbf{x}^*\in\mathcal{X}\), define \[\begin{equation} Y=\left\{\begin{aligned} & 1 & \mathbf{X}=\mathbf{x}^*\\& 0 & \mathbf{X}\neq\mathbf{x}^*\end{aligned}\right. \tag{7.6} \end{equation}\] Then \(Y\) has distribution given by \[\begin{equation} P(Y=1|\theta)=f(\mathbf{x}^*|\theta)=1-P(Y=0|\theta) \tag{7.7} \end{equation}\] Let \(E^*\) be the experiment consisting of observing \(Y\). The likelihood of \(Y=1\) in \(E^*\) is proportional (in fact equal) to the likelihood of \(\mathbf{x}^*\) in \(E\), so by the Formal Likelihood Principle, \[\begin{equation} Ev(E,\mathbf{x}^*)=Ev(E^*,1) \tag{7.8} \end{equation}\] By (7.7), the experiment \(E^*\) is determined by the function \(\theta\mapsto f(\mathbf{x}^*|\theta)\), so \(Ev(E^*,1)\), and hence \(Ev(E,\mathbf{x}^*)\), depends only on \(f(\mathbf{x}^*|\theta)=L(\theta|\mathbf{x}^*)\).
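
To spell out the step linking \(E\) and \(E^*\) (this display is added here for clarity and is not in the original notes):
\[
L_{E^*}(\theta|Y=1)=P(Y=1|\theta)=f(\mathbf{x}^*|\theta)=L_{E}(\theta|\mathbf{x}^*),
\]
so (7.4) holds with \(C=1\), and the Formal Likelihood Principle applied to the pair \((E,\mathbf{x}^*)\) and \((E^*,1)\) yields (7.8).
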

Theorem 7.1 (Birnbaum Theorem) The Formal Likelihood Principle is equivalent to the Formal Sufficiency Principle and the Conditionality Principle.

(This is Exercise 6.33 in Casella and Berger (2002).)

Proof. Let \(E_1,E_2,\mathbf{x}^*_1\) and \(\mathbf{x}_2^*\) be as defined in the Formal Likelihood Principle, and let \(E^*\) be the mixed experiment from the Conditionality Principle. On the sample space of \(E^*\) define the statistic

\[\begin{equation} T(j,\mathbf{x}_j)=\left\{\begin{aligned} &(1,\mathbf{x}_1^*) & \text{if } j=1,\mathbf{x}_1=\mathbf{x}^*_1 \text{ or } j=2,\mathbf{x}_2=\mathbf{x}^*_2\\ & (j,\mathbf{x}_j) & \text{otherwise} \end{aligned} \right. \tag{7.9} \end{equation}\]

Thus the two outcomes \((1,\mathbf{x}_1^*)\) and \((2,\mathbf{x}_2^*)\) result in the same value of \(T\). \(T(J,\mathbf{X}_J)\) is a sufficient statistic for \(\theta\) in experiment \(E^*\), since \[\begin{equation} P(\mathbf{X}^*=(j,\mathbf{x}_j)|T=t\neq (1,\mathbf{x}_1^*))=\left\{\begin{aligned}& 1 & \quad (j,\mathbf{x}_j)=t \\& 0 & \quad \text{otherwise} \end{aligned}\right. \tag{7.10} \end{equation}\]

and \[\begin{equation} \begin{split} P(\mathbf{X}^*=(1,\mathbf{x}_1^*)|T=(1,\mathbf{x}_1^*))&= 1-P(\mathbf{X}^*=(2,\mathbf{x}_2^*)|T=(1,\mathbf{x}_1^*))\\ &=\frac{\frac{1}{2}f_1(\mathbf{x}_1^*|\theta)}{\frac{1}{2}f_1(\mathbf{x}_1^*|\theta)+\frac{1}{2}f_2(\mathbf{x}_2^*|\theta)}\\ &=\frac{f_1(\mathbf{x}_1^*|\theta)}{f_1(\mathbf{x}_1^*|\theta)+Cf_1(\mathbf{x}_1^*|\theta)}=\frac{1}{1+C} \end{split} \tag{7.11} \end{equation}\] where the third equality uses \(f_2(\mathbf{x}_2^*|\theta)=Cf_1(\mathbf{x}_1^*|\theta)\), the assumption (7.4) in the Formal Likelihood Principle. Since (7.10) and (7.11) are both free of \(\theta\), \(T(J,\mathbf{X}_J)\) is a sufficient statistic for \(\theta\). Then from the Formal Sufficiency Principle, \[\begin{equation} Ev(E^*,(1,\mathbf{x}_1^*))=Ev(E^*,(2,\mathbf{x}_2^*)) \tag{7.12} \end{equation}\] And from the Conditionality Principle, \[\begin{equation} \begin{split} &Ev(E^*,(1,\mathbf{x}_1^*))=Ev(E_1,\mathbf{x}_1^*)\\ &Ev(E^*,(2,\mathbf{x}_2^*))=Ev(E_2,\mathbf{x}_2^*) \end{split} \tag{7.13} \end{equation}\] Combining (7.12) and (7.13), we obtain \(Ev(E_1,\mathbf{x}_1^*)=Ev(E_2,\mathbf{x}_2^*)\), which is the Formal Likelihood Principle.

For the other direction, consider the mixed experiment \(E^*\). The likelihood of the observed point \((j,\mathbf{x}_j)\) is \(L(\theta|(j,\mathbf{x}_j))=\frac{1}{2}f_j(\mathbf{x}_j|\theta)\), which is proportional to \(f_j(\mathbf{x}_j|\theta)\), the likelihood function in \(E_j\) when \(\mathbf{x}_j\) is observed. So the Formal Likelihood Principle implies that \[\begin{equation} Ev(E^*,(j,\mathbf{x}_j))=Ev(E_j,\mathbf{x}_j) \tag{7.14} \end{equation}\] which is the Conditionality Principle. Moreover, if \(T(\mathbf{X})\) is sufficient and \(T(\mathbf{x})=T(\mathbf{y})\), then the likelihoods of \(\mathbf{x}\) and \(\mathbf{y}\) are proportional (by the Factorization Theorem), and the Formal Likelihood Principle implies that \(Ev(E,\mathbf{x})=Ev(E,\mathbf{y})\), which is just the Formal Sufficiency Principle.
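
The sufficiency computation in (7.11) can also be checked numerically. The sketch below reuses the binomial / negative-binomial pair introduced after Definition 7.5; the specific numbers are illustrative only.

```python
# A numerical check of (7.11), reusing the binomial / negative-binomial pair
# from the sketch after Definition 7.5 (the pair itself is not in the notes).
# When f2(x2*|theta) = C f1(x1*|theta), the conditional probability
# (1/2) f1 / ((1/2) f1 + (1/2) f2) is free of theta, as sufficiency requires.
import math

def f1(theta):
    """E1: Binomial(12, theta) likelihood at x1* = 3."""
    return math.comb(12, 3) * theta**3 * (1 - theta) ** 9

def f2(theta):
    """E2: Negative Binomial(3, theta) likelihood at x2* = 9 failures."""
    return math.comb(11, 9) * theta**3 * (1 - theta) ** 9

C = f2(0.5) / f1(0.5)    # = 55 / 220 = 1/4, the constant in (7.4)
for theta in (0.1, 0.3, 0.7):
    p = 0.5 * f1(theta) / (0.5 * f1(theta) + 0.5 * f2(theta))
    # Always equal to 1 / (1 + C) = 0.8, independent of theta.
    print(theta, round(p, 6), round(1 / (1 + C), 6))
```
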
There are common statistical procedures that violate the Formal Likelihood Principle. The Formal Sufficiency Principle is model-dependent: believing this principle requires believing the model, which may not be easy to do because model checking is often not based on a sufficient statistic (for example, model checking based on residuals). The derivation of the Formal Likelihood Principle can also be criticized because, in the proof above, the Sufficiency Principle is applied in ignorance of the Conditionality Principle: the sufficient statistic \(T(J,\mathbf{X}_J)\) is defined on the mixture experiment. If the Conditionality Principle were invoked first, then separate sufficient statistics would have to be defined for each experiment, and \(T(J,\mathbf{X}_J)\) could not take the same value at sample points coming from different experiments.

References

Casella, George, and Roger Berger. 2002. Statistical Inference. 2nd ed. Belmont, CA: Duxbury Resource Center.