2.1 Past FYE Problems

Exercise 2.1 (By Bruno, FYE 2020) Consider a random sample \(X_i\), \(i=1,\cdots,n\), from the \(Unif(0,\theta)\) distribution, with unknown \(\theta>0\).

  • (i). (10%) Derive the likelihood of \(\theta\) and find the MLE, \(\hat{\theta}\).

  • (ii). (15%) Show that \(\hat{\theta}\) is a function of a minimal sufficient statistic.

  • (iii). (20%) Find the distribution of \(\hat{\theta}\).

  • (iv). (30%) Find a pivot \(T(\theta,\hat{\theta})\) based on the CDF of \(\hat{\theta}\) and use the equation \[\begin{equation} Pr(a\leq T(\theta,\hat{\theta})\leq b)=0.95,\quad a,b\in[0,1] \end{equation}\] to find a family of \(95\%\) confidence intervals for \(\theta\).

  • (v). (25%) Find the value of \(a\) that produces the shortest interval in the confidence-interval family from part (iv).

Proof. (i). Since the p.d.f. of \(X_i\) is \[\begin{equation} f(x_i)=\left\{\begin{aligned} &\frac{1}{\theta} &\quad x_i\in(0,\theta)\\ & 0 & \quad o.w. \end{aligned}\right. \tag{2.1} \end{equation}\] the likelihood is \[\begin{equation} L(\theta|\mathbf{x})=\prod_{i=1}^n\frac{1}{\theta}\mathbf{1}_{x_i\in(0,\theta)}=\theta^{-n}\mathbf{1}_{\min_i x_i>0}\mathbf{1}_{\max_i x_i<\theta} \tag{2.2} \end{equation}\] As a function of \(\theta\), \(L(\theta|\mathbf{x})\) vanishes for \(\theta\leq\max_i x_i\) and equals the monotonically decreasing power function \(\theta^{-n}\) for \(\theta>\max_i x_i\). It therefore achieves its maximum at the minimal possible value of \(\theta\), which is \(\max_i x_i\). Thus, \(\hat{\theta}=\max_i x_i\).
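
A quick numerical sanity check (an illustrative sketch, not part of the exam solution; the values of \(\theta\) and \(n\) below are arbitrary): evaluating the likelihood on a grid confirms that it is maximized at \(\max_i x_i\).

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true, n = 2.0, 20
x = rng.uniform(0, theta_true, size=n)

def likelihood(theta, x):
    # L(theta | x) = theta^{-n} when theta > max(x_i), and 0 otherwise
    return np.where(theta > x.max(), theta ** (-len(x)), 0.0)

grid = np.linspace(0.01, 4.0, 4000)
print("max(x)          :", x.max())
print("grid argmax of L:", grid[np.argmax(likelihood(grid, x))])
```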

(ii). To show sufficiency: from (i), \(L(\theta|\mathbf{x})=g(\hat{\theta}|\theta)h(\mathbf{x})\) with \(g(\hat{\theta}|\theta)=\theta^{-n}\mathbf{1}_{\hat{\theta}<\theta}\) and \(h(\mathbf{x})=\mathbf{1}_{\min_i x_i>0}\). Thus, by the factorization theorem, \(\max_i x_i\) is sufficient.

Now consider another sample \(\mathbf{y}\) with maximum \(\max_i y_i\). As a function of \(\theta\), the likelihood ratio \(\frac{\theta^{-n}\mathbf{1}_{\max_ix_i<\theta}}{\theta^{-n}\mathbf{1}_{\max_iy_i<\theta}}\) is constant w.r.t. \(\theta\) if and only if \(\max_ix_i=\max_iy_i\). Thus, \(\max_ix_i\) is minimal sufficient, and \(\hat{\theta}=\max_ix_i\) is trivially a function of it.

(iii). Denote the c.d.f. of \(\hat{\theta}\) by \(G\) and its p.d.f. by \(g\), and the c.d.f. and p.d.f. of \(X_i\) by \(F\) and \(f\). Since \(\hat{\theta}=\max_iX_i\), we have \(G(t)=(F(t))^n\) and \(g(t)=n(F(t))^{n-1}f(t)\). With \(F(x)=\frac{x}{\theta}\) and \(f(x)=\frac{1}{\theta}\) on \((0,\theta)\), the p.d.f. of \(\hat{\theta}\) is \[\begin{equation} g_{\hat{\theta}}(t)=\left\{\begin{aligned} &n\frac{t^{n-1}}{\theta^n} &\quad 0<t<\theta\\ & 0 & \quad o.w. \end{aligned}\right. \tag{2.3} \end{equation}\]
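
A brief simulation sketch (the values of \(\theta\) and \(n\) are arbitrary) comparing the empirical c.d.f. of \(\hat{\theta}=\max_iX_i\) with \(G(t)=(t/\theta)^n\):

```python
import numpy as np

rng = np.random.default_rng(1)
theta, n, reps = 2.0, 5, 200_000
theta_hat = rng.uniform(0, theta, size=(reps, n)).max(axis=1)

# Empirical CDF of the sample maximum versus G(t) = (t / theta)^n on (0, theta)
for t in [0.5, 1.0, 1.5, 1.9]:
    print(t, round((theta_hat <= t).mean(), 4), round((t / theta) ** n, 4))
```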

(iv). The distribution of \(X_i\) belongs to a scale family. Define \(Q(X_1,\cdots,X_n,\theta)=\frac{\max_iX_i}{\theta}\); we shall show that \(Q\) is a pivot, i.e., that its distribution does not depend on \(\theta\).

Since \(\theta>\max_ix_i\), \(Q\in(0,1)\). The c.d.f. of \(Q\) is
\[\begin{equation} F_Q(q)=P(\frac{\max_iX_i}{\theta}\leq q)=Pr(X_1\leq q\theta)\cdots Pr(X_n\leq q\theta)=\prod_{i=1}^n\frac{q\theta}{\theta}=q^n,\quad q\in(0,1) \tag{2.4} \end{equation}\] Hence the p.d.f. \(f_Q(q)=nq^{n-1}\) does not depend on \(\theta\), so \(Q=\frac{\hat{\theta}}{\theta}\) is indeed a pivot. Thus, \[\begin{equation} Pr(a\leq T(\theta,\hat{\theta})\leq b)=F_Q(b)-F_Q(a)=b^n-a^n \tag{2.5} \end{equation}\] For any \(a,b\in[0,1]\) satisfying \(b^n-a^n=0.95\), inverting \(a\leq\hat{\theta}/\theta\leq b\) gives \(\left(\frac{\hat{\theta}}{b},\frac{\hat{\theta}}{a}\right)\), with \(\hat{\theta}=\max_ix_i\), as a \(95\%\) confidence interval for \(\theta\).

(v). From \(b^n-a^n=0.95\) we have \(b=(0.95+a^n)^{\frac{1}{n}}\). The length of the interval \((\hat{\theta}/b,\hat{\theta}/a)\) is \(L(a)=\hat{\theta}\left(\frac{1}{a}-\frac{1}{b}\right)\). Define \(l(a)=\frac{1}{a}-(0.95+a^n)^{-\frac{1}{n}}\); since \(\hat{\theta}>0\), minimizing \(L(a)\) is equivalent to minimizing \(l(a)\). Taking derivatives, \[\begin{equation} l^{\prime}(a)=-\frac{1}{a^2}+\frac{a^{n-1}}{(0.95+a^n)^{\frac{n+1}{n}}}=-\frac{1}{a^2}+\frac{a^{n-1}}{b^{n+1}} \tag{2.6} \end{equation}\] Since \(0.95>0\), \(b=(0.95+a^n)^{\frac{1}{n}}>a\), so \(\frac{a^{n-1}}{b^{n+1}}<\frac{a^{n-1}}{a^{n+1}}=\frac{1}{a^2}\) and therefore \(l^{\prime}(a)<0\). Thus \(l(a)\) is a decreasing function of \(a\), which achieves its minimum at the maximum value that \(a\) can take.

Since \(b=(0.95+a^n)^{\frac{1}{n}}\) must lie in \([0,1]\) and is increasing as a function of \(a\), it attains its maximum value 1 at the maximum admissible value of \(a\). Denoting this value by \(\hat{a}\), we have \((0.95+\hat{a}^n)^{\frac{1}{n}}=1\), which implies \(\hat{a}=0.05^{\frac{1}{n}}\).

Thus, the shortest interval in the family corresponds to \(a=0.05^{\frac{1}{n}}\) and \(b=1\), namely \(\left(\hat{\theta},\ \hat{\theta}\,0.05^{-\frac{1}{n}}\right)\).
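
A Monte Carlo check of the resulting interval (an illustrative sketch; \(\theta\), \(n\), and the number of replicates are arbitrary): the empirical coverage of \((\hat{\theta},\hat{\theta}\,0.05^{-1/n})\) should be close to 0.95.

```python
import numpy as np

rng = np.random.default_rng(2)
theta, n, reps = 3.0, 10, 100_000
a = 0.05 ** (1 / n)                      # shortest-interval choice, with b = 1
theta_hat = rng.uniform(0, theta, size=(reps, n)).max(axis=1)

lower, upper = theta_hat, theta_hat / a  # the interval (theta_hat / b, theta_hat / a)
print("empirical coverage:", ((lower <= theta) & (theta <= upper)).mean())
```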

Exercise 2.2 (By Rajarshi, FYE 2018) Let \(X_1,\cdots,X_n\stackrel{i.i.d.}{\sim}U(-\theta,\theta)\), where \(\theta>0\) is a one-dimensional unknown parameter.

  1. (30%) Is \((X_{(1)},\cdots,X_{(n)})\) minimal sufficient for \(\theta\)? Provide an argument supporting your answer.

  2. (20%) Let \(h(X_1,\cdots,X_n)=\max\{|X_1|,\cdots,|X_n|\}\). Prove that \(h(X_1,\cdots,X_n)\) is a complete statistic for \(\theta\).

  3. (20%) Show that the conditional distribution of \(h(X_1,\cdots,X_n)\) given \(|X_1|/|X_3|\) is the same as the marginal distribution of \(h(X_1,\cdots,X_n)\).

  4. (30%) Provide the likelihood ratio test statistic for testing \(H_0:\theta=2\) versus \(H_1:\theta\neq 2\).

Exercise 2.3 (By Rajarshi, FYE 2015) Suppose \(X_1,\cdots,X_n\stackrel{i.i.d.}{\sim}N(\mu_1,\sigma_1^2)\) and \(Y_1,\cdots,Y_m\stackrel{i.i.d.}{\sim}N(\mu_2,\sigma_2^2)\), with \(\mu_1,\mu_2\in\mathbb{R}\) and \(\sigma_1,\sigma_2>0\). Moreover, assume that \(\{X_1,\cdots,X_n\}\) and \(\{Y_1,\cdots,Y_m\}\) are mutually independent.

(a). (30%) Derive a set of minimal sufficient statistics for \((\mu_1,\mu_2,\sigma_1,\sigma_2)\). Is this collection of statistics complete under the additional assumption that \(\mu_1=\mu_2\)? Justify your answer.

(b). (40%) Find the UMVUE for \(\theta=(\mu_1,\mu_2)/(\sigma_1^r\sigma_2^s)\), where \(1<r<n-1\) and \(1<s<m-1\). Show your work. You may use without proof the following fact: if \(Z\sim IG(\alpha,\beta)\) for \(\alpha>\eta\geq 1\) and \(\beta>0\), in the parameterization in which \(E(Z)=\beta/(\alpha-1)\), then \[\begin{equation} E(Z^{\eta})=\frac{\beta^{\eta}\Gamma(\alpha-\eta)}{\Gamma(\alpha)} \tag{2.7} \end{equation}\]

(c). (30%) Now assume that \(\sigma^2_1\) and \(\sigma_2^2\) are known, \(\sigma_1^2=\sigma_2^2\equiv\sigma^2\), and \(\mu_1=\mu_2\equiv\mu\). Explain why a UMP level \(\alpha\) test for testing \(H_0: \mu\leq\mu_0\) versus \(H_1: \mu>\mu_0\) exists, and specify this test explicitly.

Exercise 2.4 (By Rajarshi, FYE 2015 Retake) Suppose that \(X_1,\cdots,X_n\stackrel{i.i.d.}{\sim}Bernoulli(p_1)\) and \(Y_1,\cdots,Y_m\stackrel{i.i.d.}{\sim}Bernoulli(p_2)\), where \(n\geq 3, m\geq 4\), and \(p_1,p_2\in(0,1)\). Moreover, assume that the \(X_i\) and \(Y_j\) are mutually independent.

(a). (30%) Derive a set of complete sufficient statistics for \((p_1,p_2)\). Is this collection of statistics complete under the additional assumption that \(p_1=p_2\)? Justify your answer.

(b). (40%) Find the UMVUE for \(h_1(p_1)h_2(p_2)\), where \(h_1(p_1)=P_{p_1}(\sum_{i=1}^{n-2}X_i>X_{n-1}+X_n)\) and \(h_2(p_2)=P_{p_2}(\sum_{j=1}^{m-2}Y_j>Y_{m-1}+Y_m)\). Hint: Start with an unbiased estimator \(I(\sum_{i=1}^{n-2}X_i>X_{n-1}+X_n)\) of \(h_1(p_1)\), where \(I(\cdot)\) is the indicator function. You may find the Rao-Blackwell and Lehmann-Scheffé theorems useful.

(c). (30%) Consider two different hypotheses tests: (i) \(H_{0A}: p_1=\frac{1}{2}\) vs. \(H_{1A}: p_1\neq \frac{1}{2}\); and (ii) \(H_{0B}: p_2=\frac{1}{3}\) vs. \(H_{1B}: p_2\neq \frac{1}{3}\). Let \[\begin{equation} \phi_1(X_1,\cdots,X_n)=\left\{\begin{aligned} &1 &\quad X_1+X_2+X_3>1\\ & 0 & o.w. \end{aligned} \right. \end{equation}\] \[\begin{equation} \phi_2(Y_1,\cdots,Y_m)=\left\{\begin{aligned} &1 &\quad Y_1+Y_2+Y_3+Y_4>C\\ & 0 & o.w. \end{aligned} \right. \tag{2.8} \end{equation}\]
where \(\phi_1\) and \(\phi_2\) are the test functions for hypotheses (i) and (ii), respectively. Let \(\alpha\) be the size of \(\phi_1\). Compute \(\alpha\) and determine \(C\) so that \(\phi_2\) has level \(\alpha\). You do not need to simplify; an equation with \(C\) as the only unknown is enough.

Proof. (a). The joint density for \(X_1,\cdots,X_n\) and \(Y_1,\cdots,Y_m\) is given by \[\begin{equation} f(Y_1,\cdots,Y_m,X_1,\cdots,X_n|p_1,p_2)=p_1^{\sum_{i=1}^nX_i}(1-p_1)^{n-\sum_{i=1}^nX_i}p_2^{\sum_{j=1}^mY_j}(1-p_2)^{m-\sum_{j=1}^mY_j} \tag{2.9} \end{equation}\] Therefore, for any other realization \(\mathbf{U}=(U_1,\cdots,U_n)\) of \(\mathbf{X}\) and \(\mathbf{V}=(V_1,\cdots,V_m)\) of \(\mathbf{Y}\), \[\begin{equation} \frac{f(\mathbf{X},\mathbf{Y}|p_1,p_2)}{f(\mathbf{U},\mathbf{V}|p_1,p_2)}=(\frac{p_1}{1-p_1})^{\sum_{i=1}^nX_i-\sum_{i=1}^nU_i}(\frac{p_2}{1-p_2})^{\sum_{j=1}^mY_j-\sum_{j=1}^mV_j} \tag{2.10} \end{equation}\] This ratio is a constant function of the parameter vector if and only if \(\sum_{i=1}^nX_i=\sum_{i=1}^nU_i\) and \(\sum_{j=1}^mY_j=\sum_{j=1}^mV_j\). Hence a minimal sufficient statistic is \((\sum_{i=1}^nX_i,\sum_{j=1}^mY_j)\). Since this is a two-parameter exponential family and the parameter space contains an open rectangle in \((0,1)\times(0,1)\), the minimal sufficient statistic is also complete.

Note that, under the assumption that \(p_1=p_2\), \(E(m\sum_{i=1}^nX_i-n\sum_{j=1}^mY_j)=0\) for every common value of the parameter, i.e., there exists a function of \((\sum_{i=1}^nX_i,\sum_{j=1}^mY_j)\) that is not identically zero but has expectation zero for all parameter values. Therefore, \((\sum_{i=1}^nX_i,\sum_{j=1}^mY_j)\) is not complete under the additional assumption.
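
As a quick illustration of this completeness failure (a simulation sketch; the values of \(n\), \(m\), and the common success probability are arbitrary), the statistic \(g=m\sum_iX_i-n\sum_jY_j\) has Monte Carlo mean near zero for every common \(p\) while clearly not being identically zero:

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, reps = 6, 4, 200_000
for p in [0.2, 0.5, 0.8]:                  # common value p1 = p2 = p
    sx = rng.binomial(n, p, size=reps)     # sum of the X_i
    sy = rng.binomial(m, p, size=reps)     # sum of the Y_j
    g = m * sx - n * sy                    # expectation is 0 for every p
    print(p, round(g.mean(), 3), round(g.var(), 3))   # mean ~ 0, variance > 0
```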

(b). Define \[\begin{equation} H(\mathbf{X})=\left\{\begin{aligned} & 1 & \sum_{i=1}^{n-2}X_i>X_{n-1}+X_n\\ & 0 & o.w. \end{aligned}\right. \tag{2.11} \end{equation}\] Clearly, \(E(H(\mathbf{X}))=h_1(p_1)\), i.e., \(H(\mathbf{X})\) is an unbiased estimator of \(h_1(p_1)\). We use the Rao-Blackwell theorem to improve on \(H(\mathbf{X})\) by conditioning on the complete sufficient statistic \(\sum_{i=1}^nX_i\). Given \(\sum_{i=1}^nX_i=t\), all arrangements of the \(t\) ones among the \(n\) coordinates are equally likely; splitting according to the value of \((X_{n-1},X_n)\) and keeping only the cases in which \(\sum_{i=1}^{n-2}X_i>X_{n-1}+X_n\) holds, \[\begin{equation} \begin{split} E(H(\mathbf{X})|\sum_{i=1}^nX_i=t)&=P(\sum_{i=1}^{n-2}X_i>X_{n-1}+X_n|\sum_{i=1}^nX_i=t)\\ &=\frac{\mathbf{1}_{\{t\geq 1\}}{{n-2} \choose {t}}+2\cdot\mathbf{1}_{\{t\geq 3\}}{{n-2} \choose {t-1}}+\mathbf{1}_{\{t\geq 5\}}{{n-2} \choose {t-2}}}{{n \choose t}} \end{split} \tag{2.12} \end{equation}\] where the three terms correspond to \((X_{n-1},X_n)=(0,0)\), exactly one of \(X_{n-1},X_n\) equal to 1, and \((X_{n-1},X_n)=(1,1)\), respectively; the indicators enforce the inequality in each case (for instance, when \(X_{n-1}=X_n=1\) we need \(\sum_{i=1}^{n-2}X_i=t-2>2\)). By the Rao-Blackwell theorem, \(F_1(\mathbf{X})=E(H(\mathbf{X})|\sum_{i=1}^nX_i)\), obtained by evaluating (2.12) at \(t=\sum_{i=1}^nX_i\), is an unbiased estimator of \(h_1(p_1)\). Since this estimator is a function of the complete sufficient statistic, by the Lehmann-Scheffé theorem it is the UMVUE of \(h_1(p_1)\). Similarly, the UMVUE for \(h_2(p_2)\) is \(F_2(\mathbf{Y})=E(I(\sum_{j=1}^{m-2}Y_j>Y_{m-1}+Y_m)|\sum_{j=1}^mY_j)\), given by the same expression with \(n\) replaced by \(m\) and \(t=\sum_{j=1}^mY_j\). Using the independence of \(\mathbf{X}\) and \(\mathbf{Y}\), \(F_1(\mathbf{X})F_2(\mathbf{Y})\) is unbiased for \(h_1(p_1)h_2(p_2)\) and is a function of the complete sufficient statistic \((\sum_{i=1}^nX_i,\sum_{j=1}^mY_j)\), so it is the UMVUE for \(h_1(p_1)h_2(p_2)\).
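
A small exact check of the Rao-Blackwellized estimator (an illustrative sketch; the choices of \(n\) and \(p_1\) are arbitrary): it sums \(F_1\) against the binomial distribution of \(\sum_iX_i\) and compares the result with \(h_1(p_1)\) computed by direct enumeration.

```python
from math import comb

def F1(t, n):
    # E(H(X) | sum X_i = t), following equation (2.12)
    num = 0
    if t >= 1: num += comb(n - 2, t)          # (X_{n-1}, X_n) = (0, 0)
    if t >= 3: num += 2 * comb(n - 2, t - 1)  # exactly one of X_{n-1}, X_n is 1
    if t >= 5: num += comb(n - 2, t - 2)      # (X_{n-1}, X_n) = (1, 1)
    return num / comb(n, t)

def h1(p, n):
    # P(sum_{i<=n-2} X_i > X_{n-1} + X_n), by enumerating the two independent sums
    return sum(comb(n - 2, s) * p**s * (1 - p)**(n - 2 - s)
               * comb(2, k) * p**k * (1 - p)**(2 - k)
               for s in range(n - 1) for k in range(3) if s > k)

n, p = 6, 0.3
expectation = sum(F1(t, n) * comb(n, t) * p**t * (1 - p)**(n - t) for t in range(n + 1))
print(round(expectation, 10), round(h1(p, n), 10))   # the two values agree
```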

(c). The size of \(\phi_1\) is \(E_{p_1=\frac{1}{2}}(\phi_1)=P_{p_1=\frac{1}{2}}(X_1+X_2+X_3>1)={3 \choose 2}(\frac{1}{2})^3+{3 \choose 3}(\frac{1}{2})^3=\frac{1}{2}\).

The size of \(\phi_2\) is given by \(E_{p_2=\frac{1}{3}}(\phi_2)=P_{p_2=\frac{1}{3}}(Y_1+Y_2+Y_3+Y_4>C)=\sum_{i=[C]+1}^4{4 \choose i}(\frac{1}{3})^{i}(\frac{2}{3})^{4-i}\), where \([C]\) denotes the greatest integer not exceeding \(C\).

\(C\) can now be determined from the level requirement \(\sum_{i=[C]+1}^4{4 \choose i}(\frac{1}{3})^{i}(\frac{2}{3})^{4-i}\leq\frac{1}{2}\); a numerical check appears in the sketch below.
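
A short numerical sketch (not required by the exam; it simply tabulates the relevant binomial tail probabilities): the size of \(\phi_1\) is \(1/2\), and the size of \(\phi_2\) for each integer value of \([C]\) shows which choices of \(C\) give a level-\(\alpha\) test.

```python
from math import comb

# Size of phi_1 under H_0A: p1 = 1/2
alpha = sum(comb(3, i) * 0.5**3 for i in (2, 3))
print("alpha =", alpha)

# Size of phi_2 under H_0B: p2 = 1/3, as a function of floor(C)
for c in range(5):
    size = sum(comb(4, i) * (1 / 3)**i * (2 / 3)**(4 - i) for i in range(c + 1, 5))
    print("floor(C) =", c, " size =", round(size, 4), " level alpha?", size <= alpha)
```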

Exercise 2.5 (By Draper, FYE 2014) This problem concerns some things that can go wrong with frequentist inference when Your sample size is small.

You have a single observation \(Y\), which you know is drawn from the \(Beta(\theta,1)\) distribution, but \(\theta\in(0,1)\) is unknown to you; Your goal here is to construct good point and interval estimates of \(\theta\). Recall that in this sampling model (a) \(p(y|\theta)=\theta y^{\theta-1}I(0<y<1)\), where \(I(A)=1\) if proposition A is true and 0 otherwise, and (b) the repeated-sampling mean of \(Y\) is \(E_{RS}(Y|\theta)=\frac{\theta}{\theta+1}\).

(1). (15%) Consider any estimator \(\hat{\theta}(y)\) of \(\theta\) in this model, where \(y\) is a realized value of \(Y\). Identify all of the following qualitative behaviors that would be desirable for \(\hat{\theta}(y)\) here, explaining briefly in each case, or explain briefly why none of them would be desirable (if none are). Hint: You may find it helpful to sketch the likelihood function (arising from this sampling model) for several values of \(y\).

  • (a). \(\lim_{y\downarrow 0}\hat{\theta}(y)=0\).

  • (b). \(\hat{\theta}(y)\) should be a non-decreasing function of \(y\).

  • (c). \(\lim_{y\uparrow 1}\hat{\theta}(y)=1\).

(2). (15%) Find the method-of-moments estimate \(\hat{\theta}_{MoM}\) of \(\theta\), and identify the complete set of observed values \(y\) of \(Y\) under which \(\hat{\theta}_{MoM}\) fails to respect the range restrictions on \(\theta\). Which, if any, of the desirable qualitative behaviors in (1) does \(\hat{\theta}_{MoM}\) exhibit? Explain briefly.

(3). (15%) Repeat all parts of (2) for the maximum-likelihood estimate \(\hat{\theta}_{MLE}\), showing that in this problem \(\hat{\theta}_{MLE}=\min(-\frac{1}{\log Y},1)\). Does this approach exhibit better qualitative performance than \(\hat{\theta}_{MoM}\) here? Explain briefly.

(4). (20%) Use observed information to construct an approximate 95% MLE-based confidence interval for \(\theta\), identifying the assumptions built into this method and commenting on whether they’re likely to be met in this case. Are there values of \(y\) for which this confidence interval can violate the range restrictions on \(\theta\)? If so, identify the relevant set of \(y\) values; if not, briefly explain why not.

(5). (35%) Show that in this model \(V=\frac{\theta}{-\frac{1}{\log Y}}=-\theta\log Y\) is a pivotal quantity, in this case with the standard exponential distribution \(p(v)=e^{-v}\), and use this to construct (for any given \(\alpha\in(0,1)\)) an exact \(100(1-\alpha)\%\) confidence interval for \(\theta\) based on the single observation \(Y\). As a function of \(Y\), can anything go wrong with this interval with respect to range restrictions on \(\theta\)? Explain briefly.

Proof. (1). You can see, both from rough sketches of the likelihood function \(L(\theta|y)\) for several values of \(y\) and from the repeated-sampling mean, that all three of (a)–(c) are desirable qualitative behaviors for a good estimator in this problem.

(2). Solving \(E_{RS}(Y|\theta)=\frac{\theta}{\theta+1}=y\) for \(\theta\) yields \(\hat{\theta}_{MoM}=\frac{y}{1-y}\). This estimator behaves sensibly as \(y\downarrow 0\) and is monotonically increasing in \(y\), so properties (1)(a) and (1)(b) are satisfied, but \(\hat{\theta}_{MoM}\) goes to \(\infty\) as \(y\uparrow 1\), and in fact \(\hat{\theta}_{MoM}\geq 1\) (thereby violating the basic range restriction for \(\theta\)) for all \(y\geq\frac{1}{2}\).

(3). The log-likelihood function is \(\ell(\theta|y)=\log(\theta)+(\theta-1)\log y\), from which \(\frac{\partial \ell(\theta|y)}{\partial\theta}=\frac{1}{\theta}+\log y\); this first partial derivative has a unique zero at \(-\frac{1}{\log y}\), which is a monotonically increasing function of \(y\). This is the global maximum of \(\ell(\theta|y)\) in \((0,1)\) as long as \(-\frac{1}{\log y}<1\), which is true only for \(0<y<\frac{1}{e}\approx 0.37\); for any \(y\geq \frac{1}{e}\), the maximum occurs at the boundary \(\theta=1\), so in that case \(\hat{\theta}_{MLE}=1\). Thus, \(\hat{\theta}_{MLE}=\min(-\frac{1}{\log y},1)\), which never violates the range restriction on \(\theta\). In this setting \(\hat{\theta}_{MLE}\) satisfies all three of the desirable qualitative behaviors in (1), so its qualitative performance is better than that of \(\hat{\theta}_{MoM}\).
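
A quick grid-search check of the closed-form MLE (an illustrative sketch; the particular values of \(y\) are arbitrary):

```python
import numpy as np

def loglik(theta, y):
    # log-likelihood of a single Beta(theta, 1) observation y
    return np.log(theta) + (theta - 1) * np.log(y)

grid = np.linspace(1e-4, 1.0, 100_000)
for y in [0.05, 0.2, 1 / np.e, 0.6, 0.9]:
    mle_grid = grid[np.argmax(loglik(grid, y))]
    mle_formula = min(-1 / np.log(y), 1.0)
    print(round(y, 3), round(mle_grid, 4), round(mle_formula, 4))
```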

(4). Evidently \(\frac{\partial^2 \ell(\theta|y)}{\partial\theta^2}=-\frac{1}{\theta^2}\), from which the observed information is \[\begin{equation} \hat{I}(\hat{\theta}_{MLE})=\left[-\frac{\partial^2}{\partial\theta^2}\ell(\theta|y)\right]_{\theta=\hat{\theta}_{MLE}}=(\log y)^2 \tag{2.13} \end{equation}\] for \(0<y<\frac{1}{e}\); for \(y\geq\frac{1}{e}\), \(\hat{I}(\hat{\theta}_{MLE})=1\). For \(0<y<\frac{1}{e}\), the usual standard error (SE) associated with the MLE is then \(\hat{SE}(\hat{\theta}_{MLE})=\sqrt{\hat{I}^{-1}(\hat{\theta}_{MLE})}=\frac{1}{|\log y|}\), with associated approximate 95% confidence interval \(-\frac{1}{\log y}\pm \frac{1.96}{|\log y|}=\frac{1\pm 1.96}{|\log y|}\). For \(y\geq\frac{1}{e}\), the corresponding interval is \(1\pm 1.96\). This method assumes that the sample size \(n\) is big enough (a) for observed information to provide an accurate SE and (b) for the repeated-sampling distribution of the MLE (for the actual \(n\) in the problem under study) to be close to Gaussian. Here, with \(n=1\), neither of these assumptions is anywhere near correct. Moreover, you can readily see that (i) the left endpoint of the interval \(\frac{1\pm 1.96}{|\log y|}\) is guaranteed to be negative, (ii) the right endpoint of that interval is less than 1 only when \(\log y<-2.96\), i.e., only when \(y\) is less than about 0.05, and (iii) the interval \(1\pm 1.96\) runs from \(-0.96\) to \(2.96\). Fisher did not have \(n=1\) in mind when he proposed this method.
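
The sketch below (illustrative \(y\) values only) evaluates the interval just derived and displays the range violations described above:

```python
import numpy as np

def wald_interval(y):
    # Approximate 95% interval based on observed information, as derived above
    if y < 1 / np.e:
        mle, se = -1 / np.log(y), 1 / abs(np.log(y))
    else:
        mle, se = 1.0, 1.0
    return mle - 1.96 * se, mle + 1.96 * se

for y in [0.01, 0.05, 0.2, 0.5, 0.9]:
    lo, hi = wald_interval(y)
    print(y, round(lo, 3), round(hi, 3))  # lower endpoints negative; upper often exceeds 1
```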

(5). The fact that \(V=-\theta\log Y\) has the \(Exp(1)\) distribution follows directly from the change-of-variable formula, or from the c.d.f.: \(P(V\leq v)=P(Y\geq e^{-v/\theta})=1-(e^{-v/\theta})^{\theta}=1-e^{-v}\) for \(v>0\). All of this means that in repeated sampling \[\begin{equation} 1-\alpha=P_{RS}[F_{\epsilon}^{-1}(\frac{\alpha}{2})<-\theta\log Y<F_{\epsilon}^{-1}(1-\frac{\alpha}{2})] \tag{2.14} \end{equation}\] where \(F_{\epsilon}^{-1}(\cdot)\) is the inverse c.d.f. of the standard exponential distribution; after some rearrangement this becomes \[\begin{equation} P_{RS}[-\frac{F_{\epsilon}^{-1}(\frac{\alpha}{2})}{\log Y}<\theta<-\frac{F_{\epsilon}^{-1}(1-\frac{\alpha}{2})}{\log Y}]=1-\alpha \end{equation}\] so that \([-\frac{F_{\epsilon}^{-1}(\frac{\alpha}{2})}{\log Y},-\frac{F_{\epsilon}^{-1}(1-\frac{\alpha}{2})}{\log Y}]\) is a frequentist-valid \(100(1-\alpha)\%\) confidence interval for \(\theta\) in this model. The left and right endpoints of this interval cannot go negative for any \(Y\in(0,1)\), which is the good news; but it is easy to show (based on the inverse c.d.f. of the \(Exp(1)\) distribution, which is \(F_{\epsilon}^{-1}(p)=-\log(1-p)\)) that the right endpoint will be bigger than 1 whenever \(Y>\frac{\alpha}{2}\) and the left endpoint will also be bigger than 1 for all \(Y>1-\frac{\alpha}{2}\). Valid and sensible frequentist inference is difficult when \(n\) is small.
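
A simulation sketch of the exact interval (arbitrary \(\theta\) and \(\alpha=0.05\)): it confirms the nominal coverage and shows how often the upper endpoint exceeds 1.

```python
import numpy as np

rng = np.random.default_rng(4)
alpha, theta, reps = 0.05, 0.4, 100_000
q_lo = -np.log(1 - alpha / 2)   # Exp(1) quantile at alpha/2
q_hi = -np.log(alpha / 2)       # Exp(1) quantile at 1 - alpha/2

y = rng.beta(theta, 1.0, size=reps)              # one Beta(theta, 1) observation per replicate
lower, upper = -q_lo / np.log(y), -q_hi / np.log(y)
print("coverage    :", ((lower <= theta) & (theta <= upper)).mean())
print("P(upper > 1):", (upper > 1).mean(), " theory:", 1 - (alpha / 2) ** theta)
```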

Exercise 2.6 (By Draper, FYE 2014 Retake) For each statement below, say whether it is true or false; if true without further assumptions, briefly explain why it is true and describe its implications for statistical inference; if it is sometimes true, give the extra conditions necessary to make it true; if it is false, briefly explain how to change it so that it is true and/or give an example of why it is false.

  1. (25%) When the set \(\Theta\) of possible values of a (one-dimensional) real-valued parameter \(\theta\) has one or more boundaries beyond which \(\theta\) does not make scientific sense, an unbiased estimator \(\hat{\theta}_U\) of \(\theta\) can misbehave (in the sense of taking values that are outside \(\Theta\)). However, when \(\Theta\) is the entire real line, this sort of absurd behavior of \(\hat{\theta}_U\) cannot occur, and \(\hat{\theta}_U\) may provide a reasonably good estimate of \(\theta\).

  2. (25%) Consider the sampling model \((Y_i|\boldsymbol{\eta}, \boldsymbol{\mathcal{B}})\stackrel{i.i.d.}{\sim}p(y_i|\boldsymbol{\eta}, \boldsymbol{\mathcal{B}})\), for \(i=1,\cdots,n\), where the \(Y_i\) are univariate real values, \(\boldsymbol{\eta}\) is a parameter vector of length \(1\leq k<\infty\), and \(\boldsymbol{\mathcal{B}}\) summarizes your background information. A Bayesian analysis with the same sampling model would add a prior distribution layer of the form \((\boldsymbol{\eta}|\boldsymbol{\mathcal{B}})\sim p(\boldsymbol{\eta}|\boldsymbol{\mathcal{B}})\) to the hierarchy. The Bernstein-von Mises theorem says that maximum-likelihood (ML) and Bayesian inferential conclusions will be similar in this setting if (a) \(n\) is large and (b) \(p(\boldsymbol{\eta}|\boldsymbol{\mathcal{B}})\) is diffuse, but the theorem does not provide guidance on how large \(n\) needs to be for its conclusion to hold in any specific sampling model.

  3. (25%) Method-of-moments (MoM) estimators are as efficient as ML estimators, but MoM estimators have the disadvantage of often not being expressible in closed form, whereas ML estimators always have closed-form algebraic expressions.

  4. (25%) When your sampling model has n observations and a single parameter \(\theta\) (so that \(k=1\) in the notation of part 2), if the sampling model is regular (i.e., if the range of possible data values doesn’t depend on \(\theta\)), in large samples the observed information \(\hat{I}(\hat{\theta}_{MLE})\) is \(O(n)\), meaning that (a) information in \(\hat{\theta}_{MLE}\) about \(\theta\) increases linearly with \(n\) and (b) \(\hat{V}(\hat{\theta}_{MLE})=O(\frac{1}{n})\).

Proof. 1. The first part is true, because boundary misbehavior cannot occur when the parameter space has no boundaries; when such boundaries exist (e.g., when a parameter is strictly positive) unbiased estimators can easily be absurd (e.g., by taking on negative values). Without boundary misbehavior, unbiased estimators can indeed provide reasonably good estimates; an example is the sample mean in the i.i.d. Gaussian model with unknown mean and variance.

  2. Both the first and second parts of this statement are true. The first part has strong implications for inference, in that it explains why and when Bayesian and frequentist inferential answers will be similar; the second part is unfortunate but true, because the theorem is silent on the rate at which the frequentist repeated-sampling distribution \(p(\hat{\boldsymbol{\eta}}|\boldsymbol{\eta},\boldsymbol{\mathcal{B}})\) and the Bayesian posterior distribution \(p(\boldsymbol{\eta}|\mathbf{y},\boldsymbol{\mathcal{B}})\) both approach normality.

  3. All three parts of this statement are false: MoM estimators are often (much) less efficient than ML estimators; MoM estimators often do in fact have closed-form expressions; and there are many settings in which ML estimators have to be solved for iteratively rather than possessing explicit formulas.

  4. Both (a) and (b) are true, and (b) has strong implications for inference, in that it implies that uncertainty about a real-valued parameter decreases, as \(n\) increases, at a \(\frac{1}{\sqrt{n}}\) rate, as illustrated by the sketch below.
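
A minimal illustration of statement 4 (a sketch assuming an i.i.d. exponential sampling model with rate \(\theta\), an arbitrary choice): the observed information at the MLE grows linearly in \(n\), so the estimated variance of the MLE shrinks like \(1/n\).

```python
import numpy as np

rng = np.random.default_rng(5)
theta = 2.0                      # illustrative rate parameter
for n in [10, 100, 1000, 10_000]:
    x = rng.exponential(1 / theta, size=n)
    mle = 1 / x.mean()           # MLE of the rate
    obs_info = n / mle**2        # -d^2/dtheta^2 of the log-likelihood at the MLE
    print(n, round(obs_info / n, 4), round(1 / obs_info, 6))  # info/n stabilizes; 1/info ~ 1/n
```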

Exercise 2.7 (By Draper, FYE 2013) Consider the following model for how long you will have to wait for an event \(E\), of interest to you, to happen: \[\begin{equation} p(t_i|\theta)=\frac{\theta}{t_i^2}I(t_i\geq\theta) \tag{2.15} \end{equation}\]
for some \(\theta>0\), where \(I(A)\) denotes the indicator function for set \(A\). You obtain a (conditionally i.i.d., given \(\theta\)) random sample \(t=(t_1,\cdots,t_n)\) of these waiting times, and your goal is to estimate \(\theta\) and attach a well-calibrated measure of uncertainty to your estimate.

(1). (15%) Find the method-of-moments estimate of \(\theta\) based on \(t\).

(2). (30%) Identify a sufficient statistic for \(\theta\), and show that the maximum likelihood estimate of \(\theta\) based on \(t\) is given by \(\hat{\theta}_n=\min_it_i\).

(3). (20%) Using the following fact, show that the repeated-sampling distribution of \(\hat{\theta}_n\) is \[\begin{equation} p(\hat{\theta}_n|\theta)=\frac{n\theta^n}{(\hat{\theta}_n)^{n+1}}I(\hat{\theta}_n\geq\theta) \tag{2.16} \end{equation}\]
Fact: Let \(Y_{(1)},\cdots,Y_{(n)}\) denote the order statistics of an i.i.d. random sample \((Y_1,\cdots,Y_n)\) from a continuous population with c.d.f. \(F_Y(y)\) and p.d.f. \(f_Y(y)\). Then the PDF of \(Y_{(i)}\) is \[\begin{equation} f_{Y_{(i)}}(y)=\frac{n!}{(i-1)!(n-i)!}f_Y(y)[F_Y(y)]^{i-1}[1-F_Y(y)]^{n-i} \tag{2.17} \end{equation}\]

(4). (35%) Use equation (2.16) to construct an equal-tailed \(100(1-\alpha)\%\) confidence interval for \(\theta\) that is valid for all \(n\geq 1\). Hint: Is there a pivotal quantity suggested by the form of (2.16)?

Exercise 2.8 (By Draper, FYE 2012) Consider estimating the number \(0<N<\infty\) of individuals in a finite population (such as \(\mathcal{P}=\{\) the deer living on the UCSC campus as of 1 July 2012\(\}\)). One popular method for performing this estimation is capture-recapture sampling; the simplest version of this approach proceeds as follows. In stage I, a random sample of \(m_0\) individuals is taken, and all of these individuals are tagged and released; then, a short time later, in stage II a second independent random sample of \(n_1\) individuals is taken, and the number \(m_1\) of these \(n_1\) individuals who were previously tagged is noted.

There are a number of ways to perform the random sampling in stages I and II; the least complicated methods are i.i.d. sampling (at random with replacement) and simple random sampling (SRS; at random without replacement). For the rest of the problem, suppose you have decided to use SRS at stage I and i.i.d. sampling at stage II, which will be denoted (SRS,i.i.d.).

(a). (10%) Briefly explain why the following sampling model follows naturally from the scientific context of the problem under (SRS,i.i.d.): \[\begin{equation} (m_1|N)\sim Bin(n_1,\frac{m_0}{N}) \tag{2.18} \end{equation}\]

(b). (Estimation of \(N\))

  • (i). (20%) Show that in model (2.18) the method-of-moments estimator \(\hat{N}_{MoM}\) and the maximum-likelihood estimator \(\hat{N}_{MLE}\) coincide and are given by \[\begin{equation} \hat{N}_{MoM}=\hat{N}_{MLE}=\hat{N}=\frac{n_1m_0}{m_1} \tag{2.19} \end{equation}\]
    (Give a full argument for why (2.19) is the global maximum of the likelihood or log-likelihood function.)

  • (ii). (6%) Given the nature of (SRS,i.i.d.) sampling, why is this estimator intuitively sensible? Explain briefly.

(c). (Uncertainty assessment)

  • (i). (17%) Use the \(\Delta-\)method to show that in this model the repeated-sampling variance of \(\hat{N}\) is approximately \[\begin{equation} V_{RS}(\hat{N})\approx \frac{N^2(N-m_0)}{n_1m_0} \tag{2.20} \end{equation}\]

  • (ii). (6%) If \(m_0\to N\) were physically possible to attain in sampling from \(\mathcal{P}\) using (SRS,i.i.d.), briefly explain why it makes good scientific sense that (2.20) goes to 0 in this limit.

  • (iii). (6%) Does it make good scientific sense that (2.20) also goes to 0 under a scenario in which \(m_0\) is held fixed and \(n_1\to\infty\)? Explain briefly.

  • (iv). (15%) Use observed Fisher information to compute an approximation to the repeated-sampling estimated variance \(\hat{V}_{RS}(\hat{N})\) of \(\hat{N}_{MLE}\), thereby showing that this approximation coincides with the obvious estimate (2.20) in this case.

  • (v). (20%) Use your calculation in part (c) (iv) to give an approximate \(95\%\) confidence interval for \(N\), and explain briefly under what conditions you expect this interval to be close to frequentist-valid (and why).

Exercise 2.9 (By Draper, FYE 2012 Retake) Consider a model in which the elements of your data vector \(\mathbf{y}=(y_1,\cdots,y_n)\) are conditionally i.i.d. (given an unknown \(-1<\theta<1\)) with marginal sampling distribution \[\begin{equation} p(y_i|\theta)=\frac{1}{2}(1+\theta y_i),\quad -1<y_i<1 \tag{2.21} \end{equation}\] and 0 otherwise.

  1. (15%) Sketch the marginal sampling distribution \(p(y_i|\theta)\) in (2.21) as a function of \(y_i\) for fixed \(\theta\), taking care to distinguish its various shapes as \(\theta\) varies from \(-1\) to 1.

  2. (20%) Work out the repeated-sampling mean \(E(y_i|\theta)\) of this distribution, and use this to show that the method-of-moments estimator of \(\theta\) in this model is \(\hat{\theta}_{MoM}=3\bar{y}_n\), where \(\bar{y}_n=\frac{1}{n}\sum_{l=1}^ny_l\). Briefly explain what may go wrong with this estimator, and identify when this unpleasant behavior is most likely to occur.

  3. (10%) Is \(\hat{\theta}_{MoM}\) consistent for \(\theta\) in this model? Briefly justify your answer.

  4. (25%) Work out the log-likelihood function based on (2.21), and briefly explain how you would use it to find numerically the maximum likelihood estimator \(\hat{\theta}_{MLE}\) (do not try to solve for the MLE in closed form). Is there a one-dimensional sufficient statistic here? Explain briefly.

  5. (15%) It can be shown (you do not have to show this) that the repeated-sampling variance \(V(y_i|\theta)\) of \(y_i\) in this model is \[\begin{equation} V(y_i|\theta)=\frac{1}{3}-\frac{\theta^2}{9} \tag{2.22} \end{equation}\] Use this fact to work out an estimated standard error \(\hat{SE}(\hat{\theta}_{MoM})\) based on \(\hat{\theta}_{MoM}\). Briefly explain what may go wrong with this estimated standard error, and identify when this unpleasant behavior is most likely to occur.

  6. (15%) Write down a closed-form expression for a large-sample estimated standard error \(\hat{SE}(\hat{\theta}_{MLE})\) for \(\hat{\theta}_{MLE}\). Appealing to basic properties of the MLE, how do you expect \(\hat{SE}(\hat{\theta}_{MLE})\) to compare with \(\hat{SE}(\hat{\theta}_{MoM})\) for large \(n\)? Explain briefly.

Exercise 2.10 (By Abel, FYE 2010) For each of the following statements, decide if it is true or false. You must briefly justify your answer (short proof, counterexample and/or argument).

  1. (\(\frac{100}{6}\)%) Let \(\Phi(x)\) be the cumulative distribution function of the standard normal distribution and let \(Z\sim N(0, 1)\). Then \(E(\Phi(Z))=1/2\).

  2. (\(\frac{100}{6}\)%) If \(\tilde{\theta}\) is an unbiased estimator of \(\theta\), then \(\tilde{\phi}=1/\tilde{\theta}\) is an unbiased estimator of \(\phi=1/\theta\).

  3. (\(\frac{100}{6}\)%) Consider a statistical model where \(Y_{ijk}\sim N(\alpha_i\beta_j,\sigma^2)\) for \(i=1,\cdots,5,j=1,\cdots,8,k=1,\cdots,10\) and where \(\{\alpha_i\}_{i=1}^5,\{\beta_j\}_{j=1}^8\) and \(\sigma^2\) are all unknown. The resulting model is not identifiable.

  4. (\(\frac{100}{6}\)%) Let \(X_1,\cdots,X_n\) be a random sample from a continuous distribution with density \(f(x|\theta)\) that depends on a unidimensional parameter \(\theta\). If \(n>2\), every minimal sufficient statistic for the problem has dimension \(k<n\).

  5. (\(\frac{100}{6}\)%) Let \(X_1,\cdots,X_n\) be a random sample where \(X_i\sim p(\cdot|\theta)\). The uniformly most powerful test of level \(\alpha\) to contrast \(H_0: \theta=\theta_0\) vs. \(H_a: \theta=\theta_1\) that is based on the likelihood ratio test is obtained by rejecting \(H_0\) if \[\begin{equation} \Lambda=\frac{\prod_{i=1}^np(x_i|\theta_1)}{\prod_{i=1}^np(x_i|\theta_0)}>k \tag{2.23} \end{equation}\] for some constant \(k\) such that \(Pr(\Lambda>k|H_0)=\alpha\).

  6. (\(\frac{100}{6}\)%) The power and the level of a statistical test must sum up to one.

Proof. 1. True. Note that, with the substitution \(t=\Phi(z)\), \[\begin{equation} E(\Phi(Z))=\int_{-\infty}^{\infty}\Phi(z)\phi(z)dz=\int_{0}^1t\,dt=\frac{1}{2} \tag{2.24} \end{equation}\]

  2. False. For a counterexample, take \(X_1,\cdots,X_n\stackrel{i.i.d.}{\sim}Exp(\lambda)\) with mean \(\lambda\) and \(\hat{\lambda}=\bar{X}\). Then \(E(\hat{\lambda})=\lambda\), but \(E(1/\bar{X})=n/\{(n-1)\lambda\}\neq 1/\lambda\).

  3. True. For example, \(\alpha_i=2\) and \(\beta_j=3\) lead to the same likelihood as \(\alpha_i=1\) and \(\beta_j=6\).

  4. False. For example, for the double exponential and the t distributions with unknown location parameter, the minimal sufficient statistic is the vector of order statistics, which has dimension \(n\).

  5. True. This is literally the statement of the Neyman-Pearson lemma.

  6. False. The level is the probability of rejecting \(H_0\) when it is true, while the power is the probability of rejecting \(H_0\) under the alternative; nothing forces these two numbers to sum to one (for instance, the trivial test that never rejects has level 0 and power 0).

Exercise 2.11 (By Abel, FYE 2009) For each one of the following statements, decide if it is true or false. You must justify your answer with a short proof, counterexample, or reference to a standard theorem.

  1. (\(\frac{100}{6}\)%) If \(\tilde{\theta}(X)\) is an unbiased estimator for \(\theta\), it is also consistent.

  2. (\(\frac{100}{6}\)%) If the maximum likelihood estimator exists and is unique, then it is a function of a minimal sufficient statistic for the problem.

  3. (\(\frac{100}{6}\)%) Let \(X_1,\cdots,X_n\) be a random sample where \(X_i\sim Unif(0,\theta)\). Suppose you want to test \(H_0:\theta=\theta_0\) vs. \(H_a: \theta\neq\theta_0\) using the likelihood ratio test. Then as \(n\to\infty\), \(-2\log\Lambda\stackrel{D}{\to}\chi_1^2\), where \(\Lambda\) is the likelihood ratio.

  4. (\(\frac{100}{6}\)%) In hypothesis testing, the p-value is the probability that \(H_0\) is true.

  5. (\(\frac{100}{6}\)%) Let \(X_1,\cdots,X_m\) be a random sample where \(X_i\sim Bin(n,\theta)\). The maximum likelihood estimator for \(\phi=\log(\frac{\theta}{1-\theta})\) is \(\hat{\phi}=\log(\frac{\bar{x}}{n-\bar{x}})\).

  6. (\(\frac{100}{6}\)%) Let \(X_1,\cdots,X_n\) be a random sample where \(X_i\sim p(\cdot|\theta)\), and assume that \(p(\cdot|\theta)\) satisfies the usual regularity conditions. If an unbiased estimator \(\tilde{\theta}\) for \(\theta\) has variance equal to \(1/E([\frac{\partial}{\partial\theta}\log\prod_{i=1}^np(x_i|\theta)]^2)\), then it is the minimum variance unbiased estimator.

Proof. 1. False. Suppose \(E(X_i)=\theta\) and take \(\tilde{\theta}(X)=X_1\): it is unbiased, but its distribution does not concentrate around \(\theta\) as \(n\to\infty\), so it is not consistent.

  2. True. By the factorization theorem, \(L(\theta|\mathbf{x})=g(T(\mathbf{x})|\theta)h(\mathbf{x})\) for a minimal sufficient statistic \(T\), so the maximizer in \(\theta\) depends on the data only through \(T(\mathbf{x})\).

  3. False. The support of the density depends on \(\theta\), so the regularity conditions behind the asymptotic \(\chi_1^2\) distribution of \(-2\log\Lambda\) do not apply.

  4. False. The p-value is the probability, computed under \(H_0\), of observing a test statistic at least as extreme as the one actually obtained; it is not the probability that \(H_0\) is true.

  5. True. By the invariance of the MLE: the MLE of \(\theta\) is \(\bar{x}/n\), so the MLE of \(\phi\) is \(\log(\frac{\bar{x}/n}{1-\bar{x}/n})=\log(\frac{\bar{x}}{n-\bar{x}})\).

  6. True. The stated variance is the Cramér-Rao lower bound, so an unbiased estimator attaining it has the smallest possible variance among all unbiased estimators and is therefore the UMVUE.