3.1 Past FYE Problems

Exercise 3.1 (By Rajarshi, FYE 2020) Let the response \(y_i\) and the \(p\)-dimensional predictor \(\mathbf{x}_i=(x_{i,1},\cdots,x_{i,p})^{\prime}\) obey the linear regression framework \[\begin{equation} y_i=\mathbf{x}_i^T\boldsymbol{\beta}+\epsilon_i,\quad \epsilon_i\stackrel{i.i.d.}{\sim} N(0,1), i=1,\cdots,n \tag{3.1} \end{equation}\] where \(n\) is the number of data points and \(\boldsymbol{\beta}=(\beta_1,\cdots,\beta_p)^{\prime}\) is a \(p\)-dimensional vector of unknown regression coefficients. The notation \(^{\prime}\) denotes the transpose of a matrix or a vector. Assume the following prior distribution on the parameters: \[\begin{equation} \beta_j|\lambda_j^2,\tau^2\stackrel{ind.}{\sim}N(0,\lambda^2_j\tau^2),\quad \lambda_j\stackrel{i.i.d.}{\sim}C^+(0,1),\quad \tau\sim C^+(0,1) \tag{3.2} \end{equation}\]
where \(C^+(0,1)\) is the half-Cauchy distribution on the positive real line. We further assume that the \(\lambda_j\), \(j=1,\cdots,p\), and \(\tau\) are mutually independent.

  • (i). (15%) Write down the full posterior distribution of the parameters.

  • (ii). (45%) Fact: \(\alpha^2|\eta\sim Inv-Gamma(1/2,1/\eta)\), and \(\eta\sim Inv-Gamma(1/2,1)\) together imply \(\alpha\sim C^+(0,1)\). Use the above fact to design a data augmentation procedure by introducing latent variables so that all full conditional posterior distributions appear in familiar distributional forms. Write down the full conditional distributions of all parameters (including the latent variables).

  • (iii). (25%) Describe a Gibbs sampler for drawing samples from the posterior distribution of the parameters. Discuss choices of initial values.

  • (iv). (15%) Now assume that except for \(\boldsymbol{\beta}\), all other parameters are fixed and known. Consider the loss function \(L(\boldsymbol{\beta},\mathbf{D})=(\boldsymbol{\beta}-\mathbf{D})^{\prime}(\boldsymbol{\beta}-\mathbf{D})\), where \(\mathbf{D}\) is the decision rule. Derive the Bayes estimator for \(\boldsymbol{\beta}\).

The density function of the half-Cauchy distribution is given by \[\begin{equation} f(x)=\frac{2}{\pi}\frac{1}{1+x^2},\quad x>0 \tag{3.3} \end{equation}\]

Proof. (i). The full posterior distribution for model parameters is \[\begin{equation} \begin{split} f(\boldsymbol{\beta},\boldsymbol{\lambda},\tau|data)&\propto f(\mathbf{y}|\boldsymbol{\beta},\boldsymbol{\lambda},\tau)f(\boldsymbol{\beta}|\boldsymbol{\lambda},\tau)f(\boldsymbol{\lambda})f(\tau)\\ &\propto \prod_{i=1}^nN(y_i|\mathbf{x}_i^T\boldsymbol{\beta},1)\prod_{j=1}^pN(\beta_j|0,\lambda_j^2\tau^2)\prod_{j=1}^pC^+(\lambda_j|0,1)C^+(\tau|0,1)\\ &\propto \prod_{i=1}^n \exp(-\frac{(y_i-\mathbf{x}_i^T\boldsymbol{\beta})^2}{2})\prod_{j=1}^p(\lambda_j^2\tau^2)^{-\frac{1}{2}}\exp(-\frac{\beta_j^2}{2\lambda_j^2\tau^2})\prod_{j=1}^p\frac{1}{1+\lambda_j^2}\times\frac{1}{1+\tau^2} \end{split} \tag{3.4} \end{equation}\]

(ii). Introduce latent variables \(\eta_j\), \(j=1,\cdots,p\), and \(\omega\) such that \(\lambda_j^2|\eta_j\sim Inv-Gamma(\frac{1}{2},\frac{1}{\eta_j})\), \(\eta_j\sim Inv-Gamma(\frac{1}{2},1)\), and likewise \(\tau^2|\omega\sim Inv-Gamma(\frac{1}{2},\frac{1}{\omega})\), \(\omega\sim Inv-Gamma(\frac{1}{2},1)\). The posterior under the augmented model is \[\begin{equation} \begin{split} f(\boldsymbol{\beta},\boldsymbol{\lambda},\tau,\boldsymbol{\eta},\omega|data)&\propto\prod_{i=1}^n \exp(-\frac{(y_i-\mathbf{x}_i^T\boldsymbol{\beta})^2}{2})\prod_{j=1}^p(\lambda_j^2\tau^2)^{-\frac{1}{2}}\exp(-\frac{\beta_j^2}{2\lambda_j^2\tau^2})\\ &\times\prod_{j=1}^p(\frac{1}{\eta_j})^{\frac{1}{2}}(\lambda_j^2)^{-\frac{1}{2}-1}\exp(-\frac{1/\eta_j}{\lambda_j^2})(\eta_j)^{-\frac{1}{2}-1}\exp(-\frac{1}{\eta_j})\\ &\times(\frac{1}{\omega})^{\frac{1}{2}}(\tau^2)^{-\frac{1}{2}-1}\exp(-\frac{1/\omega}{\tau^2})\omega^{-\frac{1}{2}-1}\exp(-\frac{1}{\omega}) \end{split} \tag{3.5} \end{equation}\]

Therefore, denoting the full conditional of a generic parameter \(\alpha\) by \(\alpha|\cdots\) to simplify notation, the full conditionals are as follows.

(1). For \(\beta_j\), collecting the terms involving \(\beta_j\) and completing the square gives \[\begin{equation} f(\beta_j|\cdots)\propto \exp\left(-\frac{(\beta_j-\frac{\lambda_j^2\tau^2\sum_{i=1}^nx_{ij}(y_i-\mathbf{x}^T_{i,-j}\boldsymbol{\beta}_{-j})}{\lambda_j^2\tau^2\sum_{i=1}^nx^2_{ij}+1})^2}{2\lambda_j^2\tau^2/(\lambda_j^2\tau^2\sum_{i=1}^nx^2_{ij}+1)}\right) \tag{3.6} \end{equation}\] where \(\mathbf{x}_{i,-j}\) and \(\boldsymbol{\beta}_{-j}\) drop the \(j\)th entry. Thus, the full conditional of \(\beta_j\) is \(N(\frac{\lambda_j^2\tau^2\sum_{i=1}^nx_{ij}(y_i-\mathbf{x}^T_{i,-j}\boldsymbol{\beta}_{-j})}{\lambda_j^2\tau^2\sum_{i=1}^nx^2_{ij}+1},\frac{\lambda_j^2\tau^2}{\lambda_j^2\tau^2\sum_{i=1}^nx^2_{ij}+1})\), for \(j=1,\cdots,p\).

(2). For \(\lambda_j^2\), \[\begin{equation} f(\lambda_j^2|\cdots)\propto (\lambda_j^2)^{-1-1}\exp(-(\frac{\beta_j^2}{2\tau^2}+\frac{1}{\eta_j})\frac{1}{\lambda_j^2}) \tag{3.7} \end{equation}\]
Thus, \(\lambda_j^2|\cdots\sim Inv-Gamma(1,\frac{\beta_j^2}{2\tau^2}+\frac{1}{\eta_j})\) for \(j=1,\cdots,p\).

(3). For \(\eta_j\), \[\begin{equation} f(\eta_j|\cdots)\propto (\eta_j)^{-1-1}\exp(-(\frac{1}{\lambda_j^2}+1)\frac{1}{\eta_j}) \tag{3.8} \end{equation}\] Thus, \(\eta_j|\cdots\sim Inv-Gamma(1,1+\frac{1}{\lambda^2_j})\) for \(j=1,\cdots,p\).

(4). \(\tau^2|\cdots\sim Inv-Gamma(\frac{p+1}{2},\sum_{j=1}^p\frac{\beta_j^2}{2\lambda_j^2}+\frac{1}{\omega})\) because \[\begin{equation} f(\tau^2|\cdots)\propto (\tau^2)^{-\frac{p+1}{2}-1}\exp(-(\sum_{j=1}^p\frac{\beta_j^2}{2\lambda_j^2}+\frac{1}{\omega})\frac{1}{\tau^2}) \tag{3.9} \end{equation}\]

(5). Finally, for \(\omega|\cdots\), we have \[\begin{equation} f(\omega|\cdots)\propto \omega^{-1-1}\exp(-(\frac{1}{\tau^2}+1)\frac{1}{\omega}) \tag{3.10} \end{equation}\] so the full conditional distribution of \(\omega\) is \(Inv-Gamma(1,\frac{1}{\tau^2}+1)\).

(iii). In the Gibbs sampler, given initial values \(\beta_j^{(0)},\lambda_j^{(0)},\eta_j^{(0)}\), \(j=1,\cdots,p\), \(\tau^{(0)}\), and \(\omega^{(0)}\), at each iteration \(k=1,\cdots,S\) we draw the \(k\)th sample of the parameters \((\boldsymbol{\beta},\boldsymbol{\lambda},\tau,\boldsymbol{\eta},\omega)\) from their full conditionals as follows:

(1). Draw \(\beta_j^{(k)}\) from \(N(\frac{(\lambda^{(k-1)}_j)^2(\tau^{(k-1)})^2\sum_{i=1}^nx_{ij}(y_i-\mathbf{x}^T_{i,-j}\boldsymbol{\beta}^{(k-1)}_{-j})}{(\lambda^{(k-1)}_j)^2(\tau^{(k-1)})^2\sum_{i=1}^nx^2_{ij}+1},\frac{(\lambda^{(k-1)}_j)^2(\tau^{(k-1)})^2}{(\lambda^{(k-1)}_j)^2(\tau^{(k-1)})^2\sum_{i=1}^nx^2_{ij}+1})\), for \(j=1,\cdots,p\) (using the most recently updated values of the other coordinates in \(\boldsymbol{\beta}_{-j}\)).

(2). Draw \((\lambda_j^{(k)})^2\) from \(Inv-Gamma(1,\frac{(\beta^{(k)}_j)^2}{2(\tau^{(k-1)})^2}+\frac{1}{\eta^{(k-1)}_j})\), for \(j=1,\cdots,p\).

(3). Draw \(\eta_j^{(k)}\) from \(Inv-Gamma(1,1+\frac{1}{(\lambda^{(k)}_j)^2})\) for \(j=1,\cdots,p\).

(4). Draw \((\tau^{(k)})^2\) from \(Inv-Gamma(\frac{p+1}{2},\sum_{j=1}^p\frac{(\beta^{(k)}_j)^2}{2(\lambda^{(k)}_j)^2}+\frac{1}{\omega^{(k-1)}})\).

(5). Draw \(\omega^{(k)}\) from \(Inv-Gamma(1,\frac{1}{(\tau^{(k)})^2}+1)\).

We can set the initial values for \(\beta_j\), \(j=1,\cdots,p\), to the OLS estimate \(\hat{\boldsymbol{\beta}}^{OLS}=(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}\) from the linear model \(\mathbf{Y}=\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\epsilon}\) with \(\boldsymbol{\epsilon}\sim N(\mathbf{0},\mathbf{I})\). The initial values for the other parameters can be any points in their support. The choice does not matter much because we discard a burn-in period and the Gibbs sampler converges to the true posterior distribution when the chain is run long enough. Thus, we can simply set the initial values of the other parameters to 1.
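Steps (1)-(5) can be prototyped in a few lines. The following Python sketch is an illustration added here (not part of the original solution); it assumes `X` is the \(n\times p\) design matrix and `y` the response vector as NumPy arrays, obtains inverse-gamma draws by inverting gamma draws, and uses the OLS (pseudo-inverse) start for \(\boldsymbol{\beta}\) and 1 for all other parameters.

```python
import numpy as np

def sample_inv_gamma(rng, shape, scale):
    # If G ~ Gamma(shape, rate=scale), then 1/G ~ Inv-Gamma(shape, scale).
    return 1.0 / rng.gamma(shape, 1.0 / scale)

def horseshoe_gibbs(X, y, n_iter=5000, seed=0):
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    beta = np.linalg.pinv(X) @ y          # OLS-type start for beta
    lam2 = np.ones(p)                     # lambda_j^2
    eta = np.ones(p)
    tau2, omega = 1.0, 1.0
    xtx_diag = (X ** 2).sum(axis=0)       # sum_i x_ij^2
    draws = np.empty((n_iter, p))
    for k in range(n_iter):
        # (1) beta_j | ... updated one coordinate at a time
        for j in range(p):
            r = y - X @ beta + X[:, j] * beta[j]       # y_i - x_{i,-j}' beta_{-j}
            var = 1.0 / (xtx_diag[j] + 1.0 / (lam2[j] * tau2))
            beta[j] = rng.normal(var * (X[:, j] @ r), np.sqrt(var))
        # (2) lambda_j^2 | ... ~ Inv-Gamma(1, beta_j^2/(2 tau^2) + 1/eta_j)
        lam2 = sample_inv_gamma(rng, 1.0, beta ** 2 / (2.0 * tau2) + 1.0 / eta)
        # (3) eta_j | ... ~ Inv-Gamma(1, 1 + 1/lambda_j^2)
        eta = sample_inv_gamma(rng, 1.0, 1.0 + 1.0 / lam2)
        # (4) tau^2 | ... ~ Inv-Gamma((p+1)/2, sum_j beta_j^2/(2 lambda_j^2) + 1/omega)
        tau2 = sample_inv_gamma(rng, (p + 1) / 2.0,
                                np.sum(beta ** 2 / (2.0 * lam2)) + 1.0 / omega)
        # (5) omega | ... ~ Inv-Gamma(1, 1 + 1/tau^2)
        omega = sample_inv_gamma(rng, 1.0, 1.0 + 1.0 / tau2)
        draws[k] = beta
    return draws
```

In practice an initial burn-in portion of `draws` would be discarded before summarizing the posterior.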

(iv). With all parameters other than \(\boldsymbol{\beta}\) fixed and known, we have \[\begin{equation} f(\boldsymbol{\beta}|data)\propto \exp(-\frac{(\mathbf{y}-\mathbf{X}\boldsymbol{\beta})^T(\mathbf{y}-\mathbf{X}\boldsymbol{\beta})}{2})\times\exp(-\frac{\boldsymbol{\beta}^T\Sigma^{-1}\boldsymbol{\beta}}{2}) \tag{3.11} \end{equation}\] where \(\Sigma\) is the diagonal matrix \(\Sigma=diag(\lambda_1^2\tau^2,\cdots,\lambda_p^2\tau^2)\). Thus, by completing the square we have \(\boldsymbol{\beta}|data\sim N((\mathbf{X}^T\mathbf{X}+\Sigma^{-1})^{-1}\mathbf{X}^T\mathbf{y},(\mathbf{X}^T\mathbf{X}+\Sigma^{-1})^{-1})\). For the quadratic loss function, the Bayes estimator is the posterior mean. Thus, \(D^{(\pi)}=(\mathbf{X}^T\mathbf{X}+\Sigma^{-1})^{-1}\mathbf{X}^T\mathbf{y}\).
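As a numerical companion (an illustrative sketch added here, with `X`, `y`, `lam2`, and `tau2` assumed given as NumPy arrays/scalars), the Bayes estimator in (iv) is a single ridge-type linear solve:

```python
import numpy as np

def bayes_estimator(X, y, lam2, tau2):
    """Posterior mean (X'X + Sigma^{-1})^{-1} X'y with Sigma = diag(lam2 * tau2)."""
    prior_prec = np.diag(1.0 / (lam2 * tau2))   # Sigma^{-1}
    return np.linalg.solve(X.T @ X + prior_prec, X.T @ y)
```

Under the quadratic loss this posterior mean is \(D^{(\pi)}\); averaging the \(\boldsymbol{\beta}\) draws from a Gibbs run with \(\boldsymbol{\lambda}\) and \(\tau\) held fixed at the same values should agree with it.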

Exercise 3.2 (By Juhee, FYE 2018) We are given a coin and are interested in the probability \(\theta\) of observing heads when the coin is flipped. An experiment is conducted by flipping the coin (independently) in a series of trials. Suppose we have the observations of \(x\) heads and \(y\) tails (let \(n=x+y\)).

  1. (25%) Suppose that we do not have enough information to specify a model for the “series of trials” and consider two possibilities:
  • (Binomial) Suppose a fixed number of independent Bernoulli trials, \(n\), are performed and let \(y\) be the number of tails out of the \(n\) trials.

  • (Negative Binomial) Suppose that independent trials are performed until the \(y\)-th tail occurs and we let \(n\) denote the number of trials needed for this.

Will the inference for \(\theta\) be the same under the binomial and the negative binomial distribution models or not? Provide a full justification for your answer.

  2. (75%) Let \(x|\theta\sim Bin(n,\theta)\), with known \(n\). Consider a two-component mixture of beta distributions prior for \(\theta\), more specifically: \[\begin{equation} \theta\sim g_1(\theta)=\epsilon Beta(\alpha_1,\beta_1)+(1-\epsilon)Beta(\alpha_2,\beta_2) \tag{3.12} \end{equation}\] with \(0<\epsilon<1\) and \(\alpha_1,\beta_1,\alpha_2,\beta_2>0\).

(a). (25%) Find the marginal distribution \(m_1(x)\).

(b). (25%) Derive the posterior distribution for \(\theta\) given \(x\).

(c). (25%) Consider a Bayesian hypothesis test, \(H_0: \theta=\theta_0\) vs. \(H_1: \theta\neq\theta_0\), for specified \(\theta_0\). Assume the prior under the alternative is \(g_1(\theta)\) in (3.12). Let \(\rho_0\) be the prior probability that \(\theta=\theta_0\). Find the posterior probability of \(H_0\), \(\pi(H_0|x)\).

Exercise 3.3 (By Juhee, FYE 2015) Let \(X_i\) denote the number of fires in a town for week \(i\), \(i=1,\cdots,n\). Suppose the observations, \(\mathbf{x}=(x_1,\cdots,x_n)\) form an i.i.d. sample from a Poisson distribution with mean \(\theta\).

(1). (10%) Find the Jeffreys prior for \(\theta\), \(\pi_J(\theta)\).

(2). (20%) Use the prior in part (1) to find the resulting posterior distribution \(\pi_J(\theta|\mathbf{x})\).

(3). (30%) Under the posterior distribution in part (2), find the Bayes estimate of \(\theta\) with the following loss function \(L(\theta,a)=\theta(\theta-a)^2\).

(4). (40%) Now assume instead that the prior on \(\theta\) is given by an exponential distribution, with density \(\pi(\theta|\lambda)=\lambda\exp(-\lambda\theta)\) for known \(\lambda>0\). Suppose that, under the Poisson sampling model above and with the same sample \(\mathbf{x}\) of data, it is desired to test \(H_0:\theta\leq c\) versus \(H_1:\theta>c\), for some constant \(c>0\). Let \(F(t|\alpha,\beta)\) be the CDF of the \(Gamma(\alpha,\beta)\) distribution, in the parameterization in which the mean is \(\alpha/\beta\).

  • (a). (35%) Using \(F(t|\alpha,\beta)\) and having observed \(\mathbf{x}\), give an explicit expression for the resulting Bayes factor in favor of \(H_0\).

  • (b). (5%) If this Bayes factor worked out numerically to be \(6.25\), what would your interpretation be about the weight of data evidence for or against \(H_0\)?

Proof. (1). \[\begin{equation} f(\mathbf{x}|\theta)=\prod f(x_i|\theta)\propto \theta^s\exp(-n\theta) \tag{3.13} \end{equation}\] in which \(s=\sum_{i=1}^nx_i\). The log-likelihood is \(\ell(\theta)=s\log\theta-n\theta\) up to a constant, so \(\frac{\partial^2\ell(\theta)}{\partial\theta^2}=-\frac{s}{\theta^2}\) and the Fisher information is \[\begin{equation} I(\theta)=-E(-\frac{s}{\theta^2})=\frac{n\theta}{\theta^2}=\frac{n}{\theta}\propto \frac{1}{\theta} \tag{3.14} \end{equation}\] So the Jeffreys prior is \(\pi_J(\theta)\propto\frac{1}{\sqrt{\theta}}\).

(2). \[\begin{equation} \pi_J(\theta|\mathbf{x})\propto \theta^{s-1/2}\exp(-n\theta) \tag{3.15} \end{equation}\] This is \(Gamma(s+1/2,n)\), with the parameterization in which the kernel of the Gamma is \(\theta^{\alpha-1}e^{-\beta\theta}\) and the mean is \(\frac{\alpha}{\beta}\).

(3). As usual the optimal estimator minimizes posterior expected loss. Students will not be required to show that the continuity conditions are satisfied to allow differentiation under the integral sign. Having performed that operation, it now follows after some algebra that \[\begin{equation} \hat{\theta}=\frac{E(\theta^2|\mathbf{x})}{E(\theta|\mathbf{x})} \tag{3.16} \end{equation}\] Students now need to remember that, in the Gamma parameterization above, the variance is \(\frac{\alpha}{\beta^2}\), from which it follows that \[\begin{equation} E(\theta^2)=\frac{\alpha}{\beta^2}+(\frac{\alpha}{\beta})^2=\frac{\alpha(\alpha+1)}{\beta^2} \tag{3.17} \end{equation}\] and \[\begin{equation} \frac{E(\theta^2)}{E(\theta)}=\frac{\alpha+1}{\beta} \tag{3.18} \end{equation}\] Plugging in the posterior values \(\alpha=s+1/2\) and \(\beta=n\) yields \(\hat{\theta}=\frac{s+3/2}{n}\).
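For clarity, the omitted algebra is just the first-order condition for the posterior expected loss: \[\begin{aligned} \frac{d}{da}E[\theta(\theta-a)^2|\mathbf{x}]&=-2E[\theta(\theta-a)|\mathbf{x}]=-2\big(E(\theta^2|\mathbf{x})-aE(\theta|\mathbf{x})\big)=0\\ &\Longrightarrow \hat{\theta}=a=\frac{E(\theta^2|\mathbf{x})}{E(\theta|\mathbf{x})}, \end{aligned}\] and the second derivative \(2E(\theta|\mathbf{x})>0\) confirms this is a minimum.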

(4).

  • (a). Writing Bayes’s Theorem in odds form, with BF for the Bayes factor, \[\begin{equation} \frac{p(\theta\leq c|\mathbf{x})}{p(\theta> c|\mathbf{x})}=\frac{p(\theta\leq c)}{p(\theta> c)}\cdot BF \tag{3.19} \end{equation}\] from which \[\begin{equation} BF=\frac{p(\theta\leq c|\mathbf{x})p(\theta> c)}{p(\theta> c|\mathbf{x})p(\theta\leq c)} \tag{3.20} \end{equation}\] The \(Exp(\lambda)\) prior is just \(Gamma(1,\lambda)\) in the parameterization we are using, and with this prior (by conjugate updating) the posterior is \(Gamma(s+1,n+\lambda)\), so immediately \[\begin{equation} BF=\frac{F(c|s+1,n+\lambda)[1-F(c|1,\lambda)]}{[1-F(c|s+1,n+\lambda)]F(c|1,\lambda)} \tag{3.21} \end{equation}\] Some students will notice that \(F(c|1,\lambda)=1-e^{-c\lambda}\), but this is not required for full credit; a short numerical evaluation of this Bayes factor is sketched in code after part (b).

  • (b). In the qualitative scale proposed by Jeffreys, a Bayes factor of 6.25 in favor of \(H_0\) is substantial data evidence supporting the null hypothesis; in the words of Kass and Raftery it is positive evidence. But these interpretations are just conventional; the real meaning of 6.25 would depend on the consequences of any real-world decision that depended on whether \(\theta\leq c\). (The previous sentence would be nice to see in a student answer, but is not required for full credit.)
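The Bayes factor in (3.21) reduces to two gamma CDF evaluations, as in the following Python sketch (an illustration added here, not part of the exam key; `s` denotes \(\sum_i x_i\), and scipy's gamma distribution is parameterized by shape and scale, so the rate \(\beta\) enters as `scale=1/beta`):

```python
from scipy.stats import gamma

def bayes_factor_H0(s, n, lam, c):
    """Bayes factor in favor of H0: theta <= c, Poisson likelihood, Exp(lam) prior."""
    post = gamma.cdf(c, a=s + 1, scale=1.0 / (n + lam))   # F(c | s+1, n+lam) = P(theta <= c | x)
    prior = gamma.cdf(c, a=1, scale=1.0 / lam)            # F(c | 1, lam) = 1 - exp(-lam * c)
    return post * (1 - prior) / ((1 - post) * prior)
```

A returned value of about 6.25 would be interpreted as in part (b).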

Exercise 3.4 (By Juhee, FYE 2015 Retake) Suppose we observe \(x\) positive responses out of \(m\) Bernoulli trials, each assumed conditionally independent given an unknown, common success probability \(\theta\).

  1. (40%) Show that the Jeffreys prior for \(\theta\) is given by a \(Beta(0.5,0.5)\) distribution. Assume the Jeffreys prior and derive the posterior distribution of \(\theta\) given \(x\).

  2. (30%) Consider \(\phi=\log(\theta/(1-\theta))\).

  • (a). (10%) State the invariance property of Jeffreys prior distribution.

  • (b). (20%) Find the Jeffreys prior for \(\phi\).

  3. (30%) We plan to observe an additional \(n\) Bernoulli trials. Assume that our prior distribution for \(\theta\) is \(Beta(a,b)\), where \(a\) and \(b\) are positive real numbers. Find the predictive distribution for the future number of successes \(Y\).

Proof. 1. Since \(p(x|\theta)\propto\theta^x(1-\theta)^{m-x}\), we have \[\begin{equation} \begin{split} &\ell(\theta)\propto x\log(\theta)+(m-x)\log(1-\theta)\\ &\frac{\partial\ell(\theta)}{\partial\theta}=\frac{x}{\theta}-\frac{m-x}{1-\theta}\\ &\frac{\partial^2\ell(\theta)}{\partial\theta^2}=-\frac{x}{\theta^2}-\frac{m-x}{(1-\theta)^2}\\ &I(\theta)=-E(-\frac{x}{\theta^2}-\frac{m-x}{(1-\theta)^2})=m\theta^{-1}(1-\theta)^{-1} \end{split} \tag{3.22} \end{equation}\] Therefore, the Jeffreys prior is \(\pi(\theta)\propto\sqrt{I(\theta)}=\theta^{-1/2}(1-\theta)^{-1/2}\). That is, \(Beta(0.5,0.5)\).

Due to conjugacy, \(\pi(\theta|x)\) is \(Beta(x+1/2,m-x+1/2)\).

  2. (a). Consider a one-to-one transformation of \(\theta\), \(\phi=g(\theta)\). The corresponding prior on \(\phi\), \(\pi^*(\phi)=\pi(g^{-1}(\phi))|\frac{dg^{-1}(\phi)}{d\phi}|\), coincides with the Jeffreys prior derived directly from the likelihood parameterized in terms of \(\phi\); in this sense the Jeffreys prior is invariant under one-to-one reparameterization.

(b). \(\phi=\log(\frac{\theta}{1-\theta})\leftrightarrow\theta=\frac{e^{\phi}}{1+e^{\phi}}\). Thus, \[\begin{equation} \pi^*(\phi)\propto(\frac{e^{\phi}}{1+e^{\phi}})^{-1/2}(\frac{1}{1+e^{\phi}})^{-1/2}\frac{e^{\phi}}{(1+e^{\phi})^2}=\frac{e^{\phi/2}}{1+e^{\phi}} \tag{3.23} \end{equation}\]

  3. Due to conjugacy, \(\pi(\theta|x)\) is \(Beta(a+x,b+m-x)\). Then, for \(y=0,1,\cdots,n\), \[\begin{equation} \begin{split} p(y|x)&=\int_0^1\frac{\Gamma(a+b+m)}{\Gamma(a+x)\Gamma(b+m-x)}\theta^{a+x-1}(1-\theta)^{b+m-x-1}{n \choose y}\theta^y(1-\theta)^{n-y}d\theta\\ &=\frac{\Gamma(a+b+m)}{\Gamma(a+x)\Gamma(b+m-x)}\frac{\Gamma(n+1)}{\Gamma(y+1)\Gamma(n-y+1)}\\ &\times\frac{\Gamma(a+x+y)\Gamma(b+m-x+n-y)}{\Gamma(a+b+m+n)} \end{split} \tag{3.24} \end{equation}\] which is a beta-binomial distribution.
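The beta-binomial form in (3.24) is easy to verify numerically. The Python sketch below (an illustration added here, with arbitrary example values for \(a,b,m,x,n\)) evaluates \(p(y|x)\) on the log scale with `scipy.special.gammaln` and checks that the probabilities sum to one:

```python
import numpy as np
from scipy.special import gammaln

def log_pred(y, x, m, n, a, b):
    """log p(y | x): beta-binomial predictive for y successes in n future trials."""
    return (gammaln(a + b + m) - gammaln(a + x) - gammaln(b + m - x)
            + gammaln(n + 1) - gammaln(y + 1) - gammaln(n - y + 1)
            + gammaln(a + x + y) + gammaln(b + m - x + n - y)
            - gammaln(a + b + m + n))

# sanity check: the predictive probabilities sum to 1 over y = 0, ..., n
a, b, m, x, n = 1.0, 1.0, 10, 4, 5       # arbitrary example values
probs = np.exp([log_pred(y, x, m, n, a, b) for y in range(n + 1)])
assert np.isclose(probs.sum(), 1.0)
```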

Exercise 3.5 (By Raquel, FYE 2014) Assume a Bernoulli experiment in which you perform \(n\) independent trials with success probability \(\theta\), and \(X\) counts the number of successes. Then \(X|\theta\sim Bin(n,\theta)\) with \(n\) known.

  1. (20%) Assume you observe \(X=x\). Find the posterior density of \(\theta\), \(\pi(\theta|x)\), under a uniform prior \(\theta\sim Unif(0,1)\).

  2. (40%) Consider the loss function defined by \[\begin{equation} L(\theta,\hat{\theta})=\frac{(\hat{\theta}-\theta)^2}{\theta(1-\theta)} \tag{3.25} \end{equation}\] Find \(\hat{\theta}(x)\), the estimator that minimizes the Bayesian expected posterior loss under the scenario described in part 1. (Note: Assume \(x\neq 0\).)

  3. (40%) Find Jeffreys prior for \(\theta\) and the corresponding posterior distribution under such prior.

Proof. 1. Since \(\pi(\theta|x)\propto \theta^x(1-\theta)^{n-x}\), the posterior is a \(Beta(x+1,n-x+1)\).

  2. The expected posterior loss is \[\begin{equation} E(L(\theta,\hat{\theta})|X=x)=\int_0^1c\frac{(\hat{\theta}-\theta)^2}{\theta(1-\theta)}\theta^{x+1-1}(1-\theta)^{n-x+1-1}d\theta \tag{3.26} \end{equation}\] with \(c=\Gamma(n+2)/(\Gamma(x+1)\Gamma(n-x+1))\). To minimize this function we take derivatives w.r.t. \(\hat{\theta}\) and so \[\begin{equation} \frac{dE(L(\theta,\hat{\theta})|X=x)}{d\hat{\theta}}=2c\int_0^1(\hat{\theta}-\theta)\theta^{x-1}(1-\theta)^{n-x-1}d\theta=0 \tag{3.27} \end{equation}\] implies that the optimal Bayes estimator is \(\hat{\theta}=x/n\).

Note that the second derivative is \[\begin{equation} 2c\int_0^1\theta^{x-1}(1-\theta)^{n-x-1}d\theta>0 \tag{3.28} \end{equation}\] so the critical point is indeed a minimum.
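For clarity, solving the first-order condition in part 2 explicitly gives \[ \hat{\theta}=\frac{\int_0^1\theta\,\theta^{x-1}(1-\theta)^{n-x-1}d\theta}{\int_0^1\theta^{x-1}(1-\theta)^{n-x-1}d\theta}=\frac{B(x+1,n-x)}{B(x,n-x)}=\frac{x}{n}, \] using \(B(x+1,n-x)=\frac{x}{n}B(x,n-x)\); finiteness of both integrals requires \(x\neq 0\) (and likewise \(x\neq n\)), consistent with the note in the problem statement.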

  3. Jeffreys prior is given by \(\pi(\theta)\propto|I(\theta)|^{1/2}\) with \[\begin{equation} I(\theta)=-E_{X|\theta}\frac{d^2\log f(x|\theta)}{d\theta^2} \tag{3.29} \end{equation}\] In this case we obtain that \(\pi(\theta)\propto\theta^{-1/2}(1-\theta)^{-1/2}\), which corresponds to a \(Beta(1/2,1/2)\). The corresponding posterior is a \(Beta(x+1/2,n-x+1/2)\).

Exercise 3.6 (By Raquel, FYE 2014 Retake) Consider the estimation of the parameter \(\theta\in (0,\infty)\) under the loss function \[\begin{equation} L(\theta,d)=\frac{(\theta-d)^2}{\theta(\theta+1)} \tag{3.30} \end{equation}\] based on one observation \(X=x\) from the negative binomial distribution parameterized as \[\begin{equation} f(x|\theta)={{n+x-1} \choose x}\theta^x(\theta+1)^{-(n+x)},\quad x\in\{0,1,2,\cdots\} \tag{3.31} \end{equation}\] where \(n\) is known. Note that, under this parameterization, \(E(X|\theta)=n\theta\) and \(Var(X|\theta)=n\theta(\theta+1)\).

  1. (20%) Determine the risk function of the unbiased estimator \(\delta_0(x)=x/n\).

  2. (20%) Determine the risk function of the estimator \(\delta_1(x)=x/(n+1)\).

  3. (10%) Which of the two estimators (\(\delta_0(x)\) or \(\delta_1(x)\)) has the lower maximum risk? Justify your answer.

  4. (15%) Consider the following family of prior distributions for \(\theta\), \[\begin{equation} \pi(\theta)=\frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\theta^{a-1}(\theta+1)^{-(a+b)},\quad a>0,b>0 \tag{3.32} \end{equation}\] Obtain the posterior distribution for \(\theta\) under the prior in (3.32).

  5. (35%) Find the Bayes rule (estimator) under the loss function in (3.30) and the prior given in (3.32).

Some facts that may be useful:

  • Under the squared error loss function \(L(\theta,d)=(\theta-d)^2\) we have that \[\begin{equation} R(\theta,\delta(x))=Bias^2(\delta(x))+Variance(\delta(x)) \tag{3.33} \end{equation}\] with \(R(\theta,\delta(x))=\int L(\theta,\delta(x))f(x|\theta)dx\).

  • For the prior distribution in (3.32), we have that \[\begin{equation} E(\theta^c(\theta+1)^{-d})=\frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\frac{\Gamma(a+c)\Gamma(b+d-c)}{\Gamma(a+b+d)} \tag{3.34} \end{equation}\]

Proof. 1. First note that the risk under the loss function in (3.30) is the same as the risk under the squared error loss function divided by \(\theta(\theta+1)\). Therefore, \[\begin{equation} \begin{split} R(\theta,\delta_0(x))&=\frac{Bias^2(\delta_0(x))+Variance(\delta_0(x))}{\theta(\theta+1)}\\ &=\frac{0+\theta(\theta+1)/n}{\theta(\theta+1)}=\frac{1}{n} \end{split} \tag{3.35} \end{equation}\]

  2. Similarly, for \(\delta_1(x)=x/(n+1)\) we have \[\begin{equation} \begin{split} &Bias(\delta_1(x))^2=(E(\delta_1(x))-\theta)^2=\frac{\theta^2}{(n+1)^2}\\ &Variance(\delta_1(x))=\frac{n\theta(\theta+1)}{(n+1)^2} \end{split} \tag{3.36} \end{equation}\] Therefore, \[\begin{equation} \begin{split} R(\theta,\delta_1(x))&=\frac{Bias^2(\delta_1(x))+Variance(\delta_1(x))}{\theta(\theta+1)}\\ &=\frac{1}{(n+1)^2}(\frac{\theta}{\theta+1}+n) \end{split} \tag{3.37} \end{equation}\]

  3. Since \(\theta<\theta+1\), we have \(\frac{\theta}{\theta+1}<1\) for every \(\theta>0\), which implies that \[\begin{equation} R(\theta,\delta_1(x))<\frac{n+1}{(n+1)^2}=\frac{1}{n+1}<\frac{1}{n}=R(\theta,\delta_0(x)) \tag{3.38} \end{equation}\] Hence \(\delta_1(x)\) has the lower maximum risk.

  4. First we need to find the posterior distribution for \(\theta\). Multiplying the likelihood (3.31) by the prior (3.32) gives \(\pi(\theta|x)\propto\theta^{a+x-1}(\theta+1)^{-(a+b+n+x)}\), which has the form of the prior with parameters \(a^*=a+x\) and \(b^*=b+n\).

The Bayes estimator can be found by minimizing the expected posterior loss, i.e., by finding \(\hat{\delta}\) that minimizes \[\begin{equation} \begin{split} E(L(\theta,\delta)|x)&=\int\frac{(\theta-\delta)^2}{\theta(\theta+1)}\pi_{a^*,b^*}(\theta|x)d\theta\\ &=C-2\delta E((\theta+1)^{-1}|a^*,b^*)+\delta^2E(\theta^{-1}(\theta+1)^{-1}|a^*,b^*) \end{split} \tag{3.39} \end{equation}\] where \(C\) is a constant that does not depend on \(\delta\). Taking the first derivative with respect to \(\delta\) and setting it equal to zero we obtain \[\begin{equation} \hat{\delta}=\frac{E((\theta+1)^{-1}|a^*,b^*)}{E(\theta^{-1}(\theta+1)^{-1}|a^*,b^*)}=\frac{a+x-1}{b+n+1} \tag{3.40} \end{equation}\] and the second derivative is positive, so the Bayes estimator is \(\delta_{Bayes}(x)=\frac{a+x-1}{b+n+1}\).
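For clarity, the last equality in (3.40) follows from the moment formula (3.34) evaluated at the posterior parameters \((a^*,b^*)=(a+x,b+n)\), with \((c,d)=(0,1)\) and \((c,d)=(-1,1)\) respectively (assuming \(a^*>1\)): \[\begin{aligned} E\big((\theta+1)^{-1}|a^*,b^*\big)&=\frac{\Gamma(a^*+b^*)}{\Gamma(a^*)\Gamma(b^*)}\frac{\Gamma(a^*)\Gamma(b^*+1)}{\Gamma(a^*+b^*+1)}=\frac{b^*}{a^*+b^*},\\ E\big(\theta^{-1}(\theta+1)^{-1}|a^*,b^*\big)&=\frac{\Gamma(a^*+b^*)}{\Gamma(a^*)\Gamma(b^*)}\frac{\Gamma(a^*-1)\Gamma(b^*+2)}{\Gamma(a^*+b^*+1)}=\frac{b^*(b^*+1)}{(a^*-1)(a^*+b^*)}, \end{aligned}\] and the ratio of the two is \(\frac{a^*-1}{b^*+1}=\frac{a+x-1}{b+n+1}\).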

Exercise 3.7 (By Raquel, FYE 2013) Consider \(x|\theta\sim Bin(n,\theta)\) with \(n\) known.

  1. (45%) Find Jeffreys prior, \(\pi^J(\theta)\), and the corresponding posterior distribution \(\pi^J(\theta|x)\) for this model.

  2. (20%) Consider the quadratic loss function \(L(\theta,\theta^*)=(\theta-\theta^*)^2\). Under the assumptions in part (1), find the estimator \(\theta^J(x)\) that minimizes the Bayesian posterior loss.

  3. (35%) Let \(x^*|\theta\sim Bin(m,\theta)\) with \(m\) known. Under the assumptions in part (1), find the posterior predictive distribution, i.e., find \(p^J(x^*|x)\).

Exercise 3.8 (By Abel, FYE 2012) Let \(X_1,\cdots,X_n\) be an independent and identically distributed sample such that \(X_i\sim N(\theta,1)\), where \(\theta\) is unknown. Also, assume that it is known that \(\theta\geq 0\) and that an exponential prior with mean \(\lambda\) is reasonable.

  1. (50%) Compute the maximum a posteriori (MAP) estimator for \(\theta\).

  2. (15%) What utility function justifies the use of this estimator?

  3. (35%) Assuming that \(\bar{x}-\frac{1}{n\lambda}<0\), compute the \(95\%\) HPD credible interval for \(\theta\). Write your response in terms of \(\Phi\), the cumulative distribution function for the standard normal distribution, and its inverse.

Exercise 3.9 (By Abel, FYE 2012 Retake) Assume that, given unknown parameter \(\alpha>0\), your observations \(\mathbf{y}=(y_1,\cdots,y_n)\) arise i.i.d. from a Pareto distribution with density function \[\begin{equation} f(y|\alpha)=\alpha y^{-(\alpha+1)},\quad y>1 \tag{3.41} \end{equation}\]

  1. (25%) Derive the Jeffreys prior for \(\alpha\). (The Jeffreys prior is assumed for the remaining parts of the problem.)

  2. (25%) Show that the posterior for \(\alpha\) is given by a gamma-distribution and obtain its parameters.

  3. (25%) Derive the closed-form expression for the posterior predictive density, \(p(y_{new}|\mathbf{y})\), corresponding to a new observation \(y_{new}\).

  4. (25%) Consider the special case of a sample consisting of a single observation, i.e., \(n=1\) and \(\mathbf{y}=y_1\). Obtain the \(90\%\) highest posterior density interval for \(\alpha\).

Exercise 3.10 (By Herbie, FYE 2010) 1. (50%) Let \(X_1,\cdots, X_n\) be a random sample from a binomial with unknown probability of success \(\theta\), i.e., \(f(x|n,\theta)={n \choose x}\theta^x(1-\theta)^{n-x}\), and put a prior on \(\theta\) such that \(f(\theta)=\frac{4}{3}-\theta^2\) for \(\theta\in[0,1]\). Give an MCMC algorithm to sample from the posterior (please write out all of the complete conditionals and use as many Gibbs steps as practical).

  2. (50%) Suppose you are doing MCMC and you need to generate your own samples from an arbitrary binomial distribution with \(n=2\). Write out the steps for using the inverse CDF method to sample from a binomial distribution with \(n=2\) and \(p=\theta\).

Exercise 3.11 (By Herbie, FYE 2009) Suppose \(X_1,\cdots,X_n|\mu,\sigma^2\stackrel{i.i.d.}{\sim}N(\mu,\sigma^2)\) with both \(\mu\) and \(\sigma^2\) unknown, and suppose we use noninformative priors, \(f(\mu)\propto 1\) and \(f(\sigma^2)\propto\frac{1}{\sigma^2}\). Show that the marginal posterior distribution for \(\mu\) is a location-scale \(t\) with \(\nu=n-1\) degrees of freedom. How does this compare to the frequentist result for the sampling distribution of the MLE?

Proof. Recall that \(s^2=\frac{1}{n-1}\sum(x_i-\bar{x})^2\). The joint posterior is \[\begin{equation} \begin{split} f(\mu,\sigma^2|\mathbf{x})&\propto (2\pi\sigma^2)^{-n/2}\exp\{-\frac{1}{2\sigma^2}\sum(x_i-\mu)^2\}(\sigma^2)^{-1}\\ &\propto(\sigma^2)^{-n/2-1}\exp\{-\frac{1}{2\sigma^2}\sum(x_i-\mu)^2\} \end{split} \tag{3.42} \end{equation}\] and, integrating out \(\sigma^2\), \[\begin{equation} \begin{split} f(\mu|\mathbf{x})&\propto\int f(\mu,\sigma^2|\mathbf{x})d\sigma^2\\ &\propto \frac{\Gamma(n/2)}{[\frac{1}{2}\sum(x_i-\mu)^2]^{n/2}}\propto [\sum(x_i-\mu)^2]^{-n/2}\\ &\propto [n(\mu-\bar{x})^2+\sum x_i^2-n\bar{x}^2]^{-n/2}\propto[n(\mu-\bar{x})^2+(n-1)s^2]^{-n/2}\\ &\propto [1+\frac{1}{\frac{n-1}{n}s^2}(\mu-\bar{x})^2]^{-n/2} \end{split} \tag{3.43} \end{equation}\] Thus, the marginal posterior for \(\mu\) is a location-scale \(t\) with \(\nu=n-1\) degrees of freedom, location parameter \(\bar{x}\), and squared scale parameter \(s^2/n\). This matches the frequentist result that \(\frac{\bar{x}-\mu}{s/\sqrt{n}}\sim t_{n-1}\) is used as the sampling distribution for inference on \(\mu\): the same \(t\) form, but with \(\mu\) treated as random and the data fixed rather than the other way around.
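A quick Monte Carlo check of this result (an illustrative sketch added here, not part of the original solution): under these priors the standard forms \(\sigma^2|\mathbf{x}\sim Inv-Gamma(\frac{n-1}{2},\frac{(n-1)s^2}{2})\) and \(\mu|\sigma^2,\mathbf{x}\sim N(\bar{x},\sigma^2/n)\) allow composition sampling, and the resulting draws of \(\mu\) can be compared with the quantiles of the location-scale \(t_{n-1}(\bar{x},s/\sqrt{n})\) distribution:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(2.0, 3.0, size=30)        # simulated data (arbitrary example)
n, xbar, s2 = len(x), x.mean(), x.var(ddof=1)

# composition sampling from the joint posterior: sigma^2 | x, then mu | sigma^2, x
sigma2 = 1.0 / rng.gamma((n - 1) / 2.0, 2.0 / ((n - 1) * s2), size=100_000)
mu = rng.normal(xbar, np.sqrt(sigma2 / n))

# compare with the analytic marginal: location-scale t_{n-1}(xbar, s/sqrt(n))
t_marginal = stats.t(df=n - 1, loc=xbar, scale=np.sqrt(s2 / n))
print(np.quantile(mu, [0.025, 0.5, 0.975]))
print(t_marginal.ppf([0.025, 0.5, 0.975]))
```

The two sets of quantiles should agree up to Monte Carlo error.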