4.1 Past FYE Problems

Exercise 4.1 (By Bruno, FYE 2020) A probit regression entails associating a set of covariates to a binary response, using a link function based on the cumulative distribution of the standard normal random variable, denoted by \(\Phi(\cdot)\). For i.i.d. samples \(y_1,\cdots,y_n\) of a Bernoulli random variable with probability of success \(p_i\), let \(p_i=\Phi(\mathbf{x}^{\prime}_i\boldsymbol{\beta})\). Here, \(\mathbf{x}_i\) is a \(k\)-dimensional set of covariates for the \(i\)th observation, and \(\boldsymbol{\beta}\) is a \(k\)-dimensional vector of coefficients. Thus, the likelihood for \(\boldsymbol{\beta}\) can be obtained from the distribution \[\begin{equation} f(y_i|\boldsymbol{\beta},\mathbf{x}_i)=\Phi(\mathbf{x}^{\prime}_i\boldsymbol{\beta})^{y_i}(1-\Phi(\mathbf{x}^{\prime}_i\boldsymbol{\beta}))^{1-y_i} \tag{4.1} \end{equation}\]

  • (i). (30%) Introduce the latent variables \(z_1,\cdots,z_n\), \(z_i\stackrel{ind.}{\sim}N(\mathbf{x}^{\prime}_i\boldsymbol{\beta},1)\), such that \[\begin{equation} y_i=\left\{\begin{aligned} & 1 & \quad z_i>0\\ & 0 & \quad z_i\leq0 \end{aligned} \right. \tag{4.2} \end{equation}\] to augment the model. Show that the augmented model is equivalent to the original model.

  • (ii). (35%) Assuming that \(p(\boldsymbol{\beta})\propto 1\), obtain and identify the full conditionals of each \(z_i\) and \(\boldsymbol{\beta}\).

  • (iii). (35%) Suppose that the distribution of \(z_i\) is changed to \(t_{\alpha}(\mathbf{x}^{\prime}_i\boldsymbol{\beta},1)\). Use the representation \[\begin{equation} t_{\alpha}(\mu,\sigma^2)=\int_{0}^{\infty}N(\omega|\mu,\sigma^2/\lambda)Ga(\lambda|\alpha/2,\alpha/2)d\lambda \tag{4.3} \end{equation}\] where \(Ga(\lambda|a,b)\) is a gamma distribution with shape parameter \(a\) and rate parameter \(b\) such that \(E(\lambda)=a/b\), to introduce an appropriate second set of latent variables. Obtain the full conditional distributions in this case.

Proof. (i). Under the original model, we have \[\begin{equation} \begin{split} &f(y_i=1|\boldsymbol{\beta},\mathbf{x}_i)=\Phi(\mathbf{x}^{\prime}_i\boldsymbol{\beta})\\ &f(y_i=0|\boldsymbol{\beta},\mathbf{x}_i)=1-\Phi(\mathbf{x}^{\prime}_i\boldsymbol{\beta})\\ \end{split} \tag{4.4} \end{equation}\]

Now under the augmented model, \[\begin{equation} f(y_i=1,z_i|\boldsymbol{\beta},\mathbf{x}_i)=f(y_i=1|z_i)f(z_i|\boldsymbol{\beta},\mathbf{x}_i)=\mathbf{1}_{z_i>0}N(z_i|\mathbf{x}^{\prime}_i\boldsymbol{\beta},1) \tag{4.5} \end{equation}\] Integrating out \(z_i\), we have \[\begin{equation} f(y_i=1|\boldsymbol{\beta},\mathbf{x}_i)=\int_{z_i>0}N(z_i|\mathbf{x}^{\prime}_i\boldsymbol{\beta},1)dz_i=\Phi(\mathbf{x}^{\prime}_i\boldsymbol{\beta}) \tag{4.6} \end{equation}\] Similarly, we can obtain \(f(y_i=0|\boldsymbol{\beta},\mathbf{x}_i)=1-\Phi(\mathbf{x}^{\prime}_i\boldsymbol{\beta})\). Since both models imply (4.4) for \(i=1,\cdots,n\) and the observations are independent, the two models are equivalent.

(ii). The posterior distribution is \[\begin{equation} f(\boldsymbol{\beta},\mathbf{z}|data)\propto f(y|z)f(z|\mathbf{x},\boldsymbol{\beta})f(\boldsymbol{\beta})\propto\prod_{i=1}^n (\exp(-\frac{(z_i-\mathbf{x}_i^T\boldsymbol{\beta})^2}{2})\mathbf{1}_{z_i>0})^{y_i}(\exp(-\frac{(z_i-\mathbf{x}_i^T\boldsymbol{\beta})^2}{2})\mathbf{1}_{z_i\leq 0})^{1-y_i} \tag{4.7} \end{equation}\]

Thus, we get the full conditionals for \(z_i\) as \[\begin{equation} f(z_i|\boldsymbol{\beta},\mathbf{x}_i,y_i=1)\propto\exp(-\frac{(z_i-\mathbf{x}_i^T\boldsymbol{\beta})^2}{2})\mathbf{1}_{z_i>0} \tag{4.8} \end{equation}\] which is a truncated normal distribution, with mean \(\mathbf{x}_i^T\boldsymbol{\beta}\), variance 1 and truncated at \(z_i>0\).

Similarly, \(f(z_i|\boldsymbol{\beta},\mathbf{x}_i,y_i=0)\) is also a truncated normal distribution, with mean \(\mathbf{x}_i^T\boldsymbol{\beta}\), variance 1 and truncated at \(z_i\leq 0\).

As for \(\boldsymbol{\beta}\), \[\begin{equation} f(\boldsymbol{\beta}|\cdots)\propto\exp(-\frac{\sum_{i=1}^n(z_i-\mathbf{x}_i^T\boldsymbol{\beta})^2}{2}) \tag{4.9} \end{equation}\] Thus, the full conditional distribution of \(\boldsymbol{\beta}\) is \(N((\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Z},(\mathbf{X}^T\mathbf{X})^{-1})\), where \(\mathbf{X}\) is the design matrix with rows \(\mathbf{x}_i^T\) and \(\mathbf{Z}=(z_1,\cdots,z_n)^T\).
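These two full conditionals are all that is needed for a Gibbs sampler in the spirit of Albert and Chib (1993). Below is a minimal sketch in Python/NumPy, assuming a design matrix `X` and a binary response vector `y` are available; the function name and default settings are illustrative, not part of the problem.

```python
import numpy as np
from scipy.stats import truncnorm

def probit_gibbs(y, X, n_iter=2000, rng=None):
    """Gibbs sampler for the probit model with a flat prior on beta (sketch)."""
    rng = np.random.default_rng() if rng is None else rng
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)            # (X'X)^{-1}, posterior covariance of beta
    beta = np.zeros(k)
    draws = np.empty((n_iter, k))
    for t in range(n_iter):
        mu = X @ beta
        # z_i | beta, y_i: N(x_i'beta, 1), truncated to (0, inf) if y_i = 1, (-inf, 0] if y_i = 0
        lower = np.where(y == 1, -mu, -np.inf)   # bounds standardized around x_i'beta
        upper = np.where(y == 1, np.inf, -mu)
        z = mu + truncnorm.rvs(lower, upper, size=n, random_state=rng)
        # beta | z: N((X'X)^{-1} X'z, (X'X)^{-1})
        beta = rng.multivariate_normal(XtX_inv @ (X.T @ z), XtX_inv)
        draws[t] = beta
    return draws
```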

(iii). Introducing one latent scale \(\lambda_i\) per observation, the model can be written hierarchically as: \[\begin{equation} \begin{split} &y_i=\left\{\begin{aligned} & 1 & \quad z_i>0\\ & 0 & \quad z_i\leq0 \end{aligned} \right.\\ &z_i|\lambda_i\stackrel{ind.}{\sim} N(\mathbf{x}_i^T\boldsymbol{\beta},\frac{1}{\lambda_i})\\ &\lambda_i\stackrel{i.i.d.}{\sim} Gamma(\frac{\alpha}{2},\frac{\alpha}{2}) \end{split} \tag{4.10} \end{equation}\]

By the representation (4.3), marginalizing over \(\lambda_i\) recovers \(z_i\stackrel{ind.}{\sim}t_{\alpha}(\mathbf{x}_i^T\boldsymbol{\beta},1)\), so this hierarchical model is equivalent to the stated one. Thus, the posterior distribution is
\[\begin{equation} f(\mathbf{z},\boldsymbol{\beta},\boldsymbol{\lambda}|data)\propto \prod_{i=1}^n \lambda_i^{\frac{1}{2}}\exp(-\frac{\lambda_i(z_i-\mathbf{x}_i^T\boldsymbol{\beta})^2}{2})(\mathbf{1}_{z_i>0})^{y_i}(\mathbf{1}_{z_i\leq 0})^{1-y_i}\lambda_i^{\frac{\alpha}{2}-1}e^{-\frac{\alpha}{2}\lambda_i} \tag{4.11} \end{equation}\]

Therefore, we have the full conditionals. Firstly, \(f(z_i|\cdots)\propto \exp(-\frac{\lambda_i(z_i-\mathbf{x}_i^T\boldsymbol{\beta})^2}{2})\mathbf{1}_{z_i>0}\) if \(y_i=1\). It is a truncated normal distribution with mean \(\mathbf{x}_i^T\boldsymbol{\beta}\), variance \(\frac{1}{\lambda_i}\) and truncated at \(z_i>0\). If \(y_i=0\), similarly, the full conditional of \(z_i\) is a truncated normal distribution with mean \(\mathbf{x}_i^T\boldsymbol{\beta}\), variance \(\frac{1}{\lambda_i}\) and truncated at \(z_i\leq 0\).

Secondly, for \(\boldsymbol{\beta}\), we have \[\begin{equation} f(\boldsymbol{\beta}|\cdots)\propto\exp(-\frac{\sum_{i=1}^n\lambda_i(z_i-\mathbf{x}_i^T\boldsymbol{\beta})^2}{2}) \tag{4.12} \end{equation}\] Thus, the full conditional of \(\boldsymbol{\beta}\) is \(N((\mathbf{X}^T\boldsymbol{\Lambda}\mathbf{X})^{-1}\mathbf{X}^T\boldsymbol{\Lambda}\mathbf{Z},(\mathbf{X}^T\boldsymbol{\Lambda}\mathbf{X})^{-1})\), where \(\boldsymbol{\Lambda}=diag(\lambda_1,\cdots,\lambda_n)\).

Finally, for each \(\lambda_i\), \[\begin{equation} f(\lambda_i|\cdots)\propto\lambda_i^{\frac{\alpha+1}{2}-1}e^{-(\frac{\alpha}{2}+\frac{(z_i-\mathbf{x}_i^T\boldsymbol{\beta})^2}{2})\lambda_i} \tag{4.13} \end{equation}\] so the full conditional of \(\lambda_i\) is \(Gamma(\frac{\alpha+1}{2},\frac{\alpha}{2}+\frac{(z_i-\mathbf{x}_i^T\boldsymbol{\beta})^2}{2})\).
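Relative to the probit sampler sketched above, the only changes are that each \(z_i\) is drawn from a truncated normal with variance \(1/\lambda_i\), the update for \(\boldsymbol{\beta}\) becomes a weighted least-squares draw, and each \(\lambda_i\) receives a Gamma update. A sketch of one scan, with illustrative names and assuming the same `X` and `y` as before:

```python
import numpy as np
from scipy.stats import truncnorm

def t_link_gibbs_step(y, X, beta, lam, alpha, rng):
    """One Gibbs scan for the t-link model (sketch); lam holds lambda_1, ..., lambda_n."""
    n, k = X.shape
    mu = X @ beta
    sd = 1.0 / np.sqrt(lam)
    # z_i | ... : N(x_i'beta, 1/lambda_i), truncated by the sign constraint implied by y_i
    lower = np.where(y == 1, -mu / sd, -np.inf)
    upper = np.where(y == 1, np.inf, -mu / sd)
    z = mu + sd * truncnorm.rvs(lower, upper, size=n, random_state=rng)
    # beta | ... : N((X'Lambda X)^{-1} X'Lambda z, (X'Lambda X)^{-1}), Lambda = diag(lambda_i)
    XtLX_inv = np.linalg.inv(X.T @ (lam[:, None] * X))
    beta = rng.multivariate_normal(XtLX_inv @ (X.T @ (lam * z)), XtLX_inv)
    # lambda_i | ... : Gamma((alpha+1)/2, (alpha + (z_i - x_i'beta)^2)/2); numpy uses shape/scale
    lam = rng.gamma((alpha + 1) / 2, 2.0 / (alpha + (z - X @ beta) ** 2))
    return beta, lam, z
```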

Exercise 4.2 (By Bruno, FYE 2018) To obtain a robust linear regression model, we consider an error distribution given by a double exponential, instead of a normal. As for the Student’s t-distribution case, it is possible to use the fact that the double exponential is a scale mixture of normals to obtain an efficient Gibbs sampler for exploring the posterior distribution of the model parameters. More specifically, consider the model \[\begin{equation} \mathbf{Y}=\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\epsilon};\quad\mathbf{Y},\boldsymbol{\epsilon}\in\mathbb{R}^n,\quad \mathbf{X}\in\mathbb{R}^{n\times p},\quad \boldsymbol{\beta}\in\mathbb{R}^p \tag{4.14} \end{equation}\] where \[\begin{equation} p(\boldsymbol{\epsilon}|\sigma^2)=\prod_{i=1}^n\frac{1}{2\sqrt{\sigma^2}}e^{-|\epsilon_i|/\sqrt{\sigma^2}} \tag{4.15} \end{equation}\] Consider a non-informative prior \(p(\boldsymbol{\beta},\sigma^2)\propto 1/\sigma^2\).

  1. (20%) Use the fact that \[\begin{equation} \frac{1}{2\sqrt{\sigma^2}}e^{-|\epsilon_i|/\sqrt{\sigma^2}}=\int_0^{\infty}(\frac{1}{\sqrt{2\pi}\sqrt{s^2}}e^{-\epsilon_i^2/(2s^2)})(\frac{1}{2\sigma^2}e^{-s^2/(2\sigma^2)})ds^2 \tag{4.16} \end{equation}\] to write the likelihood as a scale mixture of normals.

  2. (35%) Introduce latent variables in order to write the model as a hierarchical normal linear model.

  3. (45%) Obtain and identify the posterior full conditional distributions for all the parameters of the hierarchical model.

Exercise 4.3 (By Raquel, FYE 2015) Count data with an overabundance of zeros can be modeled through the zero-inflated Poisson model (ZIP). The p.m.f. of the zero-inflated Poisson distribution, \(ZIP(\lambda,\pi)\), is given by \[\begin{equation} f(y|\lambda,\pi)=\pi I(y=0)+(1-\pi)f_0(y|\lambda),\quad y=0,1,2,\cdots, \tag{4.17} \end{equation}\] where \(0\leq\pi\leq 1\). \(I(\cdot)\) is the indicator function, and \(f_0(y|\lambda)=\lambda^ye^{-\lambda}/y!\), with \(\lambda>0\).

  1. (10%) Let \(Y\sim ZIP(\lambda,\pi)\). Find \(Pr(Y=0|\lambda,\pi)\), and \(Pr(Y=y|\lambda,\pi)\) for \(y\neq 0\).

  2. (90%) Given \(\lambda\) and \(\pi\), let \(Y_1,\cdots,Y_n\) be i.i.d. \(ZIP(\lambda,\pi)\), with priors \(\lambda\sim Gamma(a,b)\) and \(\pi\sim Beta(c,d)\). Assume this model is used to describe the number of cigarettes smoked by the respondents of a health survey. More specifically, suppose that out of 10 respondents, the numbers of cigarettes smoked are: \[\begin{equation} y_1=0,\,y_2=5,\,y_3=0,\,y_4=0,\,y_5=10,\,y_6=0,\,y_7=0,\,y_8=0,\,y_9=3,\,y_{10}=0. \end{equation}\]

  • (a). (15%) Write down \(p(\lambda,\pi|y_1,\cdots,y_{10})\) up to a proportionality constant.

  • (b). (15%) Find \(p(\lambda|\pi,y_1,\cdots,y_{10})\) and \(p(\pi|y_1,\cdots,y_{10})\) (up to a proportionality constant in each case). Are these distributions available in closed form?

  • (c). (60%) Introduce auxiliary variables \(Z_i\) such that \(Y_i=0\) if \(Z_i=1\), and \(Y_i\sim Poisson(\lambda)\) if \(Z_i=0\), with \(Z_i\sim Bernoulli(\pi)\). Derive an MCMC algorithm to sample from \(p(\lambda,\pi,Z_1,\cdots,Z_{10}|y_1,\cdots,y_{10})\). Provide the details of the steps in your algorithm, i.e. for Gibbs steps, provide the full conditional distributions, for Metropolis-Hastings steps, provide the proposal distributions and the acceptance probabilities.

Proof. 1. \[\begin{equation} Pr(Y=y|\lambda,\pi)=\left\{\begin{aligned} &\pi+(1-\pi)e^{-\lambda} & \quad y=0\\ & (1-\pi)\frac{e^{-\lambda}\lambda^y}{y!} & \quad y>0 \end{aligned} \right. \tag{4.18} \end{equation}\]

  2. (a). \[\begin{equation} \begin{split} p(\lambda,\pi|y_1,\cdots,y_{10})&\propto [\pi+(1-\pi)e^{-\lambda}]^7[(1-\pi)e^{-\lambda}]^3\lambda^{18}\lambda^{a-1}e^{-b\lambda}\pi^{c-1}(1-\pi)^{d-1}\\ &\propto [\pi+(1-\pi)e^{-\lambda}]^7(1-\pi)^{3+d-1}\pi^{c-1}e^{-\lambda(3+b)}\lambda^{18+a-1} \end{split} \tag{4.19} \end{equation}\]

(b).
\[\begin{equation} p(\lambda|\pi,y_1,\cdots,y_{10})\propto [\pi+(1-\pi)e^{-\lambda}]^7e^{-\lambda(3+b)}\lambda^{18+a-1} \tag{4.20} \end{equation}\] and \[\begin{equation} p(\pi|\lambda,y_1,\cdots,y_{10})\propto [\pi+(1-\pi)e^{-\lambda}]^7(1-\pi)^{3+d-1}\pi^{c-1} \tag{4.21} \end{equation}\] Neither distribution is available in closed form.

(c).
\[\begin{equation} \begin{split} p(\lambda,\pi,Z_1,\cdots,Z_{10}|y_1,\cdots,y_{10})&\propto \prod_{i=1}^{10}[\pi I(y_i=0)]^{Z_i}[(1-\pi)Pr(Y_i=y_i|Z_i=0)]^{1-Z_i}\\ &\times\lambda^{a-1}e^{-b\lambda}\pi^{c-1}(1-\pi)^{d-1}\\ &\propto \prod_{i=1}^{10}[\pi I(y_i=0)]^{Z_i}[(1-\pi)\frac{\lambda^{y_i}}{y_i!}e^{-\lambda}]^{1-Z_i}\lambda^{a-1}e^{-b\lambda}\pi^{c-1}(1-\pi)^{d-1} \end{split} \tag{4.22} \end{equation}\] Then, the full conditionals are as follows:

  • For \(i=1,\cdots,10\), \(Z_i\) is drawn from \(Pr(Z_i|\lambda,\pi,y_1,\cdots,y_{10})\). If \(y_i>0\), then \(Pr(Z_i=0|\lambda,\pi,y_1,\cdots,y_{10})=1\). If \(y_i=0\), \[\begin{equation} Pr(Z_i=1|\lambda,\pi,y_1,\cdots,y_{10})=\frac{\pi}{\pi+(1-\pi)e^{-\lambda}} \tag{4.23} \end{equation}\] and \[\begin{equation} Pr(Z_i=0|\lambda,\pi,y_1,\cdots,y_{10})=\frac{(1-\pi)e^{-\lambda}}{\pi+(1-\pi)e^{-\lambda}} \tag{4.24} \end{equation}\]

  • \((\lambda|\pi,Z_1,\cdots,Z_{10},y_1,\cdots,y_{10})\) is drawn from a Gamma distribution, i.e. \[\begin{equation} \begin{split} p(\lambda|\pi,Z_1,\cdots,Z_{10},y_1,\cdots,y_{10})&\propto \prod_{i,Z_i=0}(e^{-\lambda}\lambda^{y_i})\,e^{-b\lambda}\lambda^{a-1}\\ &\propto e^{-\lambda(n_0+b)}\lambda^{\sum_{i,Z_i=0}y_i+a-1} \end{split} \tag{4.25} \end{equation}\] which corresponds to a \(Gamma(\sum_{i,Z_i=0}y_i+a,\,n_0+b)\), with \(n_0\) the number of \(Z_i=0\).

  • \((\pi|\lambda,Z_1,\cdots,Z_{10},y_1,\cdots,y_{10})\) is drawn from a Beta distribution, i.e. \[\begin{equation} p(\pi|\lambda,Z_1,\cdots,Z_{10},y_1,\cdots,y_{10})\propto \pi^{\sum_{i=1}^{10}Z_i+c-1}(1-\pi)^{10-\sum_{i=1}^{10}Z_i+d-1} \tag{4.26} \end{equation}\] which corresponds to a \(Beta(\sum_{i=1}^{10}Z_i+c,10-\sum_{i=1}^{10}Z_i+d)\).
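Putting the three full conditionals together gives the Gibbs sampler below, a minimal sketch in Python/NumPy. The function name, the default number of iterations, and the example hyperparameter values \(a=b=c=d=1\) are illustrative choices, not part of the problem.

```python
import numpy as np

def zip_gibbs(y, a, b, c, d, n_iter=5000, seed=0):
    """Gibbs sampler for the zero-inflated Poisson model (sketch)."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    n = len(y)
    lam, pi = 1.0, 0.5                            # arbitrary starting values
    out = np.empty((n_iter, 2))
    for t in range(n_iter):
        # Z_i | ... : Bernoulli; Z_i is forced to 0 whenever y_i > 0
        p1 = np.where(y == 0, pi / (pi + (1 - pi) * np.exp(-lam)), 0.0)
        Z = rng.binomial(1, p1)
        n0 = np.sum(Z == 0)
        # lambda | ... : Gamma(sum_{Z_i=0} y_i + a, n0 + b)   (shape, rate)
        lam = rng.gamma(y[Z == 0].sum() + a, 1.0 / (n0 + b))
        # pi | ... : Beta(sum Z_i + c, n - sum Z_i + d)
        pi = rng.beta(Z.sum() + c, n - Z.sum() + d)
        out[t] = lam, pi
    return out

# e.g. zip_gibbs([0, 5, 0, 0, 10, 0, 0, 0, 3, 0], a=1, b=1, c=1, d=1)
```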

Exercise 4.4 (By Raquel, FYE 2015 Retake) Consider the following model for \(i=1,\cdots,n\): \[\begin{equation} \begin{split} & Y_i=\left\{\begin{aligned} & Z_i & Z_i<200 \\ & 200 & Z_i\geq 200 \end{aligned}\right.\\ & Z_i|X_i=\alpha+\beta X_i +\epsilon_i, \quad \epsilon_i\sim N(0,\sigma^2) \end{split} \tag{4.27} \end{equation}\] with \(n\) fixed, \(X_i\) known for all \(i\), and the \(\epsilon_i\), \(i=1,\cdots,n\), independent. In addition, assume the prior: \(p(\alpha,\beta,\sigma^2)\propto1/\sigma^2\).

  1. (35%) Write down \(p(\Theta|X,Y)\) up to a proportionality constant, where \(\Theta=[\alpha,\beta,\sigma^2]\), \(X=[X_1,\cdots,X_n]\) and \(Y=[Y_1,\cdots,Y_n]\).

  2. (65%) Now augment your parameter space by considering \(\Theta^*=[\Theta,Z_1,\cdots,Z_n]\). Develop a Gibbs sampling scheme to obtain draws from the joint posterior distribution of \((\Theta^*|X,Y)\).

Proof. 1. The posterior density is given by \[\begin{equation} \begin{split} p(\Theta|X,Y)&\propto\frac{1}{\sigma^2}\prod_{i,Y_i=200}\Phi(\frac{\alpha+\beta X_i-200}{\sigma})\prod_{i,Y_i\neq 200}\frac{1}{\sigma}\exp(-\frac{(Y_i-\alpha-\beta X_i)^2}{2\sigma^2})\\ &\propto(\sigma^2)^{-(n_1/2+1)}\exp(-\frac{\sum_{i,Y_i\neq 200}(Y_i-\alpha-\beta X_i)^2}{2\sigma^2})\prod_{i,Y_i=200}\Phi(\frac{\alpha+\beta X_i-200}{\sigma}) \end{split} \tag{4.28} \end{equation}\] where \(n_1\) denotes the number of observations with \(Y_i\neq 200\), and \(\Phi(\cdot)\) is the c.d.f. of the standard normal distribution.

  2. Gibbs sampling scheme:
  • Sampling \((\alpha,\beta)\) from \(N((\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Z},\sigma^2(\mathbf{X}^T\mathbf{X})^{-1})\) with \[\begin{equation} \mathbf{X}=\begin{pmatrix} 1 & X_1 \\ \vdots & \vdots \\ 1 & X_n \end{pmatrix}, \quad \mathbf{Z}=\begin{pmatrix} Z_1 \\ \vdots \\ Z_n \end{pmatrix} \tag{4.29} \end{equation}\]

  • Sampling \(\sigma^2\) from \(IG(n/2,\sum_{i=1}^n(Z_i-\alpha-\beta X_i)^2/2)\).

  • For \(i=1,\cdots,n\), sampling \(Z_i\) by \[\begin{equation} (Z_i|\alpha,\beta,X,Y,\sigma^2)=\left\{\begin{aligned} & Y_i & \quad Y_i\neq 200 \\ & TN(\alpha+\beta X_i,\sigma^2|200,\infty) & \quad Y_i=200 \end{aligned}\right. \tag{4.30} \end{equation}\]
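A compact sketch of this scheme in Python/NumPy follows; the vectors `x` and `y` (with censored responses recorded as 200) and the function name are illustrative assumptions.

```python
import numpy as np
from scipy.stats import truncnorm, invgamma

def censored_gibbs(x, y, n_iter=2000, cens=200.0, seed=0):
    """Gibbs sampler for the right-censored regression model (sketch)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    X = np.column_stack([np.ones(n), x])          # design matrix with intercept
    XtX_inv = np.linalg.inv(X.T @ X)
    z = y.copy()
    sig2 = 1.0
    cens_idx = (y == cens)
    draws = np.empty((n_iter, 3))                 # columns: alpha, beta, sigma^2
    for t in range(n_iter):
        # (alpha, beta) | z, sigma^2 : N((X'X)^{-1} X'z, sigma^2 (X'X)^{-1})
        alpha, beta = rng.multivariate_normal(XtX_inv @ (X.T @ z), sig2 * XtX_inv)
        # sigma^2 | z, alpha, beta : IG(n/2, sum_i (z_i - alpha - beta x_i)^2 / 2)
        resid = z - alpha - beta * x
        sig2 = invgamma.rvs(n / 2, scale=np.sum(resid**2) / 2, random_state=rng)
        # z_i | ... for censored cases: N(alpha + beta x_i, sigma^2) truncated to [200, inf)
        mu_c = alpha + beta * x[cens_idx]
        sd = np.sqrt(sig2)
        a = (cens - mu_c) / sd                    # standardized lower bound
        z[cens_idx] = mu_c + sd * truncnorm.rvs(a, np.inf, size=cens_idx.sum(), random_state=rng)
        draws[t] = alpha, beta, sig2
    return draws
```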

Exercise 4.5 (By Bruno, FYE 2014) Consider the linear model \[\begin{equation} \mathbf{y}=\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\epsilon},\quad \boldsymbol{\epsilon}\sim N_n(0,(1/\phi)\mathbf{I}) \tag{4.31} \end{equation}\] where \(\mathbf{y}\in\mathbb{R}^n\), \(\mathbf{X}\) is a full rank matrix of dimensions \(n\times p\), \(\boldsymbol{\beta}\in\mathbb{R}^p\), \(\mathbf{I}\) is a \(n\times n\) identity matrix and \(\phi>0\). Consider the prior \(p(\boldsymbol{\beta}|\phi)=\prod_ip(\beta_i|\phi)\) where \[\begin{equation} p(\beta_i|\phi)=(1+\frac{\phi\beta_i^2}{\nu\lambda})^{-(\nu+1)/2}\frac{\phi}{\lambda} \tag{4.32} \end{equation}\] where \(\nu\) and \(\lambda\) are known, and assume that \(p(\phi)\propto 1/\phi\).

  1. (30%) Write \(p(\beta_i|\phi)\) as a scale mixture of normals.

  2. (20%) Use the above representation to introduce latent variables that facilitate sampling the posterior distribution of all parameters using Gibbs sampling. Write the resulting model.

  3. (50%) Obtain the full conditionals for all model parameters.

Some information that may be useful:

  • Assuming the matrices and vector below are of appropriate dimensions (and also that matrix \(A+B\) is invertible), we have \[\begin{equation} \begin{split} (x-&a)^TA(x-a)+(x-b)^TB(x-b)\\ &=(x-c)^T(A+B)(x-c)+(a-b)^TA(A+B)^{-1}B(a-b) \end{split} \tag{4.33} \end{equation}\] where \(c=(A+B)^{-1}(Aa+Bb)\).

  • \[\begin{equation} \int_{0}^{\infty}\frac{e^{-\beta/x}}{x^{\alpha+1}}\frac{\beta^{\alpha}}{\Gamma(\alpha)}dx=1 \tag{4.34} \end{equation}\]

Proof. 1. By the integral identity (4.34), \[\begin{equation} p(\beta_i|\phi)=(1+\frac{\phi\beta_i^2}{\nu\lambda})^{-(\nu+1)/2}\frac{\phi}{\lambda}\propto\int_0^{\infty}\frac{\exp\{-\frac{1}{\eta_i}(\frac{\phi\beta_i^2}{2\lambda}+\frac{\nu}{2})\}}{\eta_i^{(\nu+1)/2+1}}d\eta_i \tag{4.35} \end{equation}\] Thus, \[\begin{equation} p(\beta_i|\phi)=\int_0^{\infty}N(\beta_i|0,\lambda\eta_i/\phi)IG(\eta_i|\nu/2,\nu/2)d\eta_i \tag{4.36} \end{equation}\] where \(IG(\cdot|\nu/2,\nu/2)\) denotes the inverse gamma density.

  2. \[\begin{equation} \begin{split} &\mathbf{y}=\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\epsilon},\quad \boldsymbol{\epsilon}\sim N_n(0,(1/\phi)\mathbf{I})\\ &p(\beta_i|\phi,\eta_i)=N(\beta_i|0,\lambda\eta_i/\phi),\quad i=1,\cdots,p\\ &p(\eta_i)=IG(\eta_i|\nu/2,\nu/2),\quad i=1,\cdots,p \end{split} \end{equation}\] and \(p(\phi)\propto 1/\phi\).

  3. \[\begin{equation} p(\boldsymbol{\beta}|\cdots)=N(\mathbf{A}^{-1}\mathbf{X}^T\mathbf{y},\frac{1}{\phi}\mathbf{A}^{-1}),\quad \mathbf{A}=\mathbf{X}^T\mathbf{X}+\frac{1}{\lambda}\mathbf{D}_{\eta}^{-1},\quad \mathbf{D}_{\eta}=diag(\eta_1,\cdots,\eta_p) \tag{4.37} \end{equation}\] \[\begin{equation} \begin{split} &p(\eta_i|\cdots)=IG((\nu+1)/2,(\nu+\phi\beta_i^2/\lambda)/2),\quad i=1,\cdots,p\\ &p(\phi|\cdots)=Gamma((n+p)/2,(||\mathbf{y}-\mathbf{X}\boldsymbol{\beta}||^2+\boldsymbol{\beta}^T\mathbf{D}_{\eta}^{-1}\boldsymbol{\beta}/\lambda)/2) \end{split} \tag{4.38} \end{equation}\]
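A sketch of one Gibbs scan based on these full conditionals, in Python/NumPy; the names `X`, `y`, `nu`, and `lam0` (for the known constant \(\lambda\)) are illustrative assumptions.

```python
import numpy as np

def t_prior_gibbs_step(y, X, beta, eta, phi, nu, lam0, rng):
    """One Gibbs scan for the regression model with t priors on the coefficients (sketch)."""
    n, p = X.shape
    # beta | ... : N(A^{-1} X'y, (1/phi) A^{-1}),  A = X'X + D_eta^{-1}/lam0
    A_inv = np.linalg.inv(X.T @ X + np.diag(1.0 / (lam0 * eta)))
    beta = rng.multivariate_normal(A_inv @ (X.T @ y), A_inv / phi)
    # eta_i | ... : IG((nu+1)/2, (nu + phi*beta_i^2/lam0)/2); drawn as the reciprocal of a Gamma
    rate = (nu + phi * beta**2 / lam0) / 2
    eta = 1.0 / rng.gamma((nu + 1) / 2, 1.0 / rate)
    # phi | ... : Gamma((n+p)/2, (||y - X beta||^2 + beta' D_eta^{-1} beta / lam0)/2)
    rss = np.sum((y - X @ beta) ** 2)
    quad = np.sum(beta**2 / eta) / lam0
    phi = rng.gamma((n + p) / 2, 2.0 / (rss + quad))
    return beta, eta, phi
```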

Exercise 4.6 (By Bruno, FYE 2014 Retake) Consider a set of observations \(x_1,\cdots,x_n,x_{n+1}^*,\cdots,x_{n+k}^*\). Ignore for now the asterisks, and assume that they are independently distributed according to \(Exp(\lambda_i)\), for \(i=1,\cdots,n+k\), where \(\lambda_i^{-1}\) is the mean of the exponential distribution for the \(i\)th observation. Consider the prior \[\begin{equation} p(\lambda_1,\cdots,\lambda_{n+k}|\beta)=\prod_{i=1}^{n+k}p(\lambda_i|\beta)=\prod_{i=1}^{n+k}Gamma(\lambda_i|\alpha,\beta) \tag{4.39} \end{equation}\] where \(\alpha\) is a known constant, and \(p(\beta)=Gamma(\beta|a,b)\), where \(a\) and \(b\) are both known. Here, \(Gamma(x|A,B)\) denotes the density of a Gamma distribution with shape parameter \(A>0\) and rate parameter \(B>0\), that is, \(\{B^Ax^{A-1}\exp(-Bx)\}/\Gamma(A)\), for \(x>0\). (Recall that the exponential density arises as the special case with \(A=1\).)

  1. (20%) Find the unconditional prior density \(p(\lambda_1,\cdots, \lambda_{n+k})\). Are the \(\lambda_{i}\) independent?

  2. (20%) Find the marginal posterior distribution of \(\beta\) given the sample.

Next, suppose that the observations with an asterisk are right censored. That is, the true value is missing, but we know that it is larger than the value recorded with an asterisk.

  3. (15%) Use the missing value approach to obtain the likelihood that accounts for the censoring.

  4. (15%) Find the marginal posterior distribution of \(\beta\) in this case. Is there any difference with respect to the case where the censoring is ignored?

  5. (30%) Introduce latent variables to obtain a Gibbs sampler, in order to explore the posterior distribution of all model parameters. Write, explicitly, all the full conditionals, and identify them.

Proof. 1. \[\begin{equation} \begin{split} p(\lambda_1,\cdots,\lambda_{n+k})&\propto\int_0^{\infty}\prod_i\lambda_i^{\alpha-1}e^{-\lambda_i\beta}\beta^{(n+k)\alpha+a-1}e^{-b\beta}d\beta\\ &=\prod_{i}\lambda_i^{\alpha-1}\int_0^{\infty}\beta^{(n+k)\alpha+a-1}e^{-\beta((n+k)\bar{\lambda}+b)}d\beta\\ &\propto\prod_{i}\lambda_i^{\alpha-1}((n+k)\bar{\lambda}+b)^{-((n+k)\alpha+a)} \end{split} \tag{4.40} \end{equation}\] where \(\bar{\lambda}=\frac{1}{n+k}\sum_i\lambda_i\). There is no independence, due to the \(\bar{\lambda}\) term. But there is exchangeability, as the ordering of the \(\lambda_i\) does not affect the value of the density.

  2. The posterior distribution is proportional to \[\begin{equation} \prod_{i}\lambda_ie^{-\lambda_ix_i}\lambda_i^{\alpha-1}e^{-\lambda_i\beta}\beta^{(n+k)\alpha+a-1}e^{-b\beta} \tag{4.41} \end{equation}\] Integrating out all the \(\lambda_i\), we have
    \[\begin{equation} \begin{split} p(\beta|\mathbf{x},\mathbf{x}^*)&\propto\beta^{(n+k)\alpha+a-1}e^{-b\beta}\prod_{i}\int_0^{\infty}\lambda_i^{\alpha+1-1}e^{-\lambda_i(x_i+\beta)}d\lambda_i\\ &\propto\beta^{(n+k)\alpha+a-1}e^{-b\beta}\prod_{i=1}^{n+k}(x_i+\beta)^{-(\alpha+1)} \end{split} \tag{4.42} \end{equation}\]

  3. We can assume that the censored observations correspond to missing values \(\xi_j\), \(j=1,\cdots,k\), so that the full likelihood of the \(x_i\) and \(\xi_j\) is given by \[\begin{equation} \prod_{i=1}^n\lambda_ie^{-\lambda_ix_i}\prod_{j=1}^k\lambda_{n+j}e^{-\lambda_{n+j}\xi_j} \tag{4.43} \end{equation}\] The likelihood that accounts for the censoring is obtained by integrating out each \(\xi_j\) over \((x_{n+j}^*,\infty)\): \[\begin{equation} \begin{split} p(\mathbf{x},\mathbf{x}^*|\lambda_1,\cdots,\lambda_{n+k})&\propto\prod_{i=1}^n\lambda_ie^{-\lambda_ix_i}\prod_{j=1}^k\int_{x_{n+j}^*}^{\infty}\lambda_{n+j}e^{-\lambda_{n+j}\xi_j}d\xi_j\\ &=\prod_{i=1}^n\lambda_ie^{-\lambda_ix_i}\prod_{j=1}^ke^{-\lambda_{n+j}x^*_{n+j}} \end{split} \tag{4.44} \end{equation}\]

  4. Using the censored-data likelihood from the previous part, we obtain the joint posterior for all parameters as
    \[\begin{equation} \begin{split} p(\lambda_1,&\cdots,\lambda_{n+k},\beta|\mathbf{x},\mathbf{x}^*)\propto\beta^{(n+k)\alpha+a-1}e^{-b\beta}\prod_{i=1}^n\lambda_ie^{-\lambda_ix_i}\lambda_i^{\alpha-1}e^{-\lambda_i\beta}\\ &\times \prod_{j=1}^k\lambda^{\alpha-1}_{n+j}e^{-\lambda_{n+j}\beta}e^{-\lambda_{n+j}x^*_{n+j}}\\ &=\beta^{(n+k)\alpha+a-1}e^{-b\beta}\prod_{i=1}^n\lambda_i^{\alpha+1-1}e^{-\lambda_i(x_i+\beta)}\prod_{j=1}^k\lambda^{\alpha-1}_{n+j}e^{-\lambda_{n+j}(x^*_{n+j}+\beta)} \end{split} \tag{4.45} \end{equation}\] Integrating out all the \(\lambda_i\), we have \[\begin{equation} p(\beta|\mathbf{x},\mathbf{x}^*)\propto\beta^{(n+k)\alpha+a-1}e^{-b\beta}\prod_{i=1}^n(x_i+\beta)^{-(\alpha+1)}\prod_{j=1}^k(x_{n+j}^*+\beta)^{-\alpha} \tag{4.46} \end{equation}\] Compared to the posterior obtained by ignoring the censoring, the last \(k\) terms in the product have exponent \(-\alpha\) instead of \(-(\alpha+1)\).

  5. The latent variables \(\xi_j\) were already introduced in part 3. The full conditionals are:

  • Firstly, for \(\lambda_i\), \(i=1,\cdots,n+k\), \[\begin{equation} p(\lambda_i|\cdots)\propto\lambda_i^{\alpha+1-1}e^{-\lambda_i(z_i+\beta)} \tag{4.47} \end{equation}\] where \(z_i=x_i\) if \(i\leq n\), and \(z_{n+j}=\xi_j\) for \(j=1,\cdots,k\). This corresponds to a \(Gamma(\lambda_i|\alpha+1,z_i+\beta)\).

  • Then, for \(\beta\), \[\begin{equation} p(\beta|\cdots)\propto\beta^{(n+k)\alpha+a-1}e^{-\beta(b+(n+k)\bar{\lambda})} \tag{4.48} \end{equation}\] This corresponds to a \(Gamma(\beta|(n+k)\alpha+a,b+(n+k)\bar{\lambda})\).

  • Finally, \[\begin{equation} p(\xi_j|\cdots)\propto\lambda_{n+j}e^{-\lambda_{n+j}\xi_j} \tag{4.49} \end{equation}\] for \(\xi_j\geq x_{n+j}^*\), \(j=1,\cdots,k\). This corresponds to an exponential distribution with rate \(\lambda_{n+j}\), truncated to \(\xi_j\geq x_{n+j}^*\).
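The three sets of full conditionals translate directly into a Gibbs sampler. Below is a minimal sketch in Python/NumPy, with `x` holding the fully observed values and `xstar` the censoring thresholds; the names and default settings are illustrative assumptions.

```python
import numpy as np

def censored_exp_gibbs(x, xstar, alpha, a, b, n_iter=5000, seed=0):
    """Gibbs sampler for the exponential model with right-censored observations (sketch)."""
    rng = np.random.default_rng(seed)
    x, xstar = np.asarray(x, float), np.asarray(xstar, float)
    n, k = len(x), len(xstar)
    beta = 1.0
    xi = xstar.copy()                              # latent values for the censored observations
    out = np.empty(n_iter)
    for t in range(n_iter):
        # lambda_i | ... : Gamma(alpha + 1, z_i + beta), with z = (x_1,...,x_n, xi_1,...,xi_k)
        z = np.concatenate([x, xi])
        lam = rng.gamma(alpha + 1, 1.0 / (z + beta))
        # beta | ... : Gamma((n+k)alpha + a, b + sum_i lambda_i)
        beta = rng.gamma((n + k) * alpha + a, 1.0 / (b + lam.sum()))
        # xi_j | ... : Exp(lambda_{n+j}) truncated to [xstar_j, inf);
        # by memorylessness this is xstar_j plus a fresh exponential draw
        xi = xstar + rng.exponential(1.0 / lam[n:])
        out[t] = beta
    return out
```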
Exercise 4.7 (By Bruno, FYE 2013) Consider the linear model \[\begin{equation} \mathbf{y}=\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\epsilon},\quad \boldsymbol{\epsilon}\sim N_n(0,(1/\phi)\mathbf{I}) \tag{4.50} \end{equation}\]
where \(\mathbf{y}\in\mathbb{R}^n\), \(\mathbf{X}\) is a full rank matrix of dimensions \(n\times p\), \(\boldsymbol{\beta}\in\mathbb{R}^p\), \(\mathbf{I}\) is a \(n\times n\) identity matrix and \(\phi>0\). Consider the prior \(p(\boldsymbol{\beta}|\boldsymbol{\gamma},\phi)=\prod_ip(\beta_i|\gamma_i,\phi)\) where \[\begin{equation} p(\beta_i|\gamma_i,\phi)=(1-\gamma_i)N(\beta_i|0,1/\phi)+\gamma_iN(\beta_i|0,c_i/\phi) \tag{4.51} \end{equation}\] Here \(c_i>0\) is a known constant, and \(\gamma_i\) is a binary variable taking value 1 with probability \(\omega_i\) and 0 with probability \(1-\omega_i\). Thus, \[\begin{equation} p(\boldsymbol{\gamma})=\prod_{i=1}^p\omega_i^{\gamma_i}(1-\omega_i)^{1-\gamma_i},\quad \gamma_i=0,1,\quad \omega_i\in(0,1) \tag{4.52} \end{equation}\] To complete the model, we assume that \(p(\phi)=Gamma(\phi|a_{\phi},b_{\phi})\), and \[\begin{equation} p(\boldsymbol{\omega})=\prod_{i=1}^pBeta(\omega_i|a_{\omega},b_{\omega}) \tag{4.53} \end{equation}\] Write the full conditional distributions needed to explore the posterior distribution of the parameters in the model, using a Gibbs sampler. Hint: \[\begin{equation} \begin{split} (x-a)^TA(x-a)+&(x-b)^TB(x-b)\\ &=(x-c)^T(A+B)(x-c)+(a-b)^TA(A+B)^{-1}B(a-b) \end{split} \tag{4.54} \end{equation}\] where \(c=(A+B)^{-1}(Aa+Bb)\).

Exercise 4.8 (By Bruno, FYE 2012) Consider the linear model \[\begin{equation} \mathbf{y}=\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\epsilon},\quad \boldsymbol{\epsilon}\sim N_n(0,(1/\phi)\mathbf{V}_{\psi}) \tag{4.55} \end{equation}\]
where \(\mathbf{y}\in\mathbb{R}^n\), \(\mathbf{X}\) is a full rank matrix of dimensions \(n\times p\), \(\boldsymbol{\beta}\in\mathbb{R}^p\), and \(\mathbf{V}_{\psi}\) is an \(n\times n\) matrix that is a known function of \(\psi\).

  1. (35%) Assume that \(\psi\) is known. Find the sufficient statistics for \(\boldsymbol{\beta}\) and \(\phi\).

  2. (45%) Consider the prior \[\begin{equation} p(\boldsymbol{\beta},\phi|\psi)\propto\frac{1}{\phi}N_p(\boldsymbol{\beta}|0,(g/\phi)(\mathbf{X}^T\mathbf{V}_{\psi}^{-1}\mathbf{X})^{-1}) \tag{4.56} \end{equation}\]
    Write the posterior distribution of \(\boldsymbol{\beta}\) and \(\phi\) as \[\begin{equation} p(\boldsymbol{\beta},\phi|\mathbf{y},\mathbf{X},\psi)=p(\boldsymbol{\beta}|\phi,\mathbf{y},\mathbf{X},\psi)p(\phi|\mathbf{y},\mathbf{X},\psi) \tag{4.57} \end{equation}\] and specify the densities for each one of the two factors in this expression as a function of the sufficient statistics.

  3. (20%) Suppose \(\psi\) is unknown. Obtain the posterior \(p(\psi|\mathbf{y},\mathbf{X})\). Hint: \[\begin{equation} \begin{split} (x-a)^TA(x-a)+&(x-b)^TB(x-b)\\ &=(x-c)^T(A+B)(x-c)+(a-b)^TA(A+B)^{-1}B(a-b) \end{split} \tag{4.58} \end{equation}\]

Exercise 4.9 (By Bruno, FYE 2012 Retake) Ten water samples are taken at a given site in the Monterey Bay and temperature \(y_1\) and salinity \(y_2\) are recorded. As a result we have the sample \(\mathbf{y}_i=(y_{1,i},y_{2,i})\), for \(i=1,\cdots,10\). Assume that all \(\mathbf{y}_i\) are independent and that their distribution corresponds to a bivariate normal with mean \(\boldsymbol{\mu}\in\mathbb{R}^2\) and covariance matrix \(V\in\mathbb{R}^{2\times 2}\).

  1. (35%) Write the likelihood function for \(\boldsymbol{\mu}\) and \(V\) and find the sufficient statistics.

  2. (65%) Consider the prior \(p(\boldsymbol{\mu},V)\propto p(V)\), where \(p(V)\) is an inverse Wishart distribution with parameters \(\nu\) and \(S_0\). Thus, the prior is proportional to \[\begin{equation} |V|^{-(\nu+k+1)/2}\exp\{-\frac{1}{2}tr(S_0V^{-1})\} \tag{4.59} \end{equation}\] where \(k\) is the dimension of \(V\), in this case \(k=2\).

Assume that observation \(y_{2,2}\) is missing. We resort to a Markov chain Monte Carlo approach to obtain samples of \(\boldsymbol{\mu}\), \(V\) and the missing observation. For that purpose, write the full conditionals that correspond to the Gibbs sampler.

Some information that may be useful:

  • Recall that the trace of a product of matrices is invariant under cyclic permutations; for instance, for three matrices \(A\in\mathbb{R}^{n\times m},B\in\mathbb{R}^{m\times k}\) and \(C\in\mathbb{R}^{k\times n}\), we have \(tr(ABC)=tr(BCA)=tr(CAB)\).

  • Consider a bivariate normal vector \[\begin{equation} \begin{pmatrix} x\\ y \end{pmatrix}\sim N_2(\begin{pmatrix} \mu_x\\ \mu_y \end{pmatrix},\begin{pmatrix} v_x & v_{xy}\\ v_{yx} & v_y \end{pmatrix}) \tag{4.60} \end{equation}\] then \[\begin{equation} p(x|y)=N_1(\mu_x+\frac{v_{xy}}{v_y}(y-\mu_y),v_x-\frac{v_{xy}^2}{v_y}) \tag{4.61} \end{equation}\]

Exercise 4.10 (By Draper, FYE 2010) Consider the following random-effects hierarchical model, which is useful in meta-analysis and other applications: \[\begin{equation} \begin{split} &(\mu,\sigma^2)\sim p(\mu,\sigma^2)\\ &(\gamma_i|\mu,\sigma^2)\stackrel{i.i.d.}{\sim} N(\mu,\sigma^2)\\ &(y_i|\gamma_i)\stackrel{ind.}{\sim} N(\gamma_i,\tau_i^2) \end{split} \tag{4.62} \end{equation}\] for \(i=1,\cdots,I\), in which the \(\tau_i^2\) are assumed known.

  1. (40%) With \(\theta=(\mu,\sigma^2)\) and an appropriate choice for latent data \(z\), specify the two distributions \(p(\theta|z,y)\) and \(p(z|\theta,y)\) needed to carry out an EM algorithm to find the posterior mode of \(\theta\) given \(y=(y_1,\cdots,y_I)\), making an appropriate conditionally conjugate choice for the prior distribution on \(\theta\), and use this to specify the \(E\) and \(M\) steps of your EM algorithm.

  2. (60%) Specify a Gibbs sampler to make random draws from the augmented posterior distribution \(p(\mu,\sigma^2,z|y)\), by providing details on the full-conditional distributions \(p(\mu|\sigma^2,z,y),p(\sigma^2|\mu,z,y)\) and \(p(z|\mu,\sigma^2,y)\).

Exercise 4.11 (By Thanos, FYE 2010 (Actually a GLM Problem)) Consider a Bernoulli regression model for a vector of covariates \(\mathbf{x}\) and a binary response \(y\), with observed values \(\mathbf{x}_i\) and \(y_i\), \(i=1,\cdots,n\), respectively. Specifically, \[\begin{equation} y_i|\boldsymbol{\beta}\stackrel{ind.}{\sim} \{\Phi(\mathbf{x}_i^T\boldsymbol{\beta})\}^{y_i}\{1-\Phi(\mathbf{x}_i^T\boldsymbol{\beta})\}^{1-y_i},\quad i=1,\cdots,n \tag{4.63} \end{equation}\] where \(\Phi(\cdot)\) denotes the standard normal c.d.f., and \(\boldsymbol{\beta}\) is the vector of regression coefficients, which is assigned a flat prior. Assume that the design matrix of covariates \(\mathbf{X}\) is of full rank.

Moreover, assume that the binary response \(y\) arises by discretizing an underlying (real-valued) continuous response \(z\). Specifically, we observe \(y_i=1\) if \(z_i>0\) or \(y_i=0\) if \(z_i\leq 0\), where \(z_i\) is the latent (unobserved) continuous response corresponding to observed response \(y_i\), for \(i=1,\cdots,n\). Consider the following augmented regression model \[\begin{equation} \begin{split} &y_i|z_i\stackrel{ind.}{\sim} \{1(z_i>0)1(y_i=1)+1(z_i\leq 0)1(y_i=0)\},\quad i=1,\cdots,n\\ &z_i|\boldsymbol{\beta}\stackrel{ind.}{\sim} N(\mathbf{x}_i^T\boldsymbol{\beta},1),\quad i=1,\cdots,n \end{split} \tag{4.64} \end{equation}\] where \(1(\upsilon\in A)\) denotes the indicator function with value \(1\) when \(\upsilon\in A\). Hence, note that the first stage of model (4.64) denotes the (degenerate) distribution for \(y_i\) conditional on the sign of \(z_i\) as described above.

  1. (30%) Show how model (4.63) can be obtained from model (4.64).

  2. (70%) Use the model (4.64) representation to develop a Gibbs sampler for the posterior distribution of \(\boldsymbol{\beta}\). Provide details of the posterior full conditionals for \(\boldsymbol{\beta}\) and for the \(z_i\), \(i=1,\cdots,n\).

Exercise 4.12 (By Raquel, FYE 2009) 1. (40%) Consider the following model \[\begin{equation} \begin{split} &y_{i,j}|\theta_i\stackrel{ind.}{\sim} N(\theta_i,\upsilon),\quad i=1,\cdots,I, j=1,\cdots,n_i\\ &\theta_i\stackrel{i.i.d.}{\sim} N(\mu,\sigma^2),\quad i=1,\cdots,I \end{split} \tag{4.65} \end{equation}\] where \(\upsilon,\mu\) and \(\sigma^2\) are assumed known.

Write the model as a linear regression model and find \(p(\theta_1,\cdots,\theta_I|\mathbf{y})\), where \(\mathbf{y}=\{y_{i,j},i=1,\cdots,I, j=1,\cdots,n_i\}\). What is the mode of this distribution?

  2. (60%) Consider again the model from part 1, but now assume that \(\mu\) is unknown, with \[\begin{equation} \begin{split} &\mu|\phi\sim N(m,\phi)\\ &\phi\sim Inv-\chi^2(\nu) \end{split} \tag{4.66} \end{equation}\] where \(m\) and \(\nu\) are fixed (and \(\upsilon\) and \(\sigma^2\) are still assumed known).
Write down the joint posterior \(p(\theta_1,\cdots,\theta_I,\mu,\phi|\mathbf{y})\) (up to a proportionality constant), and show that samples from this posterior distribution can be obtained via Gibbs sampling. Provide the details of this algorithm (i.e., find all the full conditional distributions).

Proof. 1. Linear model representation: \[\begin{equation} \begin{split} &\mathbf{y}=\mathbf{X}\boldsymbol{\theta}+\boldsymbol{\epsilon},\quad \boldsymbol{\epsilon}\sim N(\mathbf{0},\Sigma_y)\\ &\boldsymbol{\theta}=\mathbf{1}\mu+\boldsymbol{\xi},\quad \boldsymbol{\xi}\sim N(\mathbf{0},\Sigma_{\theta}) \end{split} \tag{4.67} \end{equation}\] with \(\Sigma_y=\upsilon\mathbf{I}_N\), \(\Sigma_{\theta}=\sigma^2\mathbf{I}_I\) and \(N=\sum_in_i\).

The posterior distribution is multivariate normal. The mode could be found by maximizing the log-posterior, by writing the model above as a linear regression model with a non-informative prior and using some results from class, or by using properties of the multivariate normal distribution. The first and second options appear below.

(a). The log-posterior is given by \[\begin{equation} \log(p(\theta_1,\cdots,\theta_I|\mathbf{y}))=K-\frac{\sum_{i=1}^I\sum_{j=1}^{n_i}(y_{i,j}-\theta_i)^2}{2\upsilon}-\frac{\sum_{i=1}^I(\theta_i-\mu)^2}{2\sigma^2} \tag{4.67} \end{equation}\] Now, \[\begin{equation} \frac{\partial\log(p(\boldsymbol{\theta}|\mathbf{y}))}{\partial\theta_i}=\frac{\sum_{j=1}^{n_i}(y_{i,j}-\theta_i)}{\upsilon}-\frac{\theta_i-\mu}{\sigma^2} \tag{4.68} \end{equation}\] and so, \[\begin{equation} \hat{\theta}_i=\frac{\mu/\sigma^2+\sum_{j=1}^{n_i}y_{i,j}/\upsilon}{1/\sigma^2+n_i/\upsilon} \tag{4.69} \end{equation}\]

(b). Write the model as \(\mathbf{y}_*=\mathbf{X}_*\boldsymbol{\theta}+\boldsymbol{\epsilon}_*\), with \(\boldsymbol{\epsilon}_*\sim N(\mathbf{0},\Sigma_*)\), \(p(\boldsymbol{\theta})\propto 1\) and \[\begin{equation} \mathbf{y}_*=\begin{pmatrix} \mathbf{y} \\ \mu \\ \vdots \\ \mu \end{pmatrix}, \quad \mathbf{X}_*=\begin{pmatrix} \mathbf{X} \\ \mathbf{I}_I \end{pmatrix},\quad \Sigma_*=\mathrm{blockdiag}(\Sigma_y,\Sigma_{\theta}) \tag{4.70} \end{equation}\] Then, the posterior distribution is \(p(\boldsymbol{\theta}|\mathbf{y})=N(\hat{\boldsymbol{\theta}},\mathbf{V}_{\theta})\), with
\[\begin{equation} \hat{\boldsymbol{\theta}}=(\mathbf{X}_*^T\Sigma_*^{-1}\mathbf{X}_*)^{-1}\mathbf{X}_*^T\Sigma_*^{-1}\mathbf{y}_* \tag{4.71} \end{equation}\] and \(\mathbf{V}_{\theta}=(\mathbf{X}_*^T\Sigma_*^{-1}\mathbf{X}_*)^{-1}\), and so, the posterior mode is \(\hat{\boldsymbol{\theta}}\).

  2. Gibbs sampling:
  • (a). \(p(\theta_i|\boldsymbol{\theta}_{(-i)},\mu,\phi,\mathbf{y})=N(\theta_i|\hat{\theta}_i,V_{\theta_i})\), with \[\begin{equation} \begin{split} &V_{\theta_i}=(\frac{1}{\sigma^2}+\frac{n_i}{\upsilon})^{-1}\\ &\hat{\theta}_i=V_{\theta_i}(\frac{\mu}{\sigma^2}+\frac{\sum_{j=1}^{n_i}y_{i,j}}{\upsilon}) \end{split} \tag{4.72} \end{equation}\]

  • (b). \(p(\mu|\boldsymbol{\theta},\phi,\mathbf{y})=N(\mu|m_{\mu},V_{\mu})\) with \[\begin{equation} \begin{split} &V_{\mu}=(\frac{I}{\sigma^2}+\frac{1}{\phi})^{-1}\\ &m_{\mu}=V_{\mu}(\frac{\sum_{i=1}^I\theta_i}{\sigma^2}+\frac{m}{\phi}) \end{split} \tag{4.73} \end{equation}\]

  • (c). \(p(\phi|\mu,\boldsymbol{\theta},\mathbf{y})\propto\phi^{-\frac{\nu+1}{2}-1}\exp\{-\frac{(\mu-m)^2+1}{2\phi}\}\), a scaled inverse-\(\chi^2\) distribution with \(\nu+1\) degrees of freedom and scale \(((\mu-m)^2+1)/(\nu+1)\), equivalently an \(IG(\frac{\nu+1}{2},\frac{(\mu-m)^2+1}{2})\).
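A sketch of the resulting Gibbs sampler in Python/NumPy, assuming the data are stored as a list of arrays `y_groups` (one array per group) and that \(\upsilon\), \(\sigma^2\), \(m\), and \(\nu\) are supplied; all names are illustrative.

```python
import numpy as np

def hierarchical_gibbs(y_groups, upsilon, sigma2, m, nu, n_iter=2000, seed=0):
    """Gibbs sampler for the hierarchical normal-means model (sketch)."""
    rng = np.random.default_rng(seed)
    I = len(y_groups)
    n_i = np.array([len(g) for g in y_groups])
    y_sum = np.array([np.sum(g) for g in y_groups])        # sum_j y_{i,j} for each group
    mu, phi = m, 1.0
    out = np.empty((n_iter, I + 2))
    for t in range(n_iter):
        # theta_i | ... : N(V_i (mu/sigma2 + sum_j y_{i,j}/upsilon), V_i), V_i = (1/sigma2 + n_i/upsilon)^{-1}
        V = 1.0 / (1.0 / sigma2 + n_i / upsilon)
        theta = rng.normal(V * (mu / sigma2 + y_sum / upsilon), np.sqrt(V))
        # mu | ... : N(V_mu (sum_i theta_i/sigma2 + m/phi), V_mu), V_mu = (I/sigma2 + 1/phi)^{-1}
        V_mu = 1.0 / (I / sigma2 + 1.0 / phi)
        mu = rng.normal(V_mu * (theta.sum() / sigma2 + m / phi), np.sqrt(V_mu))
        # phi | ... : IG((nu+1)/2, ((mu-m)^2 + 1)/2), drawn as the reciprocal of a Gamma
        phi = 1.0 / rng.gamma((nu + 1) / 2, 2.0 / ((mu - m) ** 2 + 1.0))
        out[t] = np.concatenate([theta, [mu, phi]])
    return out
```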

Exercise 4.13 (By Thanos, FYE 2009 (Actually a GLM Problem)) 1. (20%) Consider the Poisson distribution with mean \(\mu\), denoted by \(Poisson(\cdot;\mu)\). Write the probability mass function of the distribution in the exponential dispersion family form, and provide the expressions for the natural parameter and the variance function.

  2. (40%) Let \(y_i\) be realizations of independent random variables \(Y_i\) with \(Poisson(\cdot;\mu_i)\) distributions, for \(i=1,\cdots,n\) (where \(E(Y_i)=\mu_i\)). Consider the Poisson generalized linear model (glm) based on link function \(\log(\mu_i)=\mathbf{x}_i^T\boldsymbol{\beta}\) where \(\mathbf{x}_i\) is the covariate vector for response \(y_i\) and \(\boldsymbol{\beta}\) is the vector of regression coefficients. Obtain the deviance for the comparison of the Poisson glm above with the full model that includes a different \(\mu_i\) for each \(y_i\).

  3. (40%) Consider a Bayesian formulation for the special case of the Poisson glm from part 2 based on a single covariate with values \(x_i\). That is, \[\begin{equation} y_i|\beta_1,\beta_2\stackrel{ind.}{\sim}Poisson(y_i;\mu_i=\exp(\beta_1+\beta_2x_i)),\quad i=1,\cdots,n \tag{4.74} \end{equation}\] with \(N(a_j,b_j^2)\) priors for \(\beta_j\), \(j=1,2\). Derive a Metropolis-Hastings algorithm to sample from \(p(\beta_1,\beta_2|data)\), the posterior distribution of \((\beta_1,\beta_2)\), where data is \(\{(y_i,x_i):i=1,\cdots,n\}\).

Proof. 1. For the Poisson p.m.f. with mean \(\mu\), we have \[\begin{equation} f(y;\mu)=\mu^y\exp(-\mu)(y!)^{-1}=\exp\{y\log(\mu)-\mu-\log(y!)\},\quad y\in\{0,1,2,\cdots\} \tag{4.75} \end{equation}\] which is of the EDF form with natural parameter \(\theta=\log(\mu)\), \(b(\theta)=\exp(\theta)\), and dispersion parameter \(\phi=1\). The variance function is given by \(b^{\prime\prime}(\theta)=\exp(\theta)=\mu\).

  2. Under the full model, the ML estimates are \(\tilde{\mu}_i=y_i\) and thus, \(\tilde{\theta}_i=\log(y_i)\). Under the Poisson glm with the logarithmic link function, we have \(\hat{\theta}_i=\log(\hat{\mu}_i)=\mathbf{x}_i^T\hat{\boldsymbol{\beta}}\), where \(\hat{\boldsymbol{\beta}}\) is the vector of ML estimates for \(\boldsymbol{\beta}\).

The deviance is given, in general, by \(D=-2[\ell(\hat{\boldsymbol{\theta}};data)-\ell(\tilde{\boldsymbol{\theta}};data)]\), where \(\hat{\boldsymbol{\theta}}=\{\hat{\theta}_i:i=1,\cdots,n\}\), \(\tilde{\boldsymbol{\theta}}=\{\tilde{\theta}_i:i=1,\cdots,n\}\), and \(\ell(\boldsymbol{\theta};data)\) is the log-likelihood function (expressed as a function of the natural parameters). Using the Poisson likelihood form and the expressions for \(\tilde{\boldsymbol{\theta}}\) and \(\hat{\boldsymbol{\theta}}\) above, we obtain \[\begin{equation} \begin{split} D&=2\sum_{i=1}^n[y_i\log(\frac{y_i}{\hat{\mu}_i})-(y_i-\hat{\mu}_i)]\\ &=2\sum_{i=1}^n[(y_i\log(y_i)-y_i)+(\exp(\mathbf{x}_i^T\hat{\boldsymbol{\beta}})-y_i(\mathbf{x}_i^T\hat{\boldsymbol{\beta}}))] \end{split} \tag{4.76} \end{equation}\]

  3. We have \[\begin{equation} \begin{split} p(\beta_1,\beta_2|data)&\propto \exp[-\frac{(\beta_1-a_1)^2}{2b_1^2}-\frac{(\beta_2-a_2)^2}{2b_2^2}\\ &+\sum_{i=1}^n(y_i(\beta_1+\beta_2x_i)-\exp(\beta_1+\beta_2x_i))]\\ &=h(\beta_1,\beta_2) \end{split} \tag{4.77} \end{equation}\]

A random walk Metropolis-Hastings algorithm, with a normal proposal distribution, would proceed as follows:

  • Let \((\beta_1^{(t)},\beta_2^{(t)})\) be the current state of the chain.

  • Draw \((\beta_1^*,\beta_2^*)\) from a normal distribution with mean vector \((\beta_1^{(t)},\beta_2^{(t)})\) and covariance matrix \(D\), which can be specified through the inverse of the expected Fisher information matrix estimated at the MLEs for \((\beta_1,\beta_2)\).

  • Set \((\beta_1^{(t+1)},\beta_2^{(t+1)})=(\beta_1^*,\beta_2^*)\) with probability \(\min\{1,h(\beta_1^*,\beta_2^*)/h(\beta_1^{(t)},\beta_2^{(t)})\}\), and \((\beta_1^{(t+1)},\beta_2^{(t+1)})=(\beta_1^{(t)},\beta_2^{(t)})\) otherwise.

The full Bayesian model, extended to include the distribution of a new response \(y_0\) with covariate value \(x_0\), is given by \[\begin{equation} p(y_0,\beta_1,\beta_2|x_0,data)=Poisson(y_0;\mu_0=\exp(\beta_1+\beta_2x_0))p(\beta_1,\beta_2|data) \tag{4.78} \end{equation}\] and, therefore, the posterior predictive distribution of the new response \(y_0\), associated with specified covariate value \(x_0\), is given by \[\begin{equation} p(y_0|x_0,data)=\int\int Poisson(y_0;\mu_0=\exp(\beta_1+\beta_2x_0))p(\beta_1,\beta_2|data)d\beta_1d\beta_2 \tag{4.79} \end{equation}\]
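A sketch of the random-walk Metropolis step described above, in Python/NumPy; the proposal covariance `D`, the prior hyperparameter arrays `a` and `b`, and the data arrays are assumed given, and all names are illustrative.

```python
import numpy as np

def log_h(beta, y, x, a, b):
    """Log of the unnormalized posterior h(beta_1, beta_2) in (4.77)."""
    eta = beta[0] + beta[1] * x
    log_prior = -np.sum((beta - a) ** 2 / (2 * b**2))
    return log_prior + np.sum(y * eta - np.exp(eta))

def poisson_mh(y, x, a, b, D, n_iter=10000, seed=0):
    """Random-walk Metropolis-Hastings for the Poisson regression model (sketch)."""
    rng = np.random.default_rng(seed)
    beta = np.zeros(2)
    draws = np.empty((n_iter, 2))
    for t in range(n_iter):
        prop = rng.multivariate_normal(beta, D)       # symmetric random-walk proposal
        log_ratio = log_h(prop, y, x, a, b) - log_h(beta, y, x, a, b)
        if np.log(rng.uniform()) < log_ratio:         # accept with prob min(1, h(prop)/h(beta))
            beta = prop
        draws[t] = beta
    return draws
```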