Chapter 2 Parameter Estimation

Suppose \(Y \sim Bin(n, \theta)\) and \(\theta \sim Be(a, b)\). Then \(p(\theta\mid y) \propto p(y\mid \theta)p(\theta) \propto \theta^{y}(1-\theta)^{n-y}\,\theta^{a-1}(1-\theta)^{b-1} = \theta^{a + y - 1}(1-\theta)^{b+n-y-1}\). Thus \(\theta\mid y \sim Be(a+y, b+n-y)\).
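
As a quick numerical check of this conjugate update (a minimal sketch; the values n = 10, y = 3, a = b = 1 are arbitrary illustrations), the normalized prior-times-likelihood on a grid should match the analytic \(Be(a+y, b+n-y)\) density:

# Grid check of the Beta-binomial conjugate update (illustrative values)
n <- 10; y <- 3; a <- 1; b <- 1
theta <- seq(0.001, 0.999, by = 0.001)
unnorm <- dbinom(y, n, theta) * dbeta(theta, a, b)    # p(y | theta) p(theta)
post_grid <- unnorm / sum(unnorm * 0.001)             # normalize numerically
max(abs(post_grid - dbeta(theta, a + y, b + n - y)))  # should be near 0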

A \(100(1 - \alpha)\%\) credible interval is any interval that contains the parameter with posterior probability \(1- \alpha\); formal definitions are given below. To derive a Bayesian point estimator, define a loss function \(L(\theta, \hat\theta) = -U(\theta, \hat\theta)\), where \(U\) is a utility function, \(\theta\) is the parameter of interest, and \(\hat\theta = \hat\theta(y)\) is an estimator of \(\theta\).

Find the estimator that minimizes the posterior expected loss: \(\hat\theta_{Bayes} = \arg\min_{\hat\theta}E\left[ L(\theta, \hat\theta)\mid y\right]\). Common estimators, illustrated in the short R sketch after this list, are:

  • Mean: \(\hat\theta_{Bayes} = E(\theta\mid y)\) minimizes \(L(\theta, \hat\theta) = (\theta - \hat\theta)^2\).
  • Median: \(\int_{\hat\theta}^\infty p(\theta\mid y) d\theta = \frac{1}{2}\) minimizes \(L(\theta, \hat\theta) = |\theta - \hat\theta |\).
  • Mode: \(\hat\theta_{Bayes} = \arg\max_{\theta} p(\theta\mid y)\) is obtained by minimizing \(L(\theta, \hat\theta) = -I(|\theta- \hat\theta| < \epsilon)\) as \(\epsilon \rightarrow 0\); it is also called the maximum a posteriori (MAP) estimator.
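
For the \(Be(a+y, b+n-y) = Be(4, 8)\) posterior from the binomial example above (a minimal sketch; n = 10, y = 3, a = b = 1), the three point estimates are:

# Point estimates under the Be(4, 8) posterior
a_post <- 1 + 3; b_post <- 1 + 10 - 3
post_mean   <- a_post / (a_post + b_post)             # E(theta | y); squared error loss
post_median <- qbeta(0.5, a_post, b_post)             # absolute error loss
post_mode   <- (a_post - 1) / (a_post + b_post - 2)   # MAP; 0-1 loss limit
c(mean = post_mean, median = post_median, mode = post_mode)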

A \(100(1- \alpha)\%\) credible interval is any interval \((L, U)\) such that \(1 - \alpha = \int_{L}^U p(\theta\mid y) d\theta\). Some typical intervals, computed in the R sketch after this list, are:

  • Equal-tailed: \(\alpha/2 = \int_{-\infty}^L p(\theta\mid y)d\theta = \int_{U}^\infty p(\theta\mid y)d\theta\).
  • One-sided: either \(L = -\infty\) or \(U = \infty\).
  • Highest posterior density (HPD): \(p(L\mid y) = p(U\mid y)\), i.e. (for a unimodal posterior) the shortest interval containing probability \(1 - \alpha\).
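
For the same \(Be(4, 8)\) posterior, an equal-tailed interval uses the posterior quantile function, and an HPD interval can be approximated as the shortest interval among those containing probability \(1-\alpha\) (a minimal sketch, valid for this unimodal posterior):

# 95% equal-tailed credible interval
qbeta(c(0.025, 0.975), 4, 8)

# Approximate 95% HPD interval: shortest interval with posterior probability 0.95
alpha <- 0.05
lower <- seq(0, alpha, by = 1e-4)
L <- qbeta(lower, 4, 8)
U <- qbeta(lower + 1 - alpha, 4, 8)
c(L[which.min(U - L)], U[which.min(U - L)])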

A prior probability distribution, often simply called the prior, of an uncertain quantity \(\theta\) is the probability distribution that expresses one's uncertainty about \(\theta\) before the data are taken into account.

A prior \(p(\theta)\) is conjugate if, for \(p(\theta) \in \mathcal{P}\) and \(p(y\mid \theta) \in \mathcal{F}\), we have \(p(\theta\mid y) \in \mathcal{P}\), where \(\mathcal{F}\) and \(\mathcal{P}\) are families of distributions.

Example: the beta distribution (\(\mathcal{P}\)) is conjugate to the binomial distribution (\(\mathcal{F}\)). Discrete priors are conjugate, and mixtures of conjugate priors are conjugate.

A natural conjugate prior is a conjugate prior that has the same functional form as the likelihood.

Example: the beta distribution is a natural conjugate prior since \(p(\theta) \propto \theta^{a-1}(1-\theta)^{b-1}\) and \(L(\theta) \propto \theta^y(1-\theta)^{n-y}\).

Mixture of conjugate priors: Suppose \(\theta \sim \sum_{i = 1}^I p_ip_i(\theta)\) with \(\sum_{i=1}^I p_i = 1\), and let \(p_i(y) = \int p(y\mid \theta) p_i(\theta)d\theta\). Then \(\theta\mid y \sim \sum_{i=1}^I p_i'p_i(\theta\mid y)\) with \(p_i' \propto p_ip_i(y)\), where \(p_i(\theta\mid y) = p(y \mid \theta)p_i(\theta)/p_i(y)\).

Example: \(Y \sim Bin(n, \theta)\), \(\theta \sim pBe(a_1, b_1) + (1-p)Be(a_2, b_2)\), then \[ \theta\mid y \sim p'Be(a_1 + y, b_1 + n - y) + (1 - p')Be(a_2 + y, b_2 + n - y) \] with \(p' = \frac{pp_1(y)}{pp_1(y) + (1-p)p_2(y)}\), \(p_i(y) = {n \choose y} \frac{Beta(a_i + y, b_i + n - y)}{Beta(a_i, b_i)}\).
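The posterior weight \(p'\) can be computed directly; here \(Beta(\cdot, \cdot)\) is the beta function, beta() in R (a minimal sketch with arbitrary illustrative values for n, y, p, and the component hyperparameters):

# Posterior mixture weight for a two-component Beta prior
n <- 10; y <- 3; p <- 0.7
a1 <- 2; b1 <- 8                                   # first prior component
a2 <- 8; b2 <- 2                                   # second prior component
marg <- function(a, b) choose(n, y) * beta(a + y, b + n - y) / beta(a, b)  # p_i(y)
p_prime <- p * marg(a1, b1) / (p * marg(a1, b1) + (1 - p) * marg(a2, b2))
p_prime                                            # weight on Be(a1 + y, b1 + n - y)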

A default prior is used when a data analyst is unable or unwilling to specify an informative prior distribution.

Suppose we use \(\phi = \log (\theta/(1-\theta))\), the log odds of the parameter, and set \(p(\phi) \propto 1\). Then the implied prior on \(\theta\) is \[ \begin{aligned} p_{\theta}(\theta) &\propto 1\cdot\left|\frac{d}{d \theta} \log (\theta /[1-\theta])\right| \\ &=\frac{1-\theta}{\theta}\left[\frac{1}{1-\theta}+\frac{\theta}{[1-\theta]^{2}}\right] \\ &=\frac{1-\theta}{\theta}\left[\frac{[1-\theta]+\theta}{[1-\theta]^{2}}\right] \\ &=\theta^{-1}[1-\theta]^{-1}, \end{aligned} \] a \(Be(0, 0)\) if that were a proper distribution. This differs from setting \(p(\theta) \propto 1\), which corresponds to the \(Be(1,1)\) prior.
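
The key step is the Jacobian \(|d\phi/d\theta| = \theta^{-1}(1-\theta)^{-1}\), which can be checked with a central finite difference (a minimal sketch; the grid of \(\theta\) values is arbitrary):

# Numerical check of |d phi / d theta| against theta^-1 (1 - theta)^-1
theta <- c(0.1, 0.3, 0.5, 0.7, 0.9)
logit <- function(t) log(t / (1 - t))
numeric_jac <- (logit(theta + 1e-6) - logit(theta - 1e-6)) / (2e-6)
cbind(numeric = numeric_jac, analytic = 1 / (theta * (1 - theta)))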

Jeffreys prior is a prior that is invariant to reparameterization and is obtained via \[ p(\theta) \propto \sqrt{\det \mathcal{I}(\theta)} \] where \(\mathcal{I}(\theta)\) is the Fisher information.

Example: for a binomial distribution, \(\mathcal{I}(\theta) = \frac{n}{\theta(1-\theta)}\), so \(p(\theta) \propto \theta^{-1/2} (1-\theta)^{-1/2}\), i.e. a \(Be(1/2, 1/2)\) prior.
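
The Fisher information formula can be verified numerically by averaging the negative second derivative of the log-likelihood over the sampling distribution of \(y\) (a minimal sketch; \(\theta = 0.3\) and \(n = 10\) are arbitrary):

# Numerical check of the binomial Fisher information n / (theta (1 - theta))
n <- 10; theta0 <- 0.3; h <- 1e-4
loglik <- function(theta, y) dbinom(y, n, theta, log = TRUE)
d2 <- sapply(0:n, function(y)
  (loglik(theta0 + h, y) - 2 * loglik(theta0, y) + loglik(theta0 - h, y)) / h^2)
c(numeric = -sum(dbinom(0:n, n, theta0) * d2),
  analytic = n / (theta0 * (1 - theta0)))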

Improper prior: An unnormalized density, \(f(\theta)\), is proper if \(\int f(\theta)d\theta = c < \infty\), and otherwise it is improper.

Example: \(Be(0, 0)\) is an improper distribution. If \(Y \sim Bin(n, \theta)\) and \(\theta \sim Be(0, 0)\), the posterior \(\theta\mid y \sim Be(y, n-y)\) is proper only if \(0 < y < n\).

Normal Distribution

Thm: If \(Y_i \stackrel{iid}{\sim} N(\mu, s^2)\) (\(s^2\) is known), Jeffreys prior for \(\mu\) is \(p(\mu) \propto 1\) and the posterior is \[ p(\mu \mid y) \propto \exp\left(-\frac{1}{2s^2/n} [\mu^2 - 2\mu\bar y] \right) \sim N(\bar y, s^2/n) \] The natural conjugate prior is \(\mu \sim N(m, C)\) and the posterior \(\mu\mid y \sim N(m', C')\) where \(C' = [\frac{1}{C} + \frac{n}{s^2}]^{-1}\) and \(m' = C'[\frac{m}{C} + \frac{n}{s^2}\bar y]\).

  • The posterior precision is the sum of the prior precision and the observation precision.
  • The posterior mean is a precision-weighted average of the prior mean and the sample mean.
  • The Jeffreys prior/posterior are the limits of the conjugate prior/posterior as \(C \rightarrow \infty\), i.e. as \(C\rightarrow \infty\), \(N(m, C)\) approaches the flat prior \(p(\mu) \propto 1\) and \(N(m', C') \stackrel{d}{\rightarrow} N(\bar y, s^2/n)\).
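
A direct implementation of this update (a minimal sketch; the simulated data and the prior hyperparameters m and C are arbitrary illustrations):

# Conjugate update for mu with known variance s2
set.seed(1)
s2 <- 4; n <- 20
y  <- rnorm(n, mean = 2, sd = sqrt(s2))
m <- 0; C <- 100                           # prior: mu ~ N(m, C)
Cp <- 1 / (1 / C + n / s2)                 # posterior variance C'
mp <- Cp * (m / C + n * mean(y) / s2)      # posterior mean m'
c(m_prime = mp, C_prime = Cp, jeffreys_mean = mean(y), jeffreys_var = s2 / n)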

Thm: If \(Y_i \stackrel{iid}{\sim} N(m, \sigma^2)\) (\(m\) is known), the Jeffreys prior for \(\sigma^2\) is \(p(\sigma^2) \propto 1/\sigma^2\) and the posterior is \[ p(\sigma^2\mid y) \propto (\sigma^2)^{-n/2-1}\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n [y_i - m]^2 \right) \sim IG\left(n/2, \sum_{i=1}^n [y_i - m]^2/2\right). \] The natural conjugate prior is \(\sigma^2 \sim IG(a, b)\) and the posterior is \(\sigma^2\mid y \sim IG(a + n/2, b + \sum_{i=1}^n [y_i - m]^2/2)\).
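
The corresponding conjugate update for \(\sigma^2\) (a minimal sketch; the data are simulated and the \(IG(a, b)\) hyperparameters are arbitrary illustrations):

# Conjugate IG update for sigma^2 with known mean m
set.seed(1)
m <- 0; n <- 20
y <- rnorm(n, mean = m, sd = 2)
a <- 1; b <- 1                             # prior: sigma^2 ~ IG(a, b)
a_post <- a + n / 2
b_post <- b + sum((y - m)^2) / 2
c(a_post = a_post, b_post = b_post,
  post_mean = b_post / (a_post - 1))       # E[sigma^2 | y] = b'/(a'-1) for a' > 1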

JAGS
# Y ~ Bin(n, theta), theta ~ Be(1, 1), we observe Y = 3 successes out of 10 attempts
library(rjags)
model = "
model
{
y ~ dbin(theta,n)   # notice p then n
theta ~ dbeta(a,b)
}"

dat = list(n=10, y=3, a=1, b=1)
m = jags.model(textConnection(model), dat)   # compile the model and provide the data

r = coda.samples(m, "theta", n.iter=1000)    # draw 1000 posterior samples of theta
summary(r)
plot(r)
Stan
library(rstan)
model = "
data {
int<lower=0> n; // define range and type
int<lower=0> a; // and notice semicolons
int<lower=0> b;
int<lower=0, upper=n> y;
}
parameters {
real<lower=0, upper=1> theta;
}
model {
y ~ binomial(n, theta) ;
theta ~ beta(a, b) ;
}
"

dat = list(n=10, y=3, a=1, b=1)
m = stan_model(model_code = model) # Only needs to be done once
r = sampling(m, data = dat)
r
## Inference for Stan model: 7f934cc8d471003538635fa74104a81a.
## 4 chains, each with iter=2000; warmup=1000; thin=1; 
## post-warmup draws per chain=1000, total post-warmup draws=4000.
## 
##        mean se_mean   sd   2.5%   25%   50%   75% 97.5% n_eff Rhat
## theta  0.33    0.00 0.13   0.11  0.23  0.32  0.42  0.62  1538    1
## lp__  -8.18    0.02 0.75 -10.48 -8.36 -7.90 -7.70 -7.64  1301    1
## 
## Samples were drawn using NUTS(diag_e) at Fri Aug  5 13:54:24 2022.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).
plot(r)
## ci_level: 0.8 (80% intervals)
## outer_level: 0.95 (95% intervals)