19.1 Comparing Groups with a Hierarchical Model

Example 19.1 Example 17.1 which concerned whether newborns born to mothers who smoke tend to weigh less at birth than newborns from mothers who don’t smoke.

Assume birthweights follow a Normal distribution with mean \(\mu_1\) for nonsmokers and mean \(\mu_2\) for smokers, and standard deviation \(\sigma\). (We’re assuming a common standard deviation to simplify a little, but we could also let standard deviation vary by smoking status.)

  1. Explain how you could model this scenario with a hierarchical model. Specify all levels in the hierarchy and the relevant parameters and distributions at each level.
  2. Use simulation to approximate the joint and marginal prior distributions of \(\mu_1\) and \(\mu_2\), and the marginal prior distribution of \(\mu_1-\mu_2\). How does the prior compare to one in Example 19.1?
  3. Use JAGS to fit the model to the data. Summarize the joint and marginal posterior distributions of \(\mu_1\) and \(\mu_2\), and the marginal posterior distribution of \(\mu_1-\mu_2\), and of the effect size. How does the posterior compare the one in Example19.1?
  4. Do the conclusions change substantially compared to those in 19.1?
Solution. to Example 19.1
  1. As in Example 19.1 we assume that given \(\mu_1, \mu_2, \sigma\), individual birthweights are Normally distributed with a mean that depends on smoking status: \(Y_{i, x}\sim N(\mu_x, \sigma)\). \(\sigma\) represents the baby-to-baby variability in birthweights. We could have separate \(\sigma\)’s for each smoking status. Our main reason for not doing so is simplicity, and to be consistent with Example 19.1. Also to be consistent with Example 19.1 we’ll assume \((1/\sigma^2)\) has a Gamma(1, 1) prior distribution (though this isn’t the greatest choice.) Now we assume that given \(\mu_0, \sigma_\mu\), the group means \(\mu_1\) and \(\mu_2\) are i.i.d. \(N(\mu_0, \sigma_\mu)\). \(\mu_0\) represents the overall average birthweight, and \(\sigma_\mu\) represents how much the mean birthweights would vary from group to group. The hyperparameters are \(\mu_0\) and \(\sigma_\mu\). The prior on \(\mu_0\) represents prior uncertainty about the overall average birthweight. We’ll assume a Normal distribution prior for \(\mu_0\) with prior mean 7.5 and prior SD 0.5, consistent with to be consistent with Example 19.1. \(\sigma_mu\) represents how much means vary from group to group. Recall that in Example we used a prior that gave rather large credibility to no difference in means. To be somewhat consistent we’ll use an informative prior46 which places large credibility on \(\sigma_\mu\) being small, Gamma(1, 3).

  2. While the prior distribution isn’t the same as in Example 19.1, there is still a strong prior credibility of no difference. However, we see that the prior distribution of \(\mu_1-\mu_2\) has heavier tails here than it did in Example 19.1.

    N = 10000
    
    # hyperprior
    mu0 = rnorm(N, 7.5, 0.5)
    sigma_mu = rgamma(N, 1, 3)
    
    # prior
    mu1 = rnorm(N, mu0, sigma_mu)
    mu2 = rnorm(N, mu0, sigma_mu)
    
    theta_sim_prior = data.frame(mu0, mu1, mu2)
    
    hist(theta_sim_prior$mu1 - theta_sim_prior$mu2, xlab = "mu1 - mu2")
    
    ggplot(theta_sim_prior, aes(mu1, mu2)) +
      geom_point(color = "skyblue", alpha = 0.4) +
      geom_density_2d(color = "orange", size = 1) +
      geom_abline(intercept = 0, slope = 1, color="black", 
                 linetype="dashed")

  3. The posterior is pretty similar to the one in Example19.1. The min difference is that the posterior for the hierarchical model gives a little more credibility to a large difference in means.

    # data
    data = read.csv("_data/baby_smoke.csv")
    
    y = data$weight
    
    x = (data$habit == "smoker") + 1
    
    n = length(y)
    
    n_groups = 2
    
    # Model
    model_string <- "model{
    
      # Likelihood
      for (i in 1:n){
      y[i] ~ dnorm(mu[x[i]], tau)
      }
    
      # Prior
      for (j in 1:n_groups){
      mu[j] ~ dnorm(mu0, 1 / sigma_mu ^ 2)
      }
    
      sigma <- 1 / sqrt(tau)
      tau ~ dgamma(1, 1)
    
      # Hyperprior
      mu0 ~ dnorm(7.5, 0.5)
      sigma_mu ~ dgamma(1, 3)
    
    }"
    
    dataList = list(y = y, x = x, n = n, n_groups = n_groups)
    
    # Compile
    model <- jags.model(textConnection(model_string),
                    data = dataList,
                    n.chains = 5)
    ## Compiling model graph
    ##    Resolving undeclared variables
    ##    Allocating nodes
    ## Graph information:
    ##    Observed stochastic nodes: 999
    ##    Unobserved stochastic nodes: 5
    ##    Total graph size: 2014
    ## 
    ## Initializing model
    # Simulate
    update(model, 1000, progress.bar = "none")
    
    Nrep = 10000
    
    posterior_sample <- coda.samples(model,
                                 variable.names = c("mu", "tau", "sigma", "mu0", "sigma_mu"),
                                 n.iter = Nrep,
                                 progress.bar = "none")
    
    # Summarize and check diagnostics
    summary(posterior_sample)
    ## 
    ## Iterations = 2001:12000
    ## Thinning interval = 1 
    ## Number of chains = 5 
    ## Sample size per chain = 10000 
    ## 
    ## 1. Empirical mean and standard deviation for each variable,
    ##    plus standard error of the mean:
    ## 
    ##            Mean      SD   Naive SE Time-series SE
    ## mu[1]    7.1339 0.05120 0.00022896     0.00030865
    ## mu[2]    6.9050 0.13763 0.00061550     0.00138229
    ## mu0      7.0355 0.26165 0.00117013     0.00130730
    ## sigma    1.5035 0.03356 0.00015009     0.00015150
    ## sigma_mu 0.2898 0.24164 0.00108063     0.00370578
    ## tau      0.4430 0.01977 0.00008843     0.00008926
    ## 
    ## 2. Quantiles for each variable:
    ## 
    ##             2.5%    25%    50%    75%  97.5%
    ## mu[1]    7.03411 7.0991 7.1336 7.1684 7.2351
    ## mu[2]    6.62959 6.8105 6.9079 7.0057 7.1533
    ## mu0      6.49627 6.9189 7.0453 7.1447 7.5891
    ## sigma    1.43919 1.4807 1.5030 1.5259 1.5708
    ## sigma_mu 0.01636 0.1273 0.2266 0.3816 0.9273
    ## tau      0.40530 0.4295 0.4427 0.4561 0.4828
    plot(posterior_sample)

    theta_sim_posterior = as.matrix(posterior_sample)
    head(theta_sim_posterior)
    ##         mu[1]    mu[2]      mu0    sigma  sigma_mu       tau
    ## [1,] 7.118449 7.003762 7.056282 1.460830 0.2253755 0.4685985
    ## [2,] 7.186183 6.663398 6.928209 1.461622 0.2231579 0.4680904
    ## [3,] 7.145686 6.996485 7.266448 1.431581 0.2255716 0.4879416
    ## [4,] 7.168712 7.013768 7.045537 1.506650 0.2429123 0.4405299
    ## [5,] 7.134420 6.810429 7.060628 1.500137 0.1177805 0.4443634
    ## [6,] 7.123729 6.895591 6.849835 1.509891 0.2858626 0.4386406
    theta_sim_posterior = as.data.frame(theta_sim_posterior)
    names(theta_sim_posterior)[c(1, 2)] = c("mu1", "mu2")
    
    ggplot(theta_sim_posterior, aes(mu1, mu2)) +
      geom_point(color = "skyblue", alpha = 0.4) +
      geom_density_2d(color = "orange", size = 1) +
      geom_abline(intercept = 0, slope = 1, color="black", 
                 linetype="dashed")

    cor(theta_sim_posterior$mu1, theta_sim_posterior$mu2)
    ## [1] -0.02514047
    mu_diff = theta_sim_posterior$mu1 - theta_sim_posterior$mu2
    
    hist(mu_diff, xlab = "mu1 - mu2")
    abline(v = quantile(mu_diff, c(0.025, 0.975)), lty = 2)

    quantile(mu_diff, c(0.025, 0.975))
    ##        2.5%       97.5% 
    ## -0.02419377  0.52482421
    mean(mu_diff)
    ## [1] 0.228875
    sum(mu_diff > 0 ) / 50000
    ## [1] 0.94582
    sigma = 1 / sqrt(theta_sim_posterior$tau)
    
    effect_size = mu_diff / sigma
    
    hist(effect_size, xlab = "effect size")
    abline(v = quantile(effect_size, c(0.95)), lty = 2)

    quantile(effect_size, c(0.95))
    ##       95% 
    ## 0.3187092
    mean(effect_size)
    ## [1] 0.152358
  4. The posterior distribution in the hierarchical model gives a little more credibility to a large difference in means. However, there are no major differences and the conclusions do not change substantially compared to those in 19.1.


  1. I did test a few different priors for \(\sigma_\mu\) and the other parameters, and the results are not very sensitive to choice of prior. For some discussion and recommendations regarding prior distributions for variance parameters, see Section 5.7 of Gelman et. al. Bayesian Data Analysis, 3rd ed.↩︎