6.1 The Gaussian linear model
The Gaussian linear model specifies
\mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\mu},
where \boldsymbol{\mu} \sim N(\mathbf{0}, \sigma^2 \mathbf{I}_N) is a stochastic error, \mathbf{X} is an N \times K matrix of regressors, \boldsymbol{\beta} is a K-dimensional vector of location coefficients, \sigma^2 is the variance (scale parameter) of the model, \mathbf{y} is an N-dimensional vector of the dependent variable, and N is the sample size. This model can be analyzed using the conjugate family of Section 3.3, that is,
\pi(\boldsymbol{\beta},\sigma^2) = \pi(\boldsymbol{\beta} \mid \sigma^2) \times \pi(\sigma^2),
which yields the marginal posterior distributions of \boldsymbol{\beta} and \sigma^2.
In this section, however, we assume independent priors, that is,
\pi(\boldsymbol{\beta},\sigma^2) = \pi(\boldsymbol{\beta}) \times \pi(\sigma^2),
where \boldsymbol{\beta} \sim N(\boldsymbol{\beta}_0, \mathbf{B}_0) and \sigma^2 \sim IG(\alpha_0/2, \delta_0/2), with \alpha_0/2 and \delta_0/2 as the shape and rate parameters. This setting allows deriving the posterior conditional distributions,
\pi(\boldsymbol{\beta} \mid \sigma^2, \mathbf{y}, \mathbf{X})
and
\pi(\sigma^2 \mid \boldsymbol{\beta}, \mathbf{y}, \mathbf{X}),
which in turn enables the use of the Gibbs sampler algorithm to perform posterior inference on \boldsymbol{\beta} and \sigma^2.
The likelihood function in this model is
\begin{align} p(\mathbf{y} \mid \boldsymbol{\beta}, \sigma^2, \mathbf{X}) = (2\pi\sigma^2)^{-\frac{N}{2}} \exp \left\{-\frac{1}{2\sigma^2} (\mathbf{y} - \mathbf{X} \boldsymbol{\beta})^{\top}(\mathbf{y} - \mathbf{X} \boldsymbol{\beta}) \right\}. \end{align}
Then, the conditional posterior distributions are
\begin{align} \boldsymbol{\beta} \mid \sigma^2, \mathbf{y}, \mathbf{X} \sim N(\boldsymbol{\beta}_n, \mathbf{B}_n), \end{align}
and
\begin{align} \sigma^2 \mid \boldsymbol{\beta}, \mathbf{y}, \mathbf{X} \sim IG(\alpha_n/2, \delta_n/2), \end{align}
where
\mathbf{B}_n = (\mathbf{B}_0^{-1} + \sigma^{-2} \mathbf{X}^{\top} \mathbf{X})^{-1},
\boldsymbol{\beta}_n= \mathbf{B}_n (\mathbf{B}_0^{-1} \boldsymbol{\beta}_0 + \sigma^{-2} \mathbf{X}^{\top} \mathbf{y}),
\alpha_n = \alpha_0 + N,
\delta_n = \delta_0 + (\mathbf{y} - \mathbf{X} \boldsymbol{\beta})^{\top} (\mathbf{y} - \mathbf{X} \boldsymbol{\beta}).
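The two conditional distributions above translate directly into code. The following minimal sketch runs the Gibbs sampler on simulated data using base R only (inverse-gamma draws via 1/rgamma, multivariate normal draws via a Cholesky factor); all names and values here are illustrative, not part of the application below.

```r
# Minimal Gibbs sampler for the Gaussian linear model (simulated data)
set.seed(10)
N <- 200; K <- 3
X <- cbind(1, matrix(rnorm(N * (K - 1)), N, K - 1))
beta.true <- c(1, -0.5, 2)
y <- X %*% beta.true + rnorm(N)
# Hyperparameters: beta ~ N(b0, B0), sigma2 ~ IG(a0/2, d0/2)
b0 <- rep(0, K); B0i <- diag(K) / 1000   # prior precision B0^{-1}
a0 <- 0.001; d0 <- 0.001
XtX <- t(X) %*% X; Xty <- t(X) %*% y
an <- a0 + N
S <- 2000; sig2 <- 1
draws <- matrix(0, S, K)
for (s in 1:S) {
  Bn <- solve(B0i + XtX / sig2)                         # B_n
  bn <- Bn %*% (B0i %*% b0 + Xty / sig2)                # beta_n
  beta <- bn + t(chol(Bn)) %*% rnorm(K)                 # beta | sig2 ~ N(bn, Bn)
  dn <- d0 + sum((y - X %*% beta)^2)                    # delta_n
  sig2 <- 1 / rgamma(1, shape = an / 2, rate = dn / 2)  # sig2 | beta ~ IG
  draws[s, ] <- beta
}
colMeans(draws[1001:S, ])   # posterior means, close to beta.true
```

Note that drawing \boldsymbol{\beta} as bn + t(chol(Bn)) %*% rnorm(K) is equivalent to MASS::mvrnorm(1, bn, Bn), which the application below uses.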
This model can be extended to consider heteroskedasticity such that y_i \sim N(\mathbf{x}_i^{\top} \boldsymbol{\beta}, \sigma^2/\tau_i), where \tau_i \sim G(v/2,v/2). See Exercise 2 for details.
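Under this extension, the Gibbs sampler gains a single extra step: conditional on \boldsymbol{\beta} and \sigma^2, each \tau_i has a Gamma conditional posterior, \tau_i \mid \boldsymbol{\beta}, \sigma^2, \mathbf{y} \sim G\left(\frac{v+1}{2}, \frac{v + (y_i - \mathbf{x}_i^{\top}\boldsymbol{\beta})^2/\sigma^2}{2}\right). A minimal sketch of that step in base R (all values illustrative):

```r
# Extra Gibbs step under heteroskedasticity:
# tau_i | beta, sig2, y ~ G((v + 1)/2, (v + e_i^2/sig2)/2),
# where e_i = y_i - x_i' beta. Illustrative values only.
set.seed(1)
N <- 5
X <- cbind(1, rnorm(N)); beta <- c(1, 0.5); sig2 <- 1; v <- 5
y <- X %*% beta + rnorm(N)
e <- as.vector(y - X %*% beta)   # residuals
tau <- rgamma(N, shape = (v + 1) / 2, rate = (v + e^2 / sig2) / 2)
# The beta and sig2 steps then use weighted data: scale row i of X and
# y_i by sqrt(tau_i) and proceed exactly as in the homoskedastic case.
tau
```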
Example: The market value of soccer players in Europe
Let’s analyze the determinants of the market value of soccer players in Europe. In particular, we use the dataset 1ValueFootballPlayers.csv, which is in the folder DataApp in our GitHub repository: https://github.com/besmarter/BSTApp. This dataset was used by Serna Rodríguez, Ramírez Hassan, and Coad (2019) to find the determinants of high-performance soccer players in the five most important national leagues in Europe.
The specification of the model is
\begin{align} \log(\text{Value}_i) &= \beta_1 + \beta_2 \text{Perf}_i + \beta_3 \text{Age}_i + \beta_4 \text{Age}^2_i + \beta_5 \text{NatTeam}_i \\ &\quad + \beta_6 \text{Goals}_i + \beta_7 \text{Exp}_i + \beta_8 \text{Exp}^2_i + \mu_i, \end{align}
where Value is the market value in Euros (2017), Perf is a measure of performance, Age is the player’s age in years, NatTeam is an indicator variable that takes the value of 1 if the player has been on the national team, Goals is the number of goals scored by the player during his career, and Exp is his experience in years.
We assume that the dependent variable follows a normal distribution, so we use the Gaussian linear model with vague independent normal and inverse-gamma priors, where
\mathbf{B}_0 = 1000 \mathbf{I}_{8}, \quad \boldsymbol{\beta}_0 = \mathbf{0}_{8}, \quad \alpha_0 = 0.001, \quad \delta_0 = 0.001.
We run the Gibbs sampler for 5,000 MCMC iterations after a burn-in of 5,000, with a thinning parameter equal to 1.
Once our GUI is displayed (see the beginning of this chapter), we should follow the next Algorithm to run linear Gaussian models in our GUI (see Chapter 5 for details).
Algorithm: Gaussian linear model
1. Select Univariate Models on the top panel
2. Choose the Normal model using the left radio button
3. Upload the dataset, indicating whether there is a header and specifying the separator (comma, semicolon, or tab)
4. Use the Browse button to select the file and preview the dataset
5. Adjust MCMC iterations, burn-in, and thinning using the Range sliders
6. Specify dependent and independent variables using the Formula builder
7. Click the Build formula button to generate the model formula in R
8. Modify the formula in the Main equation box if necessary
9. Set hyperparameters (mean vector, covariance matrix, shape, and scale parameters) if needed
10. Click the Go! button
11. Analyze results
12. Download posterior chains and diagnostic plots using the Download Posterior Chains and Download Posterior Graphs buttons
We can see in the following R code examples how to perform the linear Gaussian model using the MCMCregress command from the MCMCpack package, as well as how to program the Gibbs sampler ourselves. We should obtain similar results using all three approaches: GUI, package, and our function. In fact, our GUI relies on the MCMCregress command. For instance, the value of a top soccer player in Europe increases by 134% (\exp(0.85)-1) on average when he has played for the national team, with a 95% credible interval of (86%, 197%).
rm(list = ls())
set.seed(010101)
########################## Linear regression: Value of soccer players ##########################
Data <- read.csv("https://raw.githubusercontent.com/besmarter/BSTApp/refs/heads/master/DataApp/1ValueFootballPlayers.csv", sep = ",", header = TRUE, quote = "")
attach(Data)
y <- log(Value)
# Value: Market value in Euros (2017) of soccer players
# Regressor matrix including the intercept
X <- cbind(1, Perf, Age, Age2, NatTeam, Goals, Exp, Exp2)
# Perf: Performance. Age: Age in years. Age2: Age squared.
# NatTeam: Indicator of national team. Goals: Career goals scored.
# Exp: Years of experience. Exp2: Experience squared.
k <- dim(X)[2]
N <- dim(X)[1]
# Hyperparameters: alpha0 = delta0 = 0.001
d0 <- 0.001
a0 <- 0.001
b0 <- rep(0, k)
c0 <- 1000
B0 <- c0*diag(k)
B0i <- solve(B0)
# MCMC parameters
mcmc <- 5000
burnin <- 5000
tot <- mcmc + burnin
thin <- 1
# Posterior draws using the MCMCpack package. Note that MCMCregress
# parameterizes the normal prior on beta in terms of the precision matrix, so we pass B0i
posterior <- MCMCpack::MCMCregress(y~X-1, b0=b0, B0 = B0i, c0 = a0, d0 = d0, burnin = burnin, mcmc = mcmc, thin = thin)
summary(coda::mcmc(posterior))
##
## Iterations = 1:5000
## Thinning interval = 1
## Number of chains = 1
## Sample size per chain = 5000
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## X 3.695499 2.228060 3.151e-02 3.151e-02
## XPerf 0.035445 0.004299 6.079e-05 6.079e-05
## XAge 0.778410 0.181362 2.565e-03 2.565e-03
## XAge2 -0.016617 0.003380 4.781e-05 4.781e-05
## XNatTeam 0.850362 0.116861 1.653e-03 1.689e-03
## XGoals 0.009097 0.001603 2.266e-05 2.266e-05
## XExp 0.206208 0.062713 8.869e-04 8.428e-04
## XExp2 -0.006992 0.002718 3.844e-05 3.719e-05
## sigma2 0.969590 0.076091 1.076e-03 1.076e-03
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## X -0.545746 2.174460 3.653373 5.171463 8.177948
## XPerf 0.026933 0.032570 0.035421 0.038368 0.043817
## XAge 0.419057 0.656975 0.779534 0.902753 1.125442
## XAge2 -0.022967 -0.018928 -0.016651 -0.014366 -0.009919
## XNatTeam 0.627228 0.771339 0.852420 0.928765 1.075360
## XGoals 0.005914 0.007984 0.009108 0.010180 0.012272
## XExp 0.082206 0.164290 0.206742 0.248716 0.329809
## XExp2 -0.012290 -0.008829 -0.007002 -0.005188 -0.001762
## sigma2 0.832320 0.915580 0.965122 1.018776 1.127566
# Posterior distributions programming the Gibbs sampler ourselves
# Auxiliary parameters
XtX <- t(X)%*%X
bhat <- solve(XtX)%*%t(X)%*%y
an <- a0 + N
# Gibbs sampling functions
PostSig2 <- function(Beta){
  dn <- d0 + t(y - X%*%Beta)%*%(y - X%*%Beta)
  sig2 <- invgamma::rinvgamma(1, shape = an/2, rate = dn/2)
  return(sig2)
}
PostBeta <- function(sig2){
  Bn <- solve(B0i + sig2^(-1)*XtX)
  bn <- Bn%*%(B0i%*%b0 + sig2^(-1)*XtX%*%bhat)
  Beta <- MASS::mvrnorm(1, bn, Bn)
  return(Beta)
}
PostBetas <- matrix(0, mcmc+burnin, k)
PostSigma2 <- rep(0, mcmc+burnin)
Beta <- rep(0, k)
for(s in 1:tot){
  sig2 <- PostSig2(Beta = Beta)
  PostSigma2[s] <- sig2
  Beta <- PostBeta(sig2 = sig2)
  PostBetas[s,] <- Beta
}
keep <- seq((burnin+1), tot, thin)
PosteriorBetas <- PostBetas[keep,]
colnames(PosteriorBetas) <- c("Intercept", "Perf", "Age", "Age2", "NatTeam", "Goals", "Exp", "Exp2")
summary(coda::mcmc(PosteriorBetas))
##
## Iterations = 1:5000
## Thinning interval = 1
## Number of chains = 1
## Sample size per chain = 5000
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## Intercept 3.663230 2.194363 3.103e-02 3.103e-02
## Perf 0.035361 0.004315 6.102e-05 6.102e-05
## Age 0.780374 0.178530 2.525e-03 2.525e-03
## Age2 -0.016641 0.003332 4.713e-05 4.713e-05
## NatTeam 0.850094 0.119093 1.684e-03 1.684e-03
## Goals 0.009164 0.001605 2.270e-05 2.270e-05
## Exp 0.205965 0.062985 8.907e-04 8.596e-04
## Exp2 -0.007006 0.002731 3.862e-05 3.701e-05
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## Intercept -0.579087 2.169333 3.651261 5.108860 8.023949
## Perf 0.027018 0.032474 0.035346 0.038165 0.044079
## Age 0.429084 0.662879 0.781856 0.901172 1.126256
## Age2 -0.023079 -0.018883 -0.016662 -0.014410 -0.010018
## NatTeam 0.621226 0.769287 0.848137 0.930096 1.088729
## Goals 0.006026 0.008065 0.009176 0.010249 0.012240
## Exp 0.080559 0.163623 0.206094 0.248598 0.327669
## Exp2 -0.012354 -0.008885 -0.007009 -0.005166 -0.001629
##
summary(coda::mcmc(PostSigma2[keep]))
## Iterations = 1:5000
## Thinning interval = 1
## Number of chains = 1
## Sample size per chain = 5000
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## 0.973309 0.077316 0.001093 0.001116
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## 0.8361 0.9189 0.9685 1.0228 1.1421
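The 134% effect quoted earlier follows from transforming the NatTeam coefficient: a dummy coefficient \beta translates into a percentage effect of 100(\exp(\beta)-1), and because the transformation is monotone, the credible interval endpoints map through directly. A quick check using the posterior summaries reported above:

```r
# Percentage effect of the NatTeam dummy: 100 * (exp(beta) - 1).
# Posterior mean and 95% quantiles taken from the summary above.
b.mean <- 0.850; b.lo <- 0.621; b.hi <- 1.089
100 * (exp(b.mean) - 1)         # about 134
100 * (exp(c(b.lo, b.hi)) - 1)  # about 86 and 197
```

In practice, one can apply the same transformation to the whole chain, e.g. 100 * (exp(PosteriorBetas[, "NatTeam"]) - 1), and summarize the transformed draws.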