3 When \(y\) is categorical: generalized linear model (GLM)
3.1 Categorical dependent variable
Normal regression requires the dependent variable (DV) \(y\) to be continuous. However, \(y\) can be categorical. To model categorical \(y\) properly, we need a new type of regression model.
3.2 Basic logic of modeling categorical \(y\)
In normal regression, we have
\[y_i=\beta_0+\beta_1x_{1i}+\cdots+\beta_px_{pi}+\epsilon_i,\]
where \(\epsilon_i\in N(0,\sigma^2)\). Since \(\epsilon_i\) is continuous, \(y_i\) must be continuous. Modeling a categorical DV with normal regression, i.e., treating categorical \(y\) as continuous, therefore leads to a poorly specified model and a poor fit.
Whatever the nature of the DV (discrete or continuous), we are interested in the linear relationship between \(y\) and the background variables \(\boldsymbol{x}\). This means that the regression model for discrete \(y\) has a right-hand side (\(\beta_0+\beta_1x_{1i}+\cdots+\beta_px_{pi}\)) similar to that of the regression model for continuous \(y\); the two only differ in the left-hand side.
Because multiple regression requires its left-hand side to be continuous, we have to transform (via a one-to-one, i.e., monotonic, mapping) categorical \(y\) into a continuous counterpart \(y^*\), and put \(y^*\) on the left-hand side instead. This involves two steps:
- model probability: treat categorical \(y\) as observations of a discrete random variable with a probability mass function, and model the category probabilities instead;
- convert probability: map probabilities in \((0,1)\) onto the fully continuous scale \((-\infty,+\infty)\) using a link function.
3.2.1 Probability behind categorical \(y\)
We treat categorical \(y\) as observations of a discrete random variable \(Y\) taking \(K\) particular values \(y_1,\cdots,y_K\), whose distribution is given by the probabilities
\[\begin{cases}P(Y=y_1)\\P(Y=y_2)\\\cdots\\P(Y=y_K)\\\end{cases}.\]
Now we only need to transform the bounded, continuous probabilities \(P\) into an unbounded continuous \(y^*\); this is achieved through a link function.
3.2.2 Link function
A link function \(f\) maps a probability onto the continuous scale, so the resulting regression model is \[f[P(Y=y_k)]=y^*_k=\beta_{0k}+\beta_{1k}x_1+\cdots+\beta_{pk}x_p.\]
The subscript \(k\) implies that there will be \(K-1\) regression models, because there are \(K\) probabilities and they must sum to 1. If \(y\) is dichotomous (1 or 0), \[\begin{align*}
f[P(Y=1)]=y^*_1&=\beta_0+\beta_1x_1+\cdots+\beta_px_p\\
P(Y=1)&=f^{-1}(\beta_0+\beta_1x_1+\cdots+\beta_px_p),
\end{align*}\] then \[P(Y=0) = 1 - f^{-1}(\beta_0+\beta_1x_1+\cdots+\beta_px_p).\] That is, we only need 1 regression model for dichotomous \(y\).
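As a small illustration (a hedged sketch on simulated data, using the logistic link introduced in Section 3.4.2), one fitted model yields \(P(Y=1)\), and \(P(Y=0)\) follows as its complement:
# sketch: a single model for dichotomous y; P(Y = 0) is just the complement
set.seed(1)
n  <- 500
x1 <- rnorm(n)
y  <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.2*x1))  # simulate binary y
fit <- glm(y ~ x1, family = binomial)                     # one regression model
p1 <- predict(fit, type = "response")                     # P(Y = 1)
p0 <- 1 - p1                                              # P(Y = 0), no second model needed
head(cbind(p1, p0))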
If these \(K-1\) models fit the data well, the predicted probabilities as a whole will closely match the observed \(y\). Estimating \(K-1\) regression models can be computationally demanding when \(K\) and the number of predictors are large, especially for nominal probit regression.
The interpretation of regression coefficients is similar to what we’ve learned in the general linear model; the only difference is that the predicted variable is now \(y^*\) instead of \(y\).
Unlike multiple regression, there is no error term in the resulting regression model for categorical \(y\). This point is echoed in many texts,
- Why do we model noise in linear regression but not logistic regression?
- Logistic Regression - Error Term and its Distribution
- Estimating the error term in a logistic regression
- Error distribution for linear and logistic regression
to name a few. Readers interested in a more detailed explanation are referred to Hosmer et al. (2013), page 7.
However, there is also a formulation in which regression models for categorical \(y\) do contain error terms; the resulting models involve latent variables and are called threshold models. Mplus uses this latent variable framework (Muthén, 1998-2004).
- Probit model
- Probit classification model, or probit regression
- Relationship between the latent variable as a function of regressors and the logit model?
- Estimating the error term in a logistic regression
- The Latent Variable Model in Binary Regressions
Actually, the two seemingly conflicting statements can be reconciled: regression models without error terms and threshold models are equivalent (see Section 3.4.1.1 for more detail). For consistency, we use the formulation without error terms throughout this chapter.
3.3 Why distinguish between nominal and ordinal?
TODO: Need an example, could be simulation or mathematical derivation
Categorical \(y\) can be either nominal or ordinal. An ordinal variable contains information about trend. By exploiting this ordering, we can narrow down the scope of alternative hypotheses and increase the power of statistical tests. Models with terms that reflect ordinal characteristics, such as a monotone trend, gain parsimony and power. That is why, whenever the categories have a natural order, it pays to treat \(y\) as ordinal rather than nominal.
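Pending the fuller example promised above, here is a minimal hedged sketch: data are simulated from a proportional-odds (ordinal) model and fitted both as ordinal (MASS::polr) and as nominal (nnet::multinom); the ordinal fit uses fewer parameters and, on such data, typically attains the lower AIC, illustrating the parsimony argument.
# sketch: parsimony of the ordinal model when the data really are ordinal
library(MASS)    # polr
library(nnet)    # multinom
set.seed(1)
n  <- 500
x1 <- rnorm(n)
p1 <- plogis(-1 + 1*x1)   # P(Y <= 1) under a proportional-odds model
p2 <- plogis( 1 + 1*x1)   # P(Y <= 2), same slope, different intercept
y  <- factor(sapply(1:n, \(i) sample(1:3, 1, prob = c(p1[i], p2[i] - p1[i], 1 - p2[i]))))
fit_ord <- polr(y ~ x1)                     # 3 parameters: 2 intercepts + 1 slope
fit_nom <- multinom(y ~ x1, trace = FALSE)  # 4 parameters: 2 intercepts + 2 slopes
AIC(fit_ord)
AIC(fit_nom)   # typically larger here: the extra slope buys little when ordinality holds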
3.4 When \(y\) is nominal
3.4.1 Nominal probit regression
We need a link function that can transform \(P(Y=y) \in(0,1)\) into \(y^*\in(-\infty, +\infty)\), and vice versa. A well-known function meets this requirement: the cumulative distribution function of the standard normal distribution,
\[\phi(z)=\int^{z}_{-\infty}f(x)\,dx=P(Z<z),\]
where \(Z\in N(0,1)\), \(z\) lies in \((-\infty, +\infty)\), \(f(\cdot)\) is the standard normal probability density function, and \(\phi(z)\in(0,1)\). Thus,
\[\begin{align*} z&=\phi^{-1}[P(Z<z)] \end{align*}\] Now we only need to convert \(P(Y=y_k)\) into \(\phi(z_k)=P(Z_k<z_k)\).
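In R, pnorm() plays the role of \(\phi\) and qnorm() the role of \(\phi^{-1}\); a quick check that the pair maps between \((-\infty,+\infty)\) and \((0,1)\):
# phi and phi^{-1}: pnorm() maps (-Inf, Inf) to (0, 1); qnorm() maps back
y_star <- c(-3, -1, 0, 1, 3)
p <- pnorm(y_star)   # probabilities in (0, 1)
qnorm(p)             # recovers y_star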
3.4.1.1 Threshold \(\tau\): the nominal probit regression as a latent variable model
The following text is mainly based on Section 7.1.1 of Agresti (2013).
To model categorical \(y\), we first target the probability \(P(Y=y_k)\) instead of the observed \(y\). We then convert \(P(Y=y_k)\) into a fully continuous \(y^*_k\) using a link function. This is equivalent to a latent variable model assuming that there is an unobserved continuous random variable \(y^*_k\), such that the observed response \(Y\neq y_k\) if \(y^*_k\leq\tau_k\) and \(Y=y_k\) if \(y^*_k>\tau_k\), \[\begin{align*} P(Y=y_k)=P(y^*_k>\tau_k)&=P(\beta_{0k}+\beta_{1k}x_1+\cdots+\beta_{pk}x_p+\epsilon_{k}>\tau_k), \end{align*}\] where the \(\epsilon_{k}\) are independently drawn from \(N(0,\sigma^2)\). \[\begin{align*} P(y^*_k>\tau_k)&=P(-\epsilon_{k}<\beta_{0k}+\beta_{1k}x_1+\cdots+\beta_{pk}x_p-\tau_k)\\ &=P\left(-Z_{\epsilon_k}<\frac{\beta_{0k}+\beta_{1k}x_1+\cdots+\beta_{pk}x_p-\tau_k}{\sigma}\right), \end{align*}\] where \(-\epsilon_k\) has the same distribution as \(\epsilon_k \in N(0,\sigma^2)\), and likewise \(-Z_{\epsilon_k}\) has the same distribution as \(Z_{\epsilon_k} \in N(0,1)\). Now we have successfully converted \(P(Y=y_k)\) into \(\phi(z_k)=P(Z_k<z_k)\), where \(\phi\) is the cumulative distribution function of the standard normal distribution, \(Z_k=-Z_{\epsilon_k}\), and \(z_k=(\beta_{0k}+\beta_{1k}x_1+\cdots+\beta_{pk}x_p-\tau_k)/\sigma\). To make this model identifiable, it is common to arbitrarily set \(\sigma=1\) and \(\tau_k=0\), resulting in \[\begin{align*} P(y^*_k>0)&=P(-Z_{\epsilon_k}<\beta_{0k}+\beta_{1k}x_1+\cdots+\beta_{pk}x_p)\\ &=\phi(\beta_{0k}+\beta_{1k}x_1+\cdots+\beta_{pk}x_p). \end{align*}\] This is why we do not need to estimate \(\sigma\) in probit regression, unlike in normal regression. Statistical tools with this setting (e.g., the glm function in R) report estimated intercepts but no estimated thresholds.
Some statistical software does not include an intercept on the right-hand side; therefore \[\begin{align*} P(y^*_k>\tau_k)&=P(\beta_{1k}x_1+\cdots+\beta_{pk}x_p+\epsilon_{k}>\tau_k)\\ &=P(-\epsilon_{k}<-\tau_k+\beta_{1k}x_1+\cdots+\beta_{pk}x_p)\\ &=P\left(-Z_{\epsilon_k}<\frac{-\tau_k+\beta_{1k}x_1+\cdots+\beta_{pk}x_p}{\sigma}\right)\\ &=\phi(-\tau_k+\beta_{1k}x_1+\cdots+\beta_{pk}x_p). \end{align*}\] Then \(-\tau_k\) plays the role of the intercept, and we only need to set \(\sigma=1\) to make the model identifiable. Statistical tools with this setting (e.g., Mplus) report estimated thresholds but no estimated intercepts.
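To make the two parameterizations concrete, here is a small hedged sketch on simulated binary data: glm() with a probit link reports an intercept, and under the no-intercept threshold parameterization the corresponding threshold is simply the negative of that intercept (Mplus output itself is not reproduced here).
# sketch: intercept vs. threshold parameterization of a binary probit model
set.seed(1)
n  <- 2000
x1 <- rnorm(n)
y  <- rbinom(n, 1, pnorm(0.4 + 0.8*x1))   # true beta_0 = 0.4, beta_1 = 0.8
fit <- glm(y ~ x1, family = binomial(link = "probit"))
coef(fit)                                 # intercept parameterization (sigma fixed at 1)
-coef(fit)["(Intercept)"]                 # corresponding threshold: tau = -beta_0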
3.4.2 Nominal logistic regression
Nominal logistic regression, also known as baseline-category logistic regression, uses another link function: the log odds.
For a discrete random variable \(Y\) with \(K\) values, the odds of \(y_1\) is defined as
\[Odd[P(Y=y_{1})]=\frac{P(Y=y_1)}{P(Y=y_K)},\]
where \(P(Y=y_K)\) is the reference probability of the baseline category. We usually use the last category or the most common one as the baseline; the choice is arbitrary.
The odds measures \(P(Y=y_1)\) using the reference probability as the unit. It is easy to see that \(Odd\in(0,+\infty)\); the odds therefore maps a probability in \((0,1)\) onto \((0,+\infty)\), and the last step is to further convert the odds onto \((-\infty,+\infty)\). The natural logarithm perfectly fulfills this need. The resulting logistic regression is \[\begin{align*} \log\left[\frac{P(Y=y_1)}{P(Y=y_K)}\right]&=\beta_{01}+\beta_{11}x_1+\cdots+\beta_{p1}x_p\\ P(Y=y_1)&=P(Y=y_K)\exp(\beta_{01}+\beta_{11}x_1+\cdots+\beta_{p1}x_p)\\ &=P(Y=y_K)\exp([1,\boldsymbol{x}']\boldsymbol{\beta}_1). \end{align*}\] It is easy to see that \[\begin{align*} P(Y=y_2)&=P(Y=y_K)\exp([1,\boldsymbol{x}']\boldsymbol{\beta}_2)\\ &\quad\cdots\\ P(Y=y_{K-1})&=P(Y=y_K)\exp([1,\boldsymbol{x}']\boldsymbol{\beta}_{K-1}). \end{align*}\] Because \(\sum_{k=1}^{K}P(Y=y_k)=1\), we have \[\begin{align*} P(Y=y_K)[\exp([1,\boldsymbol{x}']\boldsymbol{\beta}_1)+\cdots+\exp([1,\boldsymbol{x}']\boldsymbol{\beta}_{K-1})]+P(Y=y_K)=1, \end{align*}\] thus \[\begin{align*} P(Y=y_K)&=\frac{1}{1+\exp([1,\boldsymbol{x}']\boldsymbol{\beta}_1)+\cdots+\exp([1,\boldsymbol{x}']\boldsymbol{\beta}_{K-1})}. \end{align*}\]
For dichotomous \(y\), \[Odd[P(Y=1)]=\frac{P(Y=1)}{1- P(Y=1)}.\]
The regression model is \[\begin{align*} \log\left[\frac{P(Y=1)}{1-P(Y=1)}\right]&=\beta_0+\beta_1x_1+\cdots+\beta_px_p\\ \frac{P(Y=1)}{1-P(Y=1)}&=\exp(\beta_0+\beta_1x_1+\cdots+\beta_px_p)\\ P(Y=1)&=[1-P(Y=1)]\exp(\beta_0+\beta_1x_1+\cdots+\beta_px_p)\\ P(Y=1)&=\frac{\exp(\beta_0+\beta_1x_1+\cdots+\beta_px_p)}{1+\exp(\beta_0+\beta_1x_1+\cdots+\beta_px_p)}\\ &=\frac{1}{1+\exp(-\beta_0-\beta_1x_1-\cdots-\beta_px_p)}. \end{align*}\]
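The final expression is the inverse-logit transformation; in R it is available as plogis(), so the two ways of computing \(P(Y=1)\) agree (a small numeric check):
# inverse logit: P(Y = 1) from the linear predictor beta_0 + beta_1*x_1 + ...
eta <- c(-2, 0, 2)   # example values of the linear predictor
1/(1 + exp(-eta))    # by hand, as in the last line above
plogis(eta)          # identical, via the logistic CDF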
Both nominal logistic regression and nominal probit regression can be estimated by maximum likelihood; the details are omitted here.
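For readers curious about the omitted detail, a hedged sketch for the dichotomous case: the log-likelihood is \(\sum_i [y_i\log p_i+(1-y_i)\log(1-p_i)]\), with \(p_i\) given by the inverse link, and it can be maximized numerically; the estimates essentially match those from glm().
# sketch: maximum likelihood for binary logistic regression via optim()
set.seed(1)
n  <- 1000
x1 <- rnorm(n)
y  <- rbinom(n, 1, plogis(-0.5 + 1*x1))
negloglik <- function(beta) {
  p <- plogis(beta[1] + beta[2]*x1)     # inverse link
  -sum(y*log(p) + (1 - y)*log(1 - p))   # negative log-likelihood
}
optim(c(0, 0), negloglik)$par           # ML estimates of (beta_0, beta_1)
coef(glm(y ~ x1, family = binomial))    # glm() gives essentially the same values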
3.4.2.1 Threshold \(\tau_k\): the nominal logistic regression as a latent variable model
Similar to nominal probit regression, nominal logistic regression can also be viewed as a latent variable model. The only difference is that the \(\epsilon_{k}\) are now independently drawn from the standard logistic distribution: \[\begin{align*} P(Y=y_k)=P(y^*_k>\tau_k)&=P(\beta_{0k}+\beta_{1k}x_1+\cdots+\beta_{pk}x_p+\epsilon_{k}>\tau_k)\\ &=P\left(\frac{-\epsilon_{k}}{\sigma}<\frac{-\tau_k+\beta_{0k}+\beta_{1k}x_1+\cdots+\beta_{pk}x_p}{\sigma}\right)\\ &=G(\beta_{0k}+\beta_{1k}x_1+\cdots+\beta_{pk}x_p-\tau_k), \end{align*}\] where \(\tau_k=0\) for identifiability and \(G\) is the cumulative distribution function of the standard logistic distribution. Although the standard logistic distribution has \(\mu=0\) and \(\sigma^2=\pi^2/3\), \(\sigma^2\) only affects the scaling of the parameters and not the predicted probabilities, so we simply fix the scale at 1.
3.4.2.2 How to simulate data for nominal logistic regression (optional)
Because we model the probabilities \(P(Y=y_k)\) directly, where \(k=1,\cdots,K\), we are actually assuming that the observed data follow a multinomial distribution with the given probability mass function. Therefore, to simulate data for baseline-category logistic regression, we can use the following steps (assume the \(K\)th category is the baseline):
- Specify the number of categories \(K\) and the regression coefficients \(\boldsymbol{\beta}_k\) for each category \(k=1,2,\cdots,K-1\).
- For each observation, calculate the \(\log Odd_{ki}\) as \(\beta_{0k} + \beta_{1k}x_{1i} + \cdots + \beta_{pk}x_{pi}\) for each category \(k=1,2,\cdots,K-1\).
- Compute the exponentiated \(\log Odd\)s as \(\exp (\log Odd_{ki})\) for each category \(k=1,2,\cdots,K-1\).
- Calculate the probability of each category as follows:
- \(P(Y=y_K) = \frac{1}{1 + \sum_{k=1}^{K-1}\exp{(\log Odd_{ki}})}\),
- \(P(Y=y_k) = P(Y=y_K)\exp(\log Odd_{ki}) \quad \text{for } k=1,2,\cdots,K-1\).
- Generate a random number \(u\) from a uniform distribution \(U(0,1)\) for each observation.
- Assign the category \(Y\) based on the cumulative probabilities:
- If \(u < P(Y=y_1)\), then \(Y = y_1\).
- If \(P(Y=y_1) \leq u < P(Y=y_1) + P(Y=y_2)\), then \(Y = y_2\).
- Continue this process until the last category \(y_K\).
- or simulate the category directly from a multinomial distribution with the calculated probabilities.
set.seed(123)
n <- 1000
K <- 3 # number of categories
x1 <- rnorm(n)
x2 <- rnorm(n)
beta_0 <- c(-1, 0.5)
beta_1 <- c(0.5, -0.2)
beta_2 <- c(0.3, 0.4)
# log odds
log_odds_1 <- beta_0[1] + beta_1[1]*x1 + beta_2[1]*x2
log_odds_2 <- beta_0[2] + beta_1[2]*x1 + beta_2[2]*x2
p_y3 <- 1 / (1 + exp(log_odds_1) + exp(log_odds_2))
p_y1 <- p_y3 * exp(log_odds_1)
p_y2 <- p_y3 * exp(log_odds_2)
# sample y based on probabilities
# method 1, compare random number from uniform distribution with probabilities
y <- sapply(1:n, \(i) {
u <- runif(1)
if (u < p_y1[i]) {
return(1)
} else if (u < p_y1[i] + p_y2[i]) {
return(2)
} else {
return(3)
}
})
# or method 2, generate random number from multinomial distribution with probabilities
# y <- apply(cbind(p_y1, p_y2, p_y3), 1, \(p) {
# sample(1:K, size = 1, prob = p)
# })
df_sim <- data.frame(y = factor(y), x1 = x1, x2 = x2)
head(df_sim)
3.4.2.3 Odds ratio of nominal logistic regression
The odds ratio in nominal logistic regression can be interpreted as the multiplicative change in the predicted odds associated with a one-unit increase in \(x_j\), holding the other background variables constant, \[\begin{align} OR[P(Y=y_1)|_{x_j}]=\frac{Odd[P(Y=y_1|x_j+1)]}{Odd[P(Y=y_1|x_j)]}. \end{align}\] Because \[\begin{align*} Odd[P(Y=y_1)]&= \frac{P(Y=y_1)}{P(Y=y_K)}\\ &= \frac{P(Y=y_K)\exp(\beta_{01}+\beta_{11}x_1+\cdots+\beta_{p1}x_p)}{P(Y=y_K)}\\ &=\exp(\beta_{01}+\beta_{11}x_1+\cdots+\beta_{p1}x_p), \end{align*}\] then \[\begin{align*} OR[P(Y=y_1)|_{x_j}]&=\frac{\exp[\beta_{01}+\beta_{11}x_1+\ldots+\beta_{j1}(x_j+1)+\ldots+\beta_{p1}x_p]}{\exp(\beta_{01}+\beta_{11}x_1+\ldots+\beta_{j1}x_j+\ldots+\beta_{p1}x_p)}\\ &=\exp(\beta_{j1}). \end{align*}\]
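In practice, the odds ratios of a fitted baseline-category model are obtained by exponentiating its coefficients; a small hedged sketch with nnet::multinom (toy data with no real effects, just to show the call):
# sketch: odds ratios as exponentiated coefficients of a baseline-category model
library(nnet)
set.seed(1)
n  <- 500
x1 <- rnorm(n)
y  <- factor(sample(1:3, n, replace = TRUE))   # toy outcome, no real relationship
fit <- multinom(y ~ x1, trace = FALSE)
exp(coef(fit))   # rows: non-baseline categories; the x1 column is the odds ratio per unit of x1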
3.4.2.4 Visualization of probabilities: nominal logistic regression
The following visualizes the nominal logistic regression model for ex.3.6.
# individual category probability
n_x <- 10000
x <- seq(-4, 4, length.out = n_x)
n_row <- sqrt(n_x)
x1 <- x3 <- x
p_u1_2 <- 1/(
1 + exp(-0.749 + 0.769*x1 + 2.259*x3) +
exp(0.262 + 0.280*x1 + 0.885*x3)
)
p_u1_0 <- p_u1_2*exp(-0.749 + 0.769*x1 + 2.259*x3)
p_u1_1 <- p_u1_2*exp(0.262 + 0.280*x1 + 0.885*x3)
p <- cbind(p_u1_0, p_u1_1, p_u1_2)
plot_ly(x = x1, y = x3) |>
add_surface(z = matrix(p[, 1], nrow = n_row, ncol = n_row)) |>
hide_colorbar() |>
add_surface(
z = matrix(p[, 2], nrow = n_row, ncol = n_row),
opacity = 0.8
) |>
hide_colorbar() |>
add_surface(
z = matrix(p[, 3], nrow = n_row, ncol = n_row),
opacity = 0.2
)
p_cum <- t(sapply(1:10000, \(x) {cumsum(p[x, 1:2])}))
plot_ly(x = x1, y = x3) |>
add_surface(
z = matrix(p_cum[, 1], nrow = n_row, ncol = n_row)
) |>
hide_colorbar() |>
add_surface(
z = matrix(p_cum[, 2], nrow = n_row, ncol = n_row),
opacity = 0.2
)
3.5 When \(y\) is ordinal
3.5.1 Cumulative probability
Because the categories of \(y\) now have an order, for a \(K\)-level ordinal \(y\), \[\begin{align*} \begin{cases} y=1,\text{ if }y^*\leq\tau_1\\ y=2,\text{ if }\tau_1< y^*\leq\tau_2\\ \cdots\\ y=K,\text{ if }\tau_{K-1}<y^*. \end{cases} \end{align*}\] The cumulative probability is \[P(Y\le y_k)=P(Y=y_1)+\cdots+P(Y=y_k).\]
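The thresholding can be made concrete in R with cut(), which applies exactly this partition of the latent scale (arbitrary illustrative thresholds, assuming \(K=3\)):
# sketch: thresholds partition the latent continuum into ordered categories (K = 3)
tau    <- c(-1, 0.5)                                   # arbitrary tau_1 < tau_2
y_star <- c(-2.3, -0.4, 0.1, 0.8, 1.7)                 # some latent y* values
cut(y_star, breaks = c(-Inf, tau, Inf), labels = 1:3)  # observed ordinal y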
3.5.2 Ordinal probit regression
In the nominal case, each \(P(Y=y_k)\) is treated as a cumulative probability of its own standard normal random variable \(Z_k\). The modeled probabilities are now cumulative already; we are effectively treating \(P(Y\leq y_1),P(Y\leq y_2),\cdots,P(Y\leq y_{K-1})\) as cumulative probabilities of a common random variable \(Z\), which yields a more parsimonious model:
\[\phi^{-1}[P(Y\leq y_k)]=y^*=\beta_{0k}+\beta_1x_1+\cdots+\beta_px_p.\]
3.5.2.1 Threshold \(\tau_k\): the ordinal probit regression as a latent variable model
When \(y\) is ordinal, we model \(P(Y\leq y_k)\) as the cumulative probability of a latent variable \(y^*\) with thresholds \(\tau_k\): \[\begin{align*} \phi^{-1}[P(Y\leq y_k)]&=\phi^{-1}[P(y^*<\tau_k)], \\ P(y^*<\tau_k)&=P(\beta_{0k}+\beta_1x_1+\cdots+\beta_px_p+\epsilon<\tau_k), \end{align*}\] where \(\epsilon\in N(0,\sigma^2)\), \[\begin{align*} P(y^*<\tau_k)&=P(\epsilon<\tau_k-\beta_{0k}-\beta_1x_1-\cdots-\beta_px_p)\\ &=P\left(\frac{\epsilon}{\sigma}<\frac{\tau_k-\beta_{0k}-\beta_1x_1-\cdots-\beta_px_p}{\sigma}\right). \end{align*}\] To make this model identifiable, let \(\sigma=1\) and \(\tau_k=0\); then \[\begin{align*} P(y^*<\tau_k)=P(Z_{\epsilon}<-\beta_{0k}-\beta_1x_1-\cdots-\beta_px_p). \end{align*}\] Note that, under this formulation, the regression coefficients enter the cumulative-probability equation with negative signs.
In statistical software (e.g., Mplus) that fits the ordinal probit model without an intercept, \(\tau_k\) takes the place of the intercept: \[\begin{align*} P(y^*<\tau_k)=P(Z_{\epsilon}<\tau_k-\beta_1x_1-\cdots-\beta_px_p). \end{align*}\]
\(\tau_k\) in ordinal logistic regression plays the same role as in ordinal probit regression and is thus omitted here.
3.5.3 Ordinal logistic regression
The odds in ordinal logistic regression becomes
\[Odd[P(Y\leq y_k)]=\frac{P(Y\leq y_k)}{1 - P(Y\leq y_k)}.\]
A model for \(\log Odd[P(Y\leq y_k)]\) alone is an ordinary logistic model for a binary response in which categories 1 to \(k\) form one outcome and categories \(k+1\) to \(K\) form the second. The model that simultaneously uses all \(K-1\) cumulative \(\log Odd\)s in a single parsimonious model is \[\begin{align*} \log\left[\frac{P(Y\leq y_k)}{1 - P(Y\leq y_k)}\right]&=\beta_{0k}+\beta_1x_1+\cdots+\beta_px_p. \end{align*}\] Each cumulative \(\log Odd\) has its own intercept. \(\beta_{0k}\) increases in \(k\), because \(P(Y\leq y_k)\) increases in \(k\) given \(\boldsymbol{x}\) and \(\log Odd[P(Y\leq y_k)]\) is an increasing function of \(P(Y\leq y_k)\).
The model above assumes that the slopes remain the same across categories. This is called the parallel slopes assumption, also known as the proportional odds assumption, and can be seen clearly in Section 3.5.3.3.
It is easy to see that \[\begin{align*} \frac{P(Y\leq y_k)}{1-P(Y\leq y_k)}&=\exp(\beta_{0k}+\beta_1x_1+\cdots+\beta_px_p)\\ P(Y\leq y_k)&=\frac{\exp(\beta_{0k}+\beta_1x_1+\cdots+\beta_px_p)}{1+\exp(\beta_{0k}+\beta_1x_1+\cdots+\beta_px_p)}\\ 1-P(Y\leq y_k)&=\frac{1}{1+\exp(\beta_{0k}+\beta_1x_1+\cdots+\beta_px_p)}\\ Odd[P(Y\leq y_k)]&=\exp(\beta_{0k}+\beta_1x_1+\cdots+\beta_px_p). \end{align*}\]
To calculate the individual category probabilities, we have \[\begin{align*} P(Y=y_1)&=P(Y\leq y_1)\\ P(Y=y_2)&=P(Y\leq y_2)-P(Y\leq y_1)\\ &\quad\cdots\\ P(Y=y_K)&=1-P(Y\leq y_{K-1}). \end{align*}\]
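In code this conversion is just successive differencing of the cumulative probabilities, e.g., with diff() (a tiny numeric sketch for \(K=4\)):
# sketch: individual category probabilities from cumulative ones (K = 4)
p_cum <- c(0.2, 0.5, 0.9, 1)   # P(Y <= y_1), ..., P(Y <= y_4) = 1
diff(c(0, p_cum))              # P(Y = y_1), ..., P(Y = y_4)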
3.5.3.1 Odds ratio of ordinal logistic regression
Similar to nominal logistic regression, we have \[\begin{align*} OR[P(Y\leq y_1|x_1)]&=\frac{Odd[P(Y\leq y_1|x_1+1)]}{Odd[P(Y\leq y_1|x_1)]}\\ &=\frac{\exp[\beta_{01}+\beta_1(x_1+1)+\cdots+\beta_px_p]}{\exp(\beta_{01}+\beta_1x_1+\cdots+\beta_px_p)}\\ &=\exp(\beta_1). \end{align*}\]
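A hedged sketch of the same calculation with MASS::polr (used later in this chapter): because polr parameterizes the cumulative logit as \(\zeta_k-\boldsymbol{x}'\boldsymbol{\beta}\), the odds ratio for \(P(Y\leq y_k)\) per unit of \(x_1\) is \(\exp(-\hat\beta_1)\).
# sketch: odds ratio from a cumulative logistic model fitted with MASS::polr
library(MASS)
set.seed(1)
n  <- 500
x1 <- rnorm(n)
p1 <- plogis(-1 + 0.7*x1)   # P(Y <= 1); true slope 0.7, so true OR = exp(0.7)
p2 <- plogis( 1 + 0.7*x1)   # P(Y <= 2)
y  <- factor(sapply(1:n, \(i) sample(1:3, 1, prob = c(p1[i], p2[i] - p1[i], 1 - p2[i]))))
fit <- polr(y ~ x1)
exp(-coef(fit))             # polr models zeta_k - beta*x1, so exp(-beta) estimates exp(0.7)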
3.5.3.2 How to simulate data for cumulative logistic regression (optional)
To simulate data for cumulative logistic regression, we can use the following steps:
- Specify the number of categories \(K\), the intercepts \(\beta_{0k}\) for \(k=1,2,\cdots,K-1\), and the common slopes \(\boldsymbol{\beta}\).
- For each observation, calculate the cumulative \(\log Odd_{ki}\) as \(\beta_{0k} + \beta_{1}x_{1i} + \cdots + \beta_{p}x_{pi}\) for each category \(k=1,2,\cdots,K-1\).
- Compute \(\exp (\log Odd_{ki})\) for each category \(k=1,2,\cdots,K-1\).
- Calculate the cumulative probability of each category as follows:
- \(P(Y\leq y_K) = 1\),
- \(P(Y\leq y_k) = \frac{\exp(\log Odd_{ki})}{1 + \exp(\log Odd_{ki})}\quad \text{for } k=1,2,\cdots,K-1\).
- Generate a random number \(u\) from a uniform distribution \(U(0,1)\) for each observation.
- Assign the category \(Y\) based on the cumulative probabilities:
- If \(u < P(Y\leq y_1)\), then \(Y = y_1\).
- If \(P(Y\leq y_1) \leq u < P(Y\leq y_2)\), then \(Y = y_2\).
- Continue this process until the last category \(y_K\).
set.seed(123)
n <- 1000
K <- 3 # number of categories
x1 <- rnorm(n)
x2 <- rnorm(n)
beta_0 <-c(-1, 0.5)
beta_1 <- 0.5
beta_2 <- 0.3
# cumulative log odds
log_odds_1 <- beta_0[1] + beta_1*x1 + beta_2*x2
log_odds_2 <- beta_0[2] + beta_1*x1 + beta_2*x2
p_y1 <- exp(log_odds_1) / (1 + exp(log_odds_1))
p_y2_cum <- exp(log_odds_2) / (1 + exp(log_odds_2))
p_y2 <- p_y2_cum - p_y1
p_y3 <- 1 - p_y2_cum
# sample y based on cumulative probabilities
y <- sapply(1:n, \(i) {
u <- runif(1)
if (u < p_y1[i]) {
return(1)
} else if (u < p_y2_cum[i]) { # compare with the cumulative probability P(Y <= 2)
return(2)
} else {
return(3)
}
})
# or simulate random number from multinomial distribution with probabilities
# y <- apply(cbind(p_y1, p_y2, p_y3), 1, \(p) {
# sample(1:K, size = 1, prob = p)
# })
df_sim <- data.frame(y = factor(y), x1 = x1, x2 = x2)
head(df_sim)
3.5.3.3 Visualization of probabilities: ordinal logistic regression
The following visualizes the ordinal logistic regression model for ex.3.6.
# the polr function of the MASS package gives results identical to Mplus
df <- read.table("data/ex3.6.dat", col.names = c("u1", "x1", "x3"))
df$u1 <- factor(df$u1)
model_ex3.6 <- polr(u1 ~ x1 + x3, data = df, method = "logistic", Hess = TRUE)
summary(model_ex3.6)
#> Call:
#> polr(formula = u1 ~ x1 + x3, data = df, Hess = TRUE, method = "logistic")
#>
#> Coefficients:
#> Value Std. Error t value
#> x1 -0.481 0.09423 -5.105
#> x3 -1.339 0.11521 -11.622
#>
#> Intercepts:
#> Value Std. Error t value
#> 0|1 -1.4292 0.1244 -11.4866
#> 1|2 0.7587 0.1093 6.9395
#>
#> Residual Deviance: 878.8875
#> AIC: 886.8875
n_x <- 10000
x <- seq(-4, 4, length.out = n_x)
n_row <- sqrt(n_x)
x1 <- x3 <- x
exp_u1_0 <- exp(-1.429 + 0.481*x1 + 1.339*x3)
exp_u1_1 <- exp(0.759 + 0.481*x1 + 1.339*x3)
p_u1_0 <- exp_u1_0/(1 + exp_u1_0)
p_u1_1_cum <- exp_u1_1/(1 + exp_u1_1)
p_u1_1_ind <- p_u1_1_cum - p_u1_0
p_u1_2_ind <- 1 - p_u1_0 - p_u1_1_ind
p_ind <- cbind(p_u1_0, p_u1_1_ind, p_u1_2_ind)
p_cum <- cbind(p_u1_0, p_u1_1_cum)
plot_ly(x = x1, y = x3) |>
add_surface(
z = matrix(p_cum[, 1], nrow = n_row, ncol = n_row)
) |>
hide_colorbar() |>
add_surface(
z = matrix(p_cum[, 2], nrow = n_row, ncol = n_row),
opacity = 0.2
)
plot_ly(x = x1, y = x3) |>
add_surface(
z = matrix(p_ind[, 1], nrow = n_row, ncol = n_row)
) |>
hide_colorbar() |>
add_surface(
z = matrix(p_ind[, 2], nrow = n_row, ncol = n_row),
opacity = 0.8
) |>
hide_colorbar() |>
add_surface(
z = matrix(p_ind[, 3], nrow = n_row, ncol = n_row),
opacity = 0.2
)
3.6 Interaction effect
Both nominal and ordinal regression models (probit or logistic) can include interaction terms among background variables. The incorporation of interaction effects is similar to what we’ve learned in Section 1.3. The only difference is that we now study the effect of the product term on \(y^*\) instead of on the original \(y\).
To keep this section compact, we only use logistic regression (nominal and ordinal) as the example.
3.6.1 When \(x\)s are continuous
3.6.1.1 Nominal logistic regression
For nominal logistic regression with a 3-level categorical \(y\), two continuous \(x\)s, and an interaction term, we have \[\begin{align*} \log\left[\frac{P(Y=y_k)}{P(Y=y_K)}\right]&=\beta_{0k}+\beta_{1k}x_1+\beta_{2k}x_2+\beta_{3k}x_1x_2, \end{align*}\] where \(k=1,\ldots,K-1\). If \(\beta_{3k}\) is significant, there is an interaction effect between \(x_1\) and \(x_2\) on the log odds of \(P(Y=y_k)\) vs. \(P(Y=y_K)\).
# Simulate data, x1 and x2 are continuous, y is a 3-level nominal variable
# because the nnet::multinom function uses the 1st level of factor y as the reference,
# we have to match this setting when simulating data
set.seed(123)
n <- 1000
x1 <- rnorm(n)
x2 <- rnorm(n)
# Coefficients
beta_0 <- c(-1, 0.5)
beta_1 <- c(0.8, -0.5)
beta_2 <- c(0.6, 0.3)
beta_3 <- c(1.2, -1)
# Calculate probabilities
log_odds_2 <- beta_0[1] + beta_1[1]*x1 + beta_2[1]*x2 + beta_3[1]*x1*x2
log_odds_3 <- beta_0[2] + beta_1[2]*x1 + beta_2[2]*x2 + beta_3[2]*x1*x2
# Convert log odds to probabilities
exp_log_odds_2 <- exp(log_odds_2)
exp_log_odds_3 <- exp(log_odds_3)
p_y1 <- 1/(1 + exp_log_odds_2 + exp_log_odds_3)
p_y2 <- p_y1*exp_log_odds_2
p_y3 <- p_y1*exp_log_odds_3
# Simulate categorical outcome
y <- apply(cbind(p_y1, p_y2, p_y3), 1, \(p) {
sample(1:3, size = 1, prob = p)
})
data <- data.frame(y = factor(y), x1 = x1, x2 = x2)
write.table(
data, file = "data/1y nominal 2xs continuous.txt",
row.names = FALSE, col.names = FALSE
)
# Fit nominal logistic regression model
model_nominal_x_continuous <- multinom(y ~ x1 * x2, data = data)
#> # weights: 15 (8 variable)
#> initial value 1098.612289
#> iter 10 value 779.940366
#> final value 766.026991
#> converged
summary(model_nominal_x_continuous)
#> Call:
#> multinom(formula = y ~ x1 * x2, data = data)
#>
#> Coefficients:
#> (Intercept) x1 x2 x1:x2
#> 2 -1.1965747 0.8552350 0.7194336 1.377557
#> 3 0.5978433 -0.4946647 0.3361974 -1.046971
#>
#> Std. Errors:
#> (Intercept) x1 x2 x1:x2
#> 2 0.14294216 0.1427677 0.13508648 0.1665136
#> 3 0.08191535 0.1066762 0.09123945 0.1364319
#>
#> Residual Deviance: 1532.054
#> AIC: 1548.054
# plot interaction effect, or more specifically, the effect of x1 at different levels of x2
x2_plot <- seq(-3, 3, length.out = 100)
beta_x1_y2 <- coef(model_nominal_x_continuous)["2", "x1"] +
coef(model_nominal_x_continuous)["2", "x1:x2"] * x2_plot
beta_x1_y3 <- coef(model_nominal_x_continuous)["3", "x1"] +
coef(model_nominal_x_continuous)["3", "x1:x2"] * x2_plot
df_plot <- data.frame(
x2 = rep(x2_plot, 2),
beta_x1 = c(beta_x1_y2, beta_x1_y3),
y_level = factor(rep(c("y=2", "y=3"), each = length(x2_plot)))
)
p <- ggplot(df_plot, aes(x = x2, y = beta_x1, color = y_level)) +
geom_line(linewidth = 1) +
labs(
title = "Effect of x1 on log odds of y=2 and y=3",
x = "x2",
y = "Effect (slope) of x1"
)
ggplotly(p)
3.6.1.2 Ordinal logistic regression
For ordinal logistic regression with two continuous \(x\)s and their interaction term, we have \[\begin{align*} \log\left[\frac{P(Y\leq y_k)}{1 - P(Y\leq y_k)}\right]&=\beta_{0k}+\beta_{1}x_1+\beta_{2}x_2+\beta_{3}x_1x_2, \end{align*}\] where \(k=1,\ldots,K-1\). If \(\beta_{3}\) is significant, there is an interaction effect between \(x_1\) and \(x_2\) on the log odds of \(P(Y\leq y_k)\).
# Simulate data, x1 and x2 are continuous, y is a 3-level ordinal variable
set.seed(123)
n <- 1000
x1 <- rnorm(n)
x2 <- rnorm(n)
# Coefficients; note that only the intercepts vary across the K-1 models,
# slopes remain the same
beta_0 <- c(-1, 0.5)
beta_1 <- 0.8
beta_2 <- 0.6
beta_3 <- 1.2
# Calculate cumulative log odds
log_odds_1 <- beta_0[1] + beta_1*x1 +
beta_2*x2 + beta_3*x1*x2
log_odds_2 <- beta_0[2] + beta_1*x1 +
beta_2*x2 + beta_3*x1*x2
# Convert cumulative log odds to probabilities
p_y1 <- exp(log_odds_1) / (1 + exp(log_odds_1))
p_y2 <- exp(log_odds_2) / (1 + exp(log_odds_2)) - p_y1
p_y3 <- 1 - p_y1 - p_y2
# Simulate categorical outcome
y <- apply(cbind(p_y1, p_y2, p_y3), 1, \(p) {
sample(1:3, size = 1, prob = p)
})
data <- data.frame(y = factor(y), x1 = x1, x2 = x2)
write.table(
data, file = "data/1y ordinal 2xs continuous.txt",
row.names = FALSE, col.names = FALSE
)
# Fit ordinal logistic regression model
model_ordinal_x_continous <- polr(y ~ x1 * x2, data = data, Hess = TRUE)
summary(model_ordinal_x_continous)
#> Call:
#> polr(formula = y ~ x1 * x2, data = data, Hess = TRUE)
#>
#> Coefficients:
#> Value Std. Error t value
#> x1 -0.8612 0.08083 -10.654
#> x2 -0.6377 0.07518 -8.483
#> x1:x2 -1.2883 0.09921 -12.986
#>
#> Intercepts:
#> Value Std. Error t value
#> 1|2 -1.0830 0.0825 -13.1241
#> 2|3 0.5346 0.0746 7.1685
#>
#> Residual Deviance: 1763.185
#> AIC: 1773.185
# plot interaction effect
x2_plot <- seq(-3, 3, length.out = 100)
beta_x1 <- coef(model_ordinal_x_continous)["x1"] +
coef(model_ordinal_x_continous)["x1:x2"] * x2_plot
df_plot <- data.frame(
x2 = x2_plot,
beta_x1 = beta_x1
)
p <- ggplot(df_plot, aes(x = x2, y = beta_x1)) +
geom_line(linewidth = 1, color = "blue") +
labs(
title = "Effect of x1 on log odds of Y",
x = "x2",
y = "Effect (slope) of x1"
)
ggplotly(p)
3.6.2 When \(x\)s are categorical
Both nominal and ordinal regression models can include categorical background variables, as well as their interaction terms. Again, as we learned in the earlier coding section, to incorporate a categorical background variable into a regression model we first need a coding scheme, such as dummy coding or effect coding, to convert the categorical variable into a set of binary variables.
Likewise, we only use logistic regression (nominal and ordinal) as the example in this subsection.
3.6.2.1 Nominal logistic regression
For example, for nominal logistic regression with two categorical \(x\)s (each with 3 categories, dummy coded with the 1st category as the reference) and their interaction terms, we have \[\begin{align*} \log\left[\frac{P(Y=y_k)}{P(Y=y_K)}\right]&=\beta_{0k}+\\ &\quad\beta_{1k}d2x_1+\beta_{2k}d3x_1+\beta_{3k}d2x_2+\beta_{4k}d3x_2+\\ &\quad\beta_{5k}d2x_1d2x_2+\beta_{6k}d2x_1d3x_2+\beta_{7k}d3x_1d2x_2+\beta_{8k}d3x_1d3x_2. \end{align*}\]
# Simulate data, x1 and x2 are categorical, y is a 3-level nominal variable
set.seed(123)
n <- 5000
x1 <- factor(sample(1:3, n, replace = TRUE))
x2 <- factor(sample(1:3, n, replace = TRUE))
d2x1 <- ifelse(x1 == 2, 1, 0)
d3x1 <- ifelse(x1 == 3, 1, 0)
d2x2 <- ifelse(x2 == 2, 1, 0)
d3x2 <- ifelse(x2 == 3, 1, 0)
d2x1d2x2 <- d2x1 * d2x2
d2x1d3x2 <- d2x1 * d3x2
d3x1d2x2 <- d3x1 * d2x2
d3x1d3x2 <- d3x1 * d3x2
# Coefficients
beta_0 <- c(1, -0.5)
beta_1 <- c(0.2, 0.6)
beta_2 <- c(0.4, 0.2)
beta_3 <- c(-0.5, 0.3)
beta_4 <- c(0.4, 0.7)
beta_5 <- c(0.8, 0.2)
beta_6 <- c(0.6, -0.3)
beta_7 <- c(0.5, -0.5)
beta_8 <- c(-0.4, 1)
# Calculate probabilities for nominal logistic regression
log_odds_2 <- beta_0[1] + beta_1[1]*d2x1 + beta_2[1]*d3x1 + beta_3[1]*d2x2 + beta_4[1]*d3x2 +
beta_5[1]*d2x1d2x2 + beta_6[1]*d2x1d3x2 +
beta_7[1]*d3x1d2x2 + beta_8[1]*d3x1d3x2
log_odds_3 <- beta_0[2] + beta_1[2]*d2x1 + beta_2[2]*d3x1 + beta_3[2]*d2x2 + beta_4[2]*d3x2 +
beta_5[2]*d2x1d2x2 + beta_6[2]*d2x1d3x2 +
beta_7[2]*d3x1d2x2 + beta_8[2]*d3x1d3x2
# Convert log odds to probabilities
exp_log_odds_2 <- exp(log_odds_2)
exp_log_odds_3 <- exp(log_odds_3)
p_y1 <- 1/(1 + exp_log_odds_2 + exp_log_odds_3)
p_y2 <- p_y1*exp_log_odds_2
p_y3 <- p_y1*exp_log_odds_3
test <- as.data.frame(cbind(p_y1, p_y2, p_y3))
# Simulate categorical outcome
y_nominal <- apply(cbind(p_y1, p_y2, p_y3), 1, \(p) {
sample(1:3, size = 1, prob = p)
})
data_nominal <- data.frame(y = factor(y_nominal), x1 = x1, x2 = x2)
write.table(
data_nominal, file = "data/1y nominal 2xs categorical.txt",
row.names = FALSE, col.names = FALSE
)
# Fit nominal logistic regression model
model_nominal_x_categorical <- multinom(y ~ x1 * x2, data = data_nominal)
#> # weights: 30 (18 variable)
#> initial value 5493.061443
#> iter 10 value 4443.228287
#> iter 20 value 4371.095845
#> final value 4370.437920
#> converged
summary(model_nominal_x_categorical)
#> Call:
#> multinom(formula = y ~ x1 * x2, data = data_nominal)
#>
#> Coefficients:
#> (Intercept) x12 x13 x22 x23 x12:x22 x13:x22
#> 2 1.0452967 0.2445051 0.4128214 -0.4973365 0.2284541 0.8295829 0.4332127
#> 3 -0.6931519 0.7746404 0.3545355 0.5711358 0.7508490 0.1946781 -0.7641970
#> x12:x23 x13:x23
#> 2 0.6081465 -0.2271243
#> 3 -0.3768715 1.0997605
#>
#> Std. Errors:
#> (Intercept) x12 x13 x22 x23 x12:x22 x13:x22
#> 2 0.1052546 0.1520193 0.1555379 0.1474501 0.1541089 0.2297404 0.2184532
#> 3 0.1568124 0.2066875 0.2239507 0.1974346 0.2093726 0.2824636 0.3039697
#> x12:x23 x13:x23
#> 2 0.2393852 0.2426308
#> 3 0.3056020 0.3019850
#>
#> Residual Deviance: 8740.876
#> AIC: 8776.876
df_plot_nominal_x_categorical <- data.frame(
x1,
x2,
model_nominal_x_categorical$fitted.values
)
names(df_plot_nominal_x_categorical) <- c("x1", "x2", "y1", "y2", "y3")
df_plot_nominal_x_categorical <- pivot_longer(
df_plot_nominal_x_categorical,
cols = c("y1", "y2", "y3"),
names_to = "y_level",
values_to = "probability"
)
p <- ggplot(
df_plot_nominal_x_categorical,
aes(x = x1, y = probability, color = y_level)
) +
geom_point() +
geom_line(aes(group = y_level)) +
facet_wrap(~ x2, labeller = "label_both") +
labs(
title = "Predicted probabilities of each level of y",
x = "x1",
y = "Predicted Probability"
)
ggplotly(p)
3.7 Real data example
Section 2.5.3 of Xin et al. (2024).
3.8 Summary
A few take-home messages:
- Modeling an ordinal DV and a nominal DV both result in \(K-1\) regression models.
- For nominal \(y\), all regression parameters, including intercepts and slopes, differ across the \(K-1\) models.
- For ordinal \(y\), only the intercepts differ across the \(K-1\) models; the slopes remain the same.
- For nominal \(y\), if the statistical software reports thresholds, remember to reverse their signs to write the correct regression equation.
- For ordinal \(y\), the slopes enter the cumulative-probability equation with negative signs.
- For binary \(y\), there is only one regression model, so treating \(y\) as nominal is the same as treating it as ordinal, except for the signs of the regression coefficients.