Chapter 6 Logistic Regression

  • When do we use this?: When \(Y\) is a binary variable (a categorical variable with exactly 2 outcomes).

6.1 Fitting the Model

m1 <- glm( y ~ x, data = , family = binomial)

  • Replace y with your categorical response variable.
  • Replace x with your explanatory variable.
  • data = your data set

6.2 Creating Mosaic Plots

mosaicplot( x ~ y, data = , xlab = "X Axis Label", ylab = "Y Axis Label" , main = "Plot Title")

  • Replace y with your categorical response variable.
  • Replace x with your categorical explanatory variable.
  • data = your data set
  • You can change what is in the parentheses to reflect the labels and title you want.

6.3 Creating an Empirical Logit Plot

logOddsPlot <- function(x, y, xname, formulahere){
  sort = order(x)
  x = x[sort]
  y = y[sort]
  a = seq(1, length(x), by=.05*length(y))
  b = c(a[-1] - 1, length(x))
  
  prob = xmean = ns = rep(0, length(a)) # ns is for CIs
  for (i in 1:length(a)){
    range = (a[i]):(b[i])
    prob[i] = mean(y[range])
    xmean[i] = mean(x[range])
    ns[i] = b[i] - a[i] + 1 # for CI 
  }
  
  extreme = (prob == 1 | prob == 0)
  #prob[prob == 0] = min(prob[!extreme])
  #prob[prob == 1] = max(prob[!extreme])
  prob[prob == 0] = .01
  prob[prob == 1] = .99
  
  g = log(prob/(1-prob))
  
  dataHere <- data.frame("x" = b, "LogOdds" = g)
  
  suppressMessages(library(ggplot2))
  
  ggplot(dataHere, aes(x =xmean, y = LogOdds)) + geom_point() + geom_smooth(formula = formulahere, method = "lm", se = FALSE) + labs(x = xname, y = "Log Odds")
}

Citation: This code was adapted from (http://alexschell.github.io/emplogit.html) [Alex Schell’s website].

To use the function, you will need to specify the inputs, meaning the pieces of information the computer needs in order to run the code.

  • x: the column containing the x variable (Titanic$Fare)
  • y: the column containing the y variable (Titanic$Survived). Note that this code sometimes requires you to put as.numeric around your variable (as.numeric(Titanic$Survived)) as it is specifically looking for a number between 0 and 1.
  • xlab: This is the x axis label you want on your graph, like “Age”. Make sure you put quotes around your label.
  • formulahere: This is the shape of the relationship you want. If you want a line, use y~x. If you want a 2nd order polynomial, use y~poly(x,2)

6.4 Making Predictions

probabilities <- predict(Model, type = "response")

predicted.Y <- ifelse(probabilities > THRESHOLD , 1, 0)

  • The first line creates predicted probability that \(Y_i = 1\) for each row \(i\).
  • The second row assign all rows with predicted probabilities greater than THRESHOLD as a 1, and all probabilities less than or equal to THRESHOLD as 0.
  • Replace “model” with the name of your model.
  • Replace DATA with the name of your data set.
  • Replace THRESHOLD with your chosen threshold value.

6.5 Confusion Matrix

knitr::kable( table("Prediction"= predicted.Y, "Actual" = DATA$Y) )

  • Replace DATA with the name of your data set.
  • Replace Y with your Y variable.

6.6 Hypothesis Testing

anova(Model1,Model2, test = "LRT")

  • This code is for a Nested Likelihood Ratio Test.
  • Replace Model1 with the name of your smaller model, and Model2 with the name of your larger model.

anova(Model1, test = "Chisq")

  • This code is for a Drop in Deviance Test.
  • Replace Model1 with the name of your model.