Chapter 6 Logistic Regression
- When do we use this?: When \(Y\) is a binary variable (a categorical variable with exactly 2 outcomes).
6.1 Fitting the Model
m1 <- glm( y ~ x, data = , family = binomial)
- Replace y with your categorical response variable.
- Replace x with your explanatory variable.
- data = your data set
6.2 Creating Mosaic Plots
mosaicplot( x ~ y, data = , xlab = "X Axis Label", ylab = "Y Axis Label" , main = "Plot Title")
- Replace y with your categorical response variable.
- Replace x with your categorical explanatory variable.
- data = your data set
- You can change what is in the parentheses to reflect the labels and title you want.
6.3 Creating an Empirical Logit Plot
logOddsPlot <- function(x, y, xname, formulahere){
sort = order(x)
x = x[sort]
y = y[sort]
a = seq(1, length(x), by=.05*length(y))
b = c(a[-1] - 1, length(x))
prob = xmean = ns = rep(0, length(a)) # ns is for CIs
for (i in 1:length(a)){
range = (a[i]):(b[i])
prob[i] = mean(y[range])
xmean[i] = mean(x[range])
ns[i] = b[i] - a[i] + 1 # for CI
}
extreme = (prob == 1 | prob == 0)
#prob[prob == 0] = min(prob[!extreme])
#prob[prob == 1] = max(prob[!extreme])
prob[prob == 0] = .01
prob[prob == 1] = .99
g = log(prob/(1-prob))
dataHere <- data.frame("x" = b, "LogOdds" = g)
suppressMessages(library(ggplot2))
ggplot(dataHere, aes(x =xmean, y = LogOdds)) + geom_point() + geom_smooth(formula = formulahere, method = "lm", se = FALSE) + labs(x = xname, y = "Log Odds")
}
Citation: This code was adapted from (http://alexschell.github.io/emplogit.html) [Alex Schell’s website].
To use the function, you will need to specify the inputs, meaning the pieces of information the computer needs in order to run the code.
- x: the column containing the x variable (
Titanic$Fare
) - y: the column containing the y variable (
Titanic$Survived
). Note that this code sometimes requires you to putas.numeric
around your variable (as.numeric(Titanic$Survived)
) as it is specifically looking for a number between 0 and 1. - xlab: This is the x axis label you want on your graph, like “Age”. Make sure you put quotes around your label.
- formulahere: This is the shape of the relationship you want. If you want a line, use
y~x
. If you want a 2nd order polynomial, usey~poly(x,2)
6.4 Making Predictions
probabilities <- predict(Model, type = "response")
predicted.Y <- ifelse(probabilities > THRESHOLD , 1, 0)
- The first line creates predicted probability that \(Y_i = 1\) for each row \(i\).
- The second row assign all rows with predicted probabilities greater than THRESHOLD as a 1, and all probabilities less than or equal to THRESHOLD as 0.
- Replace “model” with the name of your model.
- Replace DATA with the name of your data set.
- Replace THRESHOLD with your chosen threshold value.
6.5 Confusion Matrix
knitr::kable( table("Prediction"= predicted.Y, "Actual" = DATA$Y) )
- Replace DATA with the name of your data set.
- Replace Y with your Y variable.
6.6 Hypothesis Testing
anova(Model1,Model2, test = "LRT")
- This code is for a Nested Likelihood Ratio Test.
- Replace Model1 with the name of your smaller model, and Model2 with the name of your larger model.
anova(Model1, test = "Chisq")
- This code is for a Drop in Deviance Test.
- Replace Model1 with the name of your model.