15.5 Logistic regression with
glm(family = "binomial"
The most common non-normal regression analysis is logistic regression, where your dependent variable is just 0s and 1. To do a logistic regression analysis with
glm(), use the
family = binomial argument.
Let’s run a logistic regression on the diamonds dataset. First, I’ll create a binary variable called
value.g190 indicating whether the value of a diamond is greater than 190 or not. Then, I’ll conduct a logistic regression with our new binary variable as the dependent variable. We’ll set
family = "binomial" to tell
glm() that the dependent variable is binary.
# Create a binary variable indicating whether or not # a diamond's value is greater than 190 diamonds$value.g190 <- diamonds$value > 190 # Conduct a logistic regression on the new binary variable diamond.glm <- glm(formula = value.g190 ~ weight + clarity + color, data = diamonds, family = binomial)
Here are the resulting coefficients:
# Print coefficients from logistic regression summary(diamond.glm)$coefficients ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) -18.8009 3.4634 -5.428 5.686e-08 ## weight 1.1251 0.1968 5.716 1.088e-08 ## clarity 9.2910 1.9629 4.733 2.209e-06 ## color -0.3836 0.2481 -1.547 1.220e-01
Just like with regular regression with
lm(), we can get the fitted values from the model and put them back into our dataset to see how well the model fit the data:
# Add logistic fitted values back to dataframe as # new column pred.g190 diamonds$pred.g190 <- diamond.glm$fitted.values # Look at the first few rows (of the named columns) head(diamonds[c("weight", "clarity", "color", "value", "pred.g190")]) ## weight clarity color value pred.g190 ## 1 9.35 0.88 4 182.5 0.16252 ## 2 11.10 1.05 5 191.2 0.82130 ## 3 8.65 0.85 6 175.7 0.03008 ## 4 10.43 1.15 5 195.2 0.84559 ## 5 10.62 0.92 5 181.6 0.44455 ## 6 12.35 0.44 4 182.9 0.08688
Looking at the first few observations, it looks like the probabilities match the data pretty well. For example, the first diamond with a value of 182.5 had a fitted probability of just 0.16 of being valued greater than 190. In contrast, the second diamond, which had a value of 191.2 had a much higher fitted probability of 0.82.
Just like we did with regular regression, you can use the
predict() function along with the results of a
glm() object to predict new data. Let’s use the
diamond.glm object to predict the probability that the new diamonds will have a value greater than 190:
# Predict the 'probability' that the 3 new diamonds # will have a value greater than 190 predict(object = diamond.glm, newdata = diamonds.new) ## 1 2 3 ## 4.8605 -3.5265 -0.3898
What the heck, these don’t look like probabilities! True, they’re not. They are logit-transformed probabilities. To turn them back into probabilities, we need to invert them by applying the inverse logit function:
# Get logit predictions of new diamonds logit.predictions <- predict(object = diamond.glm, newdata = diamonds.new ) # Apply inverse logit to transform to probabilities # (See Equation in the margin) prob.predictions <- 1 / (1 + exp(-logit.predictions)) # Print final predictions! prob.predictions ## 1 2 3 ## 0.99231 0.02857 0.40376
So, the model predicts that the probability that the three new diamonds will be valued over 190 is 99.23%, 2.86%, and 40.38% respectively.
15.5.1 Adding a regression line to a plot
You can easily add a regression line to a scatterplot. To do this, just put the regression object you created with as the main argument to . For example, the following code will create the scatterplot on the right (Figure~) showing the relationship between a diamond’s weight and its value including a red regression line:
# Scatterplot of diamond weight and value plot(x = diamonds$weight, y = diamonds$value, xlab = "Weight", ylab = "Value", main = "Adding a regression line with abline()" ) # Calculate regression model diamonds.lm <- lm(formula = value ~ weight, data = diamonds) # Add regression line abline(diamonds.lm, col = "red", lwd = 2)
15.5.2 Transforming skewed variables prior to standard regression
# The distribution of movie revenus is highly # skewed. hist(movies$revenue.all, main = "Movie revenue\nBefore log-transformation")
If you have a highly skewed variable that you want to include in a regression analysis, you can do one of two things. Option 1 is to use the general linear model
glm() with an appropriate family (like
family = "gamma"). Option 2 is to do a standard regression analysis with
lm(), but before doing so, transforming the variable into something less skewed. For highly skewed data, the most common transformation is a log-transformation.
For example, look at the distribution of movie revenues in the movies dataset in the margin Figure 15.5:
As you can see, these data don’t look Normally distributed at all. There are a few movies (like Avatar) that just an obscene amount of money, and many movies that made much less. If we want to conduct a standard regression analysis on these data, we need to create a new log-transformed version of the variable. In the following code, I’ll create a new variable called
revenue.all.log defined as the logarithm of
# Create a new log-transformed version of movie revenue movies$revenue.all.log <- log(movies$revenue.all)
In Figure 15.6 you can see a histogram of the new log-transformed variable. It’s still skewed, but not nearly as badly as before, so I would be feel much better using this variable in a standard regression analysis with
# Distribution of log-transformed # revenue is much less skewed hist(movies$revenue.all.log, main = "Log-transformed Movie revenue")