## 15.5 Logistic regression with glm(family = "binomial")

The most common non-normal regression analysis is logistic regression, where your dependent variable is just 0s and 1s. To do a logistic regression analysis with glm(), use the family = "binomial" argument.
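Before we get to real data, here's a minimal sketch of what that call looks like and what comes back from it. Everything in it is made up for illustration: the data are simulated, and the names sim, high.value, and sim.glm are hypothetical. The point is just to show that predictions from a logistic regression come out on the log-odds (logit) scale by default, and that you can convert them to probabilities.

```r
# Simulated stand-in data (hypothetical values, not a real dataset)
set.seed(100)
n <- 200
sim <- data.frame(weight = runif(n, min = 5, max = 15))

# The true probability of a TRUE outcome rises with weight
true.p <- plogis(-10 + 1 * sim$weight)
sim$high.value <- rbinom(n, size = 1, prob = true.p) == 1

# Logistic regression on the binary (TRUE / FALSE) outcome
sim.glm <- glm(formula = high.value ~ weight,
               data = sim,
               family = "binomial")

# New cases to predict
new.cases <- data.frame(weight = c(8, 10, 12))

# By default, predictions are log-odds and can be any real number
logit.pred <- predict(sim.glm, newdata = new.cases)

# The inverse logit converts log-odds to probabilities
prob.pred <- 1 / (1 + exp(-logit.pred))

# predict() can also return probabilities directly
prob.direct <- predict(sim.glm, newdata = new.cases, type = "response")
```

Both prob.pred and prob.direct hold the same probabilities. Which route you take is a matter of taste; type = "response" just saves you from writing the inverse logit yourself.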

Let’s run a logistic regression on the diamonds dataset. First, I’ll create a binary variable called value.g190 indicating whether the value of a diamond is greater than 190 or not. Then, I’ll conduct a logistic regression with our new binary variable as the dependent variable. We’ll set family = "binomial" to tell glm() that the dependent variable is binary.

# Create a binary variable indicating whether or not
#   a diamond's value is greater than 190
diamonds$value.g190 <- diamonds$value > 190

# Conduct a logistic regression on the new binary variable
diamond.glm <- glm(formula = value.g190 ~ weight + clarity + color,
                   data = diamonds,
                   family = "binomial")

Here are the resulting coefficients:

# Print coefficients from logistic regression
summary(diamond.glm)$coefficients

# Scatterplot of diamond weight against value
plot(x = diamonds$weight,
     y = diamonds$value,
     xlab = "Weight",
     ylab = "Value",
     main = "Adding a regression line with abline()"
     )

# Calculate regression model
diamonds.lm <- lm(formula = value ~ weight,
                  data = diamonds)

# Add regression line
abline(diamonds.lm,
       col = "red", lwd = 2)

### 15.5.2 Transforming skewed variables prior to standard regression

# The distribution of movie revenues is highly
#   skewed.
hist(movies$revenue.all,
     main = "Movie revenue\nBefore log-transformation")

If you have a highly skewed variable that you want to include in a regression analysis, you can do one of two things. Option 1 is to use the generalized linear model glm() with an appropriate family (like family = "Gamma"). Option 2 is to do a standard regression analysis with lm(), but first transform the variable into something less skewed. For highly skewed data, the most common transformation is a log-transformation.
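Here's a rough sketch of what Option 1 could look like. The data are simulated and the names sim, budget, revenue, and gamma.glm are hypothetical; I'm also assuming a log link, which is a common choice but not the only one. Note that R spells this family Gamma with a capital G (lowercase gamma is a different function).

```r
# Simulated stand-in data (hypothetical values, not the movies dataset)
set.seed(100)
n <- 500
sim <- data.frame(budget = runif(n, min = 1, max = 100))

# A right-skewed, strictly positive outcome whose mean grows with budget
#   (mean of a gamma variable = shape / rate)
sim$revenue <- rgamma(n,
                      shape = 2,
                      rate = 2 / exp(1 + 0.02 * sim$budget))

# Model the skewed outcome directly with a Gamma family
#   Assumption: a log link, so coefficients act on log(mean revenue)
gamma.glm <- glm(formula = revenue ~ budget,
                 data = sim,
                 family = Gamma(link = "log"))

# Print the estimated coefficients
coef(gamma.glm)
```

Because the data were simulated with a slope of 0.02 on the log scale, the budget coefficient should land close to that value.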

For example, look at the distribution of movie revenues in the movies dataset, shown in Figure 15.5:

As you can see, these data don’t look Normally distributed at all. There are a few movies (like Avatar) that made an obscene amount of money, and many movies that made much less. If we want to conduct a standard regression analysis on these data, we need to create a new log-transformed version of the variable. In the following code, I’ll create a new variable called revenue.all.log defined as the logarithm of revenue.all.

# Create a new log-transformed version of movie revenue
movies$revenue.all.log <- log(movies$revenue.all)

In Figure 15.6 you can see a histogram of the new log-transformed variable. It’s still skewed, but not nearly as badly as before, so I would feel much better using this variable in a standard regression analysis with lm().

# Distribution of log-transformed
#  revenue is much less skewed
hist(movies$revenue.all.log,
     main = "Log-transformed Movie revenue")
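To round things off, here's a sketch of the full log-transformation workflow: transform, fit with lm(), then convert predictions back to the original scale. Again, the data are simulated and the names sim, budget, revenue, and log.lm are hypothetical, so treat this as an illustration rather than an analysis of the movies data.

```r
# Simulated stand-in data (hypothetical values, not the movies dataset)
set.seed(100)
n <- 500
sim <- data.frame(budget = runif(n, min = 1, max = 100))

# Skewed revenue: Normal noise on the log scale around a trend in budget
sim$revenue <- exp(1 + 0.02 * sim$budget + rnorm(n, sd = 0.5))

# Create a log-transformed version of the skewed variable
sim$revenue.log <- log(sim$revenue)

# Standard regression on the log-transformed variable
log.lm <- lm(formula = revenue.log ~ budget,
             data = sim)

# Predictions from this model are on the log scale...
log.pred <- predict(log.lm,
                    newdata = data.frame(budget = c(10, 50, 90)))

# ...so use exp() to get back to the original revenue scale
revenue.pred <- exp(log.pred)
```

One caveat: exponentiating a prediction made on the log scale gives you something closer to a typical (median-like) value than to the arithmetic mean, so back-transformed predictions will tend to sit below the raw average for skewed data.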