## 15.5 Logistic regression with `glm(family = "binomial"`

The most common non-normal regression analysis is logistic regression, where your dependent variable is just 0s and 1. To do a logistic regression analysis with `glm()`

, use the `family = binomial`

argument.

Let’s run a logistic regression on the diamonds dataset. First, I’ll create a binary variable called `value.g190`

indicating whether the value of a diamond is greater than 190 or not. Then, I’ll conduct a logistic regression with our new binary variable as the dependent variable. We’ll set `family = "binomial"`

to tell `glm()`

that the dependent variable is binary.

```
# Create a binary variable indicating whether or not
# a diamond's value is greater than 190
diamonds$value.g190 <- diamonds$value > 190
# Conduct a logistic regression on the new binary variable
diamond.glm <- glm(formula = value.g190 ~ weight + clarity + color,
data = diamonds,
family = binomial)
```

Here are the resulting coefficients:

```
# Print coefficients from logistic regression
summary(diamond.glm)$coefficients
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -18.8009 3.4634 -5.428 5.686e-08
## weight 1.1251 0.1968 5.716 1.088e-08
## clarity 9.2910 1.9629 4.733 2.209e-06
## color -0.3836 0.2481 -1.547 1.220e-01
```

Just like with regular regression with `lm()`

, we can get the fitted values from the model and put them back into our dataset to see how well the model fit the data:

```
# Add logistic fitted values back to dataframe as
# new column pred.g190
diamonds$pred.g190 <- diamond.glm$fitted.values
# Look at the first few rows (of the named columns)
head(diamonds[c("weight", "clarity", "color", "value", "pred.g190")])
## weight clarity color value pred.g190
## 1 9.35 0.88 4 182.5 0.16252
## 2 11.10 1.05 5 191.2 0.82130
## 3 8.65 0.85 6 175.7 0.03008
## 4 10.43 1.15 5 195.2 0.84559
## 5 10.62 0.92 5 181.6 0.44455
## 6 12.35 0.44 4 182.9 0.08688
```

Looking at the first few observations, it looks like the probabilities match the data pretty well. For example, the first diamond with a value of 182.5 had a fitted probability of just 0.16 of being valued greater than 190. In contrast, the second diamond, which had a value of 191.2 had a much higher fitted probability of 0.82.

Just like we did with regular regression, you can use the `predict()`

function along with the results of a `glm()`

object to predict new data. Let’s use the `diamond.glm`

object to predict the probability that the new diamonds will have a value greater than 190:

```
# Predict the 'probability' that the 3 new diamonds
# will have a value greater than 190
predict(object = diamond.glm,
newdata = diamonds.new)
## 1 2 3
## 4.8605 -3.5265 -0.3898
```

What the heck, these don’t look like probabilities! True, they’re not. They are *logit-transformed* probabilities. To turn them back into probabilities, we need to invert them by applying the inverse logit function:

```
# Get logit predictions of new diamonds
logit.predictions <- predict(object = diamond.glm,
newdata = diamonds.new
)
# Apply inverse logit to transform to probabilities
# (See Equation in the margin)
prob.predictions <- 1 / (1 + exp(-logit.predictions))
# Print final predictions!
prob.predictions
## 1 2 3
## 0.99231 0.02857 0.40376
```

So, the model predicts that the probability that the three new diamonds will be valued over 190 is 99.23%, 2.86%, and 40.38% respectively.

### 15.5.1 Adding a regression line to a plot

You can easily add a regression line to a scatterplot. To do this, just put the regression object you created with as the main argument to . For example, the following code will create the scatterplot on the right (Figure~) showing the relationship between a diamond’s weight and its value including a red regression line:

```
# Scatterplot of diamond weight and value
plot(x = diamonds$weight,
y = diamonds$value,
xlab = "Weight",
ylab = "Value",
main = "Adding a regression line with abline()"
)
# Calculate regression model
diamonds.lm <- lm(formula = value ~ weight,
data = diamonds)
# Add regression line
abline(diamonds.lm,
col = "red", lwd = 2)
```

### 15.5.2 Transforming skewed variables prior to standard regression

```
# The distribution of movie revenus is highly
# skewed.
hist(movies$revenue.all,
main = "Movie revenue\nBefore log-transformation")
```

If you have a highly skewed variable that you want to include in a regression analysis, you can do one of two things. Option 1 is to use the general linear model `glm()`

with an appropriate family (like `family = "gamma"`

). Option 2 is to do a standard regression analysis with `lm()`

, but before doing so, transforming the variable into something less skewed. For highly skewed data, the most common transformation is a log-transformation.

For example, look at the distribution of movie revenues in the movies dataset in the margin Figure 15.5:

As you can see, these data don’t look Normally distributed at all. There are a few movies (like Avatar) that just an obscene amount of money, and many movies that made much less. If we want to conduct a standard regression analysis on these data, we need to create a new log-transformed version of the variable. In the following code, I’ll create a new variable called `revenue.all.log`

defined as the logarithm of `revenue.all`

```
# Create a new log-transformed version of movie revenue
movies$revenue.all.log <- log(movies$revenue.all)
```

In Figure 15.6 you can see a histogram of the new log-transformed variable. It’s still skewed, but not nearly as badly as before, so I would be feel much better using this variable in a standard regression analysis with `lm()`

.

```
# Distribution of log-transformed
# revenue is much less skewed
hist(movies$revenue.all.log,
main = "Log-transformed Movie revenue")
```