40.4 Model

We illustrate a multiple linear regression model, appropriate for continuous outcomes such as metascore:

\[ y_i = \beta_0 + \beta_1 \log(\mathrm{budget}_i + 1) + \beta_2 \log(\mathrm{us\_gross}_i + 1) + \beta_3\, \mathrm{runtime}_i + \beta_4\, \mathrm{year}_i + \varepsilon_i . \]

40.4.1 Assumptions

  • Linearity in parameters and approximately linear relationships after transformation
  • Independence of errors (or appropriately modeled dependence/clustered SEs)
  • Homoskedasticity of errors (or robust SEs)
  • Approximately normal errors for t tests/intervals (large-sample robustness often sufficient)

40.4.2 Why this model?

  • metascore is continuous; OLS is a natural baseline.
  • Financial variables are skewed; log-transform helps stabilize variance and linearize relationships.
  • year and runtime capture secular trends and content length.
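The log-transform claim above can be checked directly. A minimal sketch on simulated data (the lognormal "budget" below is hypothetical, not the movies data), showing how log1p() tames right skew:

```r
# Simulated right-skewed "budgets" (hypothetical, for illustration only)
set.seed(1)
budget <- rlnorm(1000, meanlog = 16, sdlog = 1.5)

# Simple moment-based skewness
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3

skew(budget)         # strongly right-skewed
skew(log1p(budget))  # roughly symmetric after the transform
```

Note that log1p(x) = log(1 + x) is used rather than log(x) so that zero values (e.g., films with no reported gross) remain defined.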

40.4.3 Considerations

  • Interactions: e.g., does budget effectiveness differ by year?
  • Collinearity: budget and gross can be correlated; check VIF.
  • Dependence: panel or clustered structures (franchise, studio) may require cluster-robust SEs.
fit <- lm(metascore ~ log1p(budget) + log1p(us_gross) + runtime + year, data = movies_small)

# Core jtools summary (unstandardized)
summ(fit)
Observations: 831
Dependent variable: metascore
Type: OLS linear regression

F(4,826) = 41.99
R² = 0.17
Adj. R² = 0.16

                   Est.    S.E.  t val.     p
(Intercept)      108.91  131.81    0.83  0.41
log1p(budget)     -6.67    0.62  -10.75  0.00
log1p(us_gross)    3.80    0.54    7.08  0.00
runtime           14.27    1.65    8.65  0.00
year              -0.01    0.07   -0.19  0.85

Standard errors: OLS
# Enhanced jtools summary: standardized coefs, VIFs, semi-partial correlations, CIs
summ(
  fit,
  scale = TRUE,      # standardized betas
  vifs = TRUE,       # collinearity diagnostics
  part.corr = TRUE,  # semi-partial (part) correlations
  confint = TRUE,
  pvals = TRUE
)
Observations: 831
Dependent variable: metascore
Type: OLS linear regression

F(4,826) = 41.99
R² = 0.17
Adj. R² = 0.16

                   Est.   2.5%  97.5%  t val.     p   VIF  partial.r  part.r
(Intercept)       63.01  61.96  64.06  117.57  0.00    NA         NA      NA
log1p(budget)     -7.81  -9.24  -6.38  -10.75  0.00  1.84      -0.35   -0.34
log1p(us_gross)    5.15   3.73   6.58    7.08  0.00  1.84       0.24    0.22
runtime            5.02   3.88   6.16    8.65  0.00  1.17       0.29    0.27
year              -0.11  -1.26   1.04   -0.19  0.85  1.19      -0.01   -0.01

Standard errors: OLS. Continuous predictors are mean-centered and scaled by 1 s.d.; the outcome variable remains in its original units.
# Interactions: visualize and test
# interact_plot() needs the moderator as a named column in the data, so
# create the transformed variable first rather than using log1p() in the formula
movies_small$log_budget <- log1p(movies_small$budget)

fit_int <- lm(metascore ~ log_budget*year + log1p(us_gross) + runtime,
              data = movies_small)

# Now visualize
interactions::interact_plot(fit_int,
                           pred = year,
                           modx = log_budget,
                           interval = TRUE,
                           plot.points = TRUE) +
  theme_bw(base_size = 12)

# Collinearity (car::vif)
car::vif(fit)
#>   log1p(budget) log1p(us_gross)         runtime            year 
#>        1.835578        1.841513        1.172572        1.192862

40.4.4 Model Fit

We report \(R^2\), adjusted \(R^2\), residual standard error (RSE), and information criteria (AIC/BIC). For nested models, we can use ANOVA; otherwise, compare AIC/BIC, cross-validated error, or out-of-sample performance.

We then examine residuals for normality, heteroskedasticity, and influential points. If assumptions are questionable, prefer heteroskedasticity-robust or cluster-robust standard errors.
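The nested-model ANOVA mentioned above works through anova() on two fits whose terms are subsets of one another; a minimal sketch on the built-in mtcars data (the movies data are not used here):

```r
# Nested models: the smaller model's terms are a subset of the larger's
fit_small <- lm(mpg ~ wt, data = mtcars)
fit_full  <- lm(mpg ~ wt + hp + disp, data = mtcars)

anova(fit_small, fit_full)  # F test: do hp and disp jointly improve fit?
AIC(fit_small, fit_full)    # also usable for non-nested comparisons
BIC(fit_small, fit_full)
```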

broom::glance(fit) %>%
  dplyr::select(r.squared, adj.r.squared, sigma, AIC, BIC, df, nobs)
#> # A tibble: 1 × 7
#>   r.squared adj.r.squared sigma   AIC   BIC    df  nobs
#>       <dbl>         <dbl> <dbl> <dbl> <dbl> <dbl> <int>
#> 1     0.169         0.165  15.4 6915. 6943.     4   831
# Base residual diagnostics (four-panel plots)
par(mfrow = c(2, 2))
plot(fit)

par(mfrow = c(1, 1))
# Formal tests (use judiciously; they're sensitive to n)
if (requireNamespace("car", quietly = TRUE)) {
  car::ncvTest(fit)                     # non-constant variance test
  car::durbinWatsonTest(fit)            # autocorrelation test (time ordering needed)
}
#>  lag Autocorrelation D-W Statistic p-value
#>    1      0.03649805      1.923838   0.268
#>  Alternative hypothesis: rho != 0
lmtest::bptest(fit)                      # Breusch-Pagan heteroskedasticity test
#> 
#>  studentized Breusch-Pagan test
#> 
#> data:  fit
#> BP = 18.909, df = 4, p-value = 0.0008191
# Partial residual plots
if (requireNamespace("car", quietly = TRUE)) {
  car::crPlots(fit)
}

40.4.5 Cluster-Robust Standard Errors

When heteroskedasticity or clustering is present, use sandwich estimators:

  • HC0/HC1/HC2/HC3: robust to heteroskedasticity

  • Cluster-robust (by firm, school, market, etc.) when errors correlate within clusters

Below: examples using jtools::summ() and lmtest::coeftest() with sandwich.

# Heteroskedasticity-robust SEs for fit
lmtest::coeftest(fit, vcov. = sandwich::vcovHC(fit, type = "HC3"))
#> 
#> t test of coefficients:
#> 
#>                   Estimate Std. Error t value  Pr(>|t|)    
#> (Intercept)     108.910293 145.640262  0.7478    0.4548    
#> log1p(budget)    -6.669123   0.708027 -9.4193 < 2.2e-16 ***
#> log1p(us_gross)   3.803854   0.552399  6.8861 1.136e-11 ***
#> runtime          14.268967   1.641196  8.6942 < 2.2e-16 ***
#> year             -0.012659   0.072085 -0.1756    0.8606    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# jtools in one line
summ(fit, robust = "HC3", confint = TRUE)
Observations: 831
Dependent variable: metascore
Type: OLS linear regression

F(4,826) = 41.99
R² = 0.17
Adj. R² = 0.16

                   Est.     2.5%   97.5%  t val.     p
(Intercept)      108.91  -176.96  394.78    0.75  0.45
log1p(budget)     -6.67    -8.06   -5.28   -9.42  0.00
log1p(us_gross)    3.80     2.72    4.89    6.89  0.00
runtime           14.27    11.05   17.49    8.69  0.00
year              -0.01    -0.15    0.13   -0.18  0.86

Standard errors: Robust, type = HC3
# Cluster-robust SEs using example data from 'sandwich'
data("PetersenCL", package = "sandwich")
fit2 <- lm(y ~ x, data = PetersenCL)

# Cluster on 'firm'
summ(fit2, robust = "HC3", cluster = "firm")
Observations: 5000
Dependent variable: y
Type: OLS linear regression

F(1,4998) = 1310.74
R² = 0.21
Adj. R² = 0.21

              Est.  S.E.  t val.     p
(Intercept)   0.03  0.07    0.44  0.66
x             1.03  0.05   20.36  0.00

Standard errors: Cluster-robust, type = HC3

See Table 40.1 for a quick reference:

Table 40.1: Sandwich Variants

Type       Applicable  Usage            Notes/References
const      iid         Homoskedastic    Assumes constant variance
HC/HC0     vcovHC      Heteroskedastic  White's estimator (White 1980)
HC1        vcovHC      Heteroskedastic  DoF correction (J. G. MacKinnon and White 1985)
HC2        vcovHC      Heteroskedastic  Hat-matrix adjustment
HC3        vcovHC      Heteroskedastic  Preferred for small \(n\); hat-matrix adjustment
HC4/4m/5   vcovHC      Heteroskedastic  Leverage-adaptive
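To see how much the variants in Table 40.1 differ in practice, one can compute the same coefficient's standard error under each type; a small sketch on the built-in cars data:

```r
library(sandwich)

model <- lm(speed ~ dist, data = cars)

# SE of the 'dist' coefficient under each covariance estimator
types <- c("const", "HC0", "HC1", "HC2", "HC3")
ses <- sapply(types, function(tp) sqrt(vcovHC(model, type = tp)["dist", "dist"]))
round(ses, 5)
```

Here "const" reproduces the classical OLS standard error; HC1 scales HC0 by n/(n - k), and HC3's stronger leverage adjustment typically yields the largest (most conservative) SEs.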
# Another quick demo on built-in data
data(cars)
model <- lm(speed ~ dist, data = cars)
summary(model)
#> 
#> Call:
#> lm(formula = speed ~ dist, data = cars)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -7.5293 -2.1550  0.3615  2.4377  6.4179 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  8.28391    0.87438   9.474 1.44e-12 ***
#> dist         0.16557    0.01749   9.464 1.49e-12 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 3.156 on 48 degrees of freedom
#> Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
#> F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12
lmtest::coeftest(model, vcov. = sandwich::vcovHC(model, type = "HC1"))
#> 
#> t test of coefficients:
#> 
#>             Estimate Std. Error t value  Pr(>|t|)    
#> (Intercept) 8.283906   0.891860  9.2883 2.682e-12 ***
#> dist        0.165568   0.019402  8.5335 3.482e-11 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

40.4.6 Model to Equation

Use equatiomatic to extract LaTeX-ready equations. If unavailable, we provide a fallback to print the fitted equation with estimated coefficients.

# install.packages("equatiomatic") # not available for some R versions (e.g., R 4.2)
fit_eq <- lm(metascore ~ log1p(budget) + log1p(us_gross) + runtime + year, data = movies_small)

# Show the theoretical model
equatiomatic::extract_eq(fit_eq)

# Display the actual coefficients
equatiomatic::extract_eq(fit_eq, use_coefs = TRUE)
# Fallback: build a simple equation string with estimated coefs
print_lm_equation <- function(mod) {
  co <- coef(mod)
  terms <- names(co)
  terms_fmt <- ifelse(terms == "(Intercept)",
                      sprintf("%.3f", co[terms]),
                      paste0(sprintf("%.3f", co[terms]), " \\cdot ", terms))
  rhs <- paste(terms_fmt, collapse = " + ")
  asis <- paste0("$\\hat{y} = ", rhs, "$")
  cat(asis, "\n")
}
print_lm_equation(fit)
#> $\hat{y} = 108.910 + -6.669 \cdot log1p(budget) + 3.804 \cdot log1p(us_gross) + 14.269 \cdot runtime + -0.013 \cdot year$

References

MacKinnon, James G., and Halbert White. 1985. “Some Heteroskedasticity-Consistent Covariance Matrix Estimators with Improved Finite Sample Properties.” Journal of Econometrics 29 (3): 305–25.
White, Halbert. 1980. “A Heteroskedasticity-Consistent Covariance Matrix Estimator and a Direct Test for Heteroskedasticity.” Econometrica: Journal of the Econometric Society, 817–38.