40.4 Model

We illustrate a multiple linear regression model, appropriate for continuous outcomes such as metascore:

\[ y_i = \beta_0 + \beta_1 \log(\mathrm{budget}_i + 1) + \beta_2 \log(\mathrm{us\_gross}_i + 1) + \beta_3\, \mathrm{runtime}_i + \beta_4\, \mathrm{year}_i + \varepsilon_i . \]

40.4.1 Assumptions

  • Linearity in parameters and approximately linear relationships after transformation
  • Independence of errors (or appropriately modeled dependence/clustered SEs)
  • Homoskedasticity of errors (or robust SEs)
  • Approximately normal errors for t tests/intervals (large-sample robustness often sufficient)

40.4.2 Why this model?

  • metascore is continuous; OLS is a natural baseline.
  • Financial variables are skewed; log-transform helps stabilize variance and linearize relationships.
  • year and runtime capture secular trends and content length.
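The log-transform claim above can be checked directly. A minimal sketch on simulated data (the lognormal "budget" below is hypothetical, not the movies data), showing how log1p() tames right skew:

```r
# Simulated right-skewed "budgets" (hypothetical, for illustration only)
set.seed(1)
budget <- rlnorm(1000, meanlog = 16, sdlog = 1.5)

# Simple moment-based skewness
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3

skew(budget)         # strongly right-skewed
skew(log1p(budget))  # roughly symmetric after the transform
```

Note that log1p(x) = log(1 + x) is used rather than log(x) so that zero values (e.g., films with no reported gross) remain defined.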

40.4.3 Considerations

  • Interactions: e.g., does budget effectiveness differ by year?
  • Collinearity: budget and gross can be correlated; check VIF.
  • Dependence: panel or clustered structures (franchise, studio) may require cluster-robust SEs.
fit <- lm(metascore ~ log1p(budget) + log1p(us_gross) + runtime + year, data = movies_small)

# Core jtools summary (unstandardized)
summ(fit)
Observations: 831
Dependent variable: metascore
Type: OLS linear regression

F(4,826) = 41.99
R² = 0.17
Adj. R² = 0.16

                   Est.    S.E.  t val.     p
(Intercept)      108.91  131.81    0.83  0.41
log1p(budget)     -6.67    0.62  -10.75  0.00
log1p(us_gross)    3.80    0.54    7.08  0.00
runtime           14.27    1.65    8.65  0.00
year              -0.01    0.07   -0.19  0.85

Standard errors: OLS
# Enhanced jtools summary: standardized coefs, VIFs, semi-partial correlations, CIs
summ(
  fit,
  scale = TRUE,      # standardized betas
  vifs = TRUE,       # collinearity diagnostics
  part.corr = TRUE,  # semi-partial (part) correlations
  confint = TRUE,
  pvals = TRUE
)
Observations: 831
Dependent variable: metascore
Type: OLS linear regression

F(4,826) = 41.99
R² = 0.17
Adj. R² = 0.16

                   Est.   2.5%  97.5%  t val.     p   VIF  partial.r  part.r
(Intercept)       63.01  61.96  64.06  117.57  0.00    NA         NA      NA
log1p(budget)     -7.81  -9.24  -6.38  -10.75  0.00  1.84      -0.35   -0.34
log1p(us_gross)    5.15   3.73   6.58    7.08  0.00  1.84       0.24    0.22
runtime            5.02   3.88   6.16    8.65  0.00  1.17       0.29    0.27
year              -0.11  -1.26   1.04   -0.19  0.85  1.19      -0.01   -0.01

Standard errors: OLS. Continuous predictors are mean-centered and scaled by 1 s.d.; the outcome variable remains in its original units.
# Interactions: visualize and test
# interact_plot() needs the moderator as a named column in the data, so
# create the transformed variable first rather than using log1p() in the formula
movies_small$log_budget <- log1p(movies_small$budget)

fit_int <- lm(metascore ~ log_budget*year + log1p(us_gross) + runtime,
              data = movies_small)

# Now visualize
interactions::interact_plot(fit_int,
                           pred = year,
                           modx = log_budget,
                           interval = TRUE,
                           plot.points = TRUE) +
  theme_bw(base_size = 12)

# Collinearity (car::vif)
car::vif(fit)
#>   log1p(budget) log1p(us_gross)         runtime            year 
#>        1.835578        1.841513        1.172572        1.192862

40.4.4 Model Fit

We report \(R^2\), adjusted \(R^2\), residual standard error (RSE), and information criteria (AIC/BIC). For nested models, we can use ANOVA; otherwise, compare AIC/BIC, cross-validated error, or out-of-sample performance.

We then examine residuals for normality, heteroskedasticity, and influential points. If assumptions are questionable, prefer heteroskedasticity-robust or cluster-robust standard errors.
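The nested-model ANOVA mentioned above works through anova() on two fits whose terms are subsets of one another; a minimal sketch on the built-in mtcars data (the movies data are not used here):

```r
# Nested models: the smaller model's terms are a subset of the larger's
fit_small <- lm(mpg ~ wt, data = mtcars)
fit_full  <- lm(mpg ~ wt + hp + disp, data = mtcars)

anova(fit_small, fit_full)  # F test: do hp and disp jointly improve fit?
AIC(fit_small, fit_full)    # also usable for non-nested comparisons
BIC(fit_small, fit_full)
```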

broom::glance(fit) %>%
  dplyr::select(r.squared, adj.r.squared, sigma, AIC, BIC, df, nobs)
#> # A tibble: 1 × 7
#>   r.squared adj.r.squared sigma   AIC   BIC    df  nobs
#>       <dbl>         <dbl> <dbl> <dbl> <dbl> <dbl> <int>
#> 1     0.169         0.165  15.4 6915. 6943.     4   831
# Base residual diagnostics (four-panel plots)
par(mfrow = c(2, 2))
plot(fit)

par(mfrow = c(1, 1))
# Formal tests (use judiciously; they're sensitive to n)
if (requireNamespace("car", quietly = TRUE)) {
  car::ncvTest(fit)                     # non-constant variance test
  car::durbinWatsonTest(fit)            # autocorrelation test (time ordering needed)
}
#>  lag Autocorrelation D-W Statistic p-value
#>    1      0.03649805      1.923838   0.268
#>  Alternative hypothesis: rho != 0
lmtest::bptest(fit)                      # Breusch-Pagan heteroskedasticity test
#> 
#>  studentized Breusch-Pagan test
#> 
#> data:  fit
#> BP = 18.909, df = 4, p-value = 0.0008191
# Partial residual plots
if (requireNamespace("car", quietly = TRUE)) {
  car::crPlots(fit)
}

40.4.5 Cluster-Robust Standard Errors

When heteroskedasticity or clustering is present, use sandwich estimators:

  • HC0/HC1/HC2/HC3: robust to heteroskedasticity

  • Cluster-robust (by firm, school, market, etc.) when errors correlate within clusters

Below: examples using jtools::summ() and lmtest::coeftest() with sandwich.

# Heteroskedasticity-robust SEs for fit
lmtest::coeftest(fit, vcov. = sandwich::vcovHC(fit, type = "HC3"))
#> 
#> t test of coefficients:
#> 
#>                   Estimate Std. Error t value  Pr(>|t|)    
#> (Intercept)     108.910293 145.640262  0.7478    0.4548    
#> log1p(budget)    -6.669123   0.708027 -9.4193 < 2.2e-16 ***
#> log1p(us_gross)   3.803854   0.552399  6.8861 1.136e-11 ***
#> runtime          14.268967   1.641196  8.6942 < 2.2e-16 ***
#> year             -0.012659   0.072085 -0.1756    0.8606    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# jtools in one line
summ(fit, robust = "HC3", confint = TRUE)
Observations: 831
Dependent variable: metascore
Type: OLS linear regression

F(4,826) = 41.99
R² = 0.17
Adj. R² = 0.16

                   Est.     2.5%   97.5%  t val.     p
(Intercept)      108.91  -176.96  394.78    0.75  0.45
log1p(budget)     -6.67    -8.06   -5.28   -9.42  0.00
log1p(us_gross)    3.80     2.72    4.89    6.89  0.00
runtime           14.27    11.05   17.49    8.69  0.00
year              -0.01    -0.15    0.13   -0.18  0.86

Standard errors: Robust, type = HC3
# Cluster-robust SEs using example data from 'sandwich'
data("PetersenCL", package = "sandwich")
fit2 <- lm(y ~ x, data = PetersenCL)

# Cluster on 'firm'
summ(fit2, robust = "HC3", cluster = "firm")
Observations: 5000
Dependent variable: y
Type: OLS linear regression

F(1,4998) = 1310.74
R² = 0.21
Adj. R² = 0.21

              Est.  S.E.  t val.     p
(Intercept)   0.03  0.07    0.44  0.66
x             1.03  0.05   20.36  0.00

Standard errors: Cluster-robust, type = HC3

See Table 40.1 for a quick reference:

Table 40.1: Sandwich Variants

Type       Applicable  Usage            Notes/References
const      iid         Homoskedastic    Assumes constant variance
HC/HC0     vcovHC      Heteroskedastic  White's estimator (White 1980)
HC1        vcovHC      Heteroskedastic  DoF correction (J. G. MacKinnon and White 1985)
HC2        vcovHC      Heteroskedastic  Hat-matrix adjustment
HC3        vcovHC      Heteroskedastic  Preferred for small \(n\); hat-matrix adjustment
HC4/4m/5   vcovHC      Heteroskedastic  Leverage-adaptive
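To see how much the variants in Table 40.1 differ in practice, one can compute the same coefficient's standard error under each type; a small sketch on the built-in cars data:

```r
library(sandwich)

model <- lm(speed ~ dist, data = cars)

# SE of the 'dist' coefficient under each covariance estimator
types <- c("const", "HC0", "HC1", "HC2", "HC3")
ses <- sapply(types, function(tp) sqrt(vcovHC(model, type = tp)["dist", "dist"]))
round(ses, 5)
```

Here "const" reproduces the classical OLS standard error; HC1 scales HC0 by n/(n - k), and HC3's stronger leverage adjustment typically yields the largest (most conservative) SEs.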
# Another quick demo on built-in data
data(cars)
model <- lm(speed ~ dist, data = cars)
summary(model)
#> 
#> Call:
#> lm(formula = speed ~ dist, data = cars)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -7.5293 -2.1550  0.3615  2.4377  6.4179 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  8.28391    0.87438   9.474 1.44e-12 ***
#> dist         0.16557    0.01749   9.464 1.49e-12 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 3.156 on 48 degrees of freedom
#> Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
#> F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12
lmtest::coeftest(model, vcov. = sandwich::vcovHC(model, type = "HC1"))
#> 
#> t test of coefficients:
#> 
#>             Estimate Std. Error t value  Pr(>|t|)    
#> (Intercept) 8.283906   0.891860  9.2883 2.682e-12 ***
#> dist        0.165568   0.019402  8.5335 3.482e-11 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

40.4.6 Model to Equation

Use equatiomatic to extract LaTeX-ready equations. If unavailable, we provide a fallback to print the fitted equation with estimated coefficients.

# install.packages("equatiomatic") # not available for some R versions (e.g., R 4.2)
fit_eq <- lm(metascore ~ log1p(budget) + log1p(us_gross) + runtime + year, data = movies_small)

# Show the theoretical model
equatiomatic::extract_eq(fit_eq)

# Display the actual coefficients
equatiomatic::extract_eq(fit_eq, use_coefs = TRUE)
# Fallback: build a simple equation string with estimated coefs
print_lm_equation <- function(mod) {
  co <- coef(mod)
  terms <- names(co)
  terms_fmt <- ifelse(terms == "(Intercept)",
                      sprintf("%.3f", co[terms]),
                      paste0(sprintf("%.3f", co[terms]), " \\cdot ", terms))
  rhs <- paste(terms_fmt, collapse = " + ")
  asis <- paste0("$\\hat{y} = ", rhs, "$")
  cat(asis, "\n")
}
print_lm_equation(fit)
#> $\hat{y} = 108.910 + -6.669 \cdot log1p(budget) + 3.804 \cdot log1p(us_gross) + 14.269 \cdot runtime + -0.013 \cdot year$

References

MacKinnon, James G., and Halbert White. 1985. “Some Heteroskedasticity-Consistent Covariance Matrix Estimators with Improved Finite Sample Properties.” Journal of Econometrics 29 (3): 305–25.
White, Halbert. 1980. “A Heteroskedasticity-Consistent Covariance Matrix Estimator and a Direct Test for Heteroskedasticity.” Econometrica: Journal of the Econometric Society, 817–38.