Chapter 3 Multiple Linear Regression

3.1 The Model

We can extend the simple linear regression model to the situation where we have \(k\) predictor variables \(X_1,X_2,\cdots,X_k\) and a response variable \(Y\). Think of simple linear regression as the special case where \(k=1\), the graph of the data is a 2-D scatterplot, and the least squares regression model is merely the equation of a straight line.

\[Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_k X_{ki} + \epsilon_i\] or \[Y_i = \beta_0 + \sum_{j=1}^k \beta_j X_{ji} + \epsilon_i\] We still make the same assumptions as before. Note that the variable \(Y\) is normally distributed with variance \(\sigma^2\) for each combination of \(X_1,X_2,\cdots,X_k\), or equivalently \(\epsilon \sim N(0,\sigma^2)\). If \(k=2\) (i.e. exactly 2 predictors), the scatterplot will be in 3-D and we will fit the least squares regression plane to the data. When \(k \geq 3\), we fit a regression surface (often called a response surface). If we let some \(X_j\) be higher-order functions of the basic independent variables (e.g. \(X_3 = X_2^2\) or \(X_4 = X_1 X_2\)), the response surface will be curved, even though the model remains linear in the \(\beta\) parameters.
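
To make the higher-order idea concrete, here is a minimal sketch on made-up simulated data (fake, X1, X2, and Y are invented for illustration, not from our examples). In an lm formula, I() protects arithmetic such as squaring, and a colon requests a product (interaction) term.

# simulated data just to illustrate the formula syntax
set.seed(1)
fake <- data.frame(X1 = runif(50), X2 = runif(50))
fake$Y <- 1 + 2*fake$X1 - fake$X2 + 0.5*fake$X2^2 + rnorm(50)

# I(X2^2) adds a squared term and X1:X2 a product term,
# so the fitted response surface is curved
curved.model <- lm(Y ~ X1 + X2 + I(X2^2) + X1:X2, data = fake)
coef(curved.model)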

3.2 The AccordPrice Example

Let’s put this in action using the AccordPrice data set in the Stat2Data package, where we will predict Price using both the car’s Age and its Mileage. I’ll load the GGally package to help with some plotting.

require(tidyverse)
require(GGally)
require(Stat2Data)
data(AccordPrice)
ggpairs(AccordPrice) # from GGally

# fit model
honda.model <- lm(Price ~ Age + Mileage, data=AccordPrice)
summary(honda.model)
## 
## Call:
## lm(formula = Price ~ Age + Mileage, data = AccordPrice)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5832 -1.5016 -0.1708  0.9218  6.8996 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 21.73825    0.72194   30.11  < 2e-16 ***
## Age         -0.79138    0.15764   -5.02 2.88e-05 ***
## Mileage     -0.05315    0.01682   -3.16  0.00387 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.259 on 27 degrees of freedom
## Multiple R-squared:  0.8555, Adjusted R-squared:  0.8448 
## F-statistic: 79.96 on 2 and 27 DF,  p-value: 4.532e-12

I doubt you are surprised that both Age and Mileage are negatively correlated with Price, while they are positively correlated with each other. The \(t\)-tests for both Age and Mileage are statistically significant at a traditional level of \(\alpha=0.05\):

Age: \(t=-5.02\), \(df=27\), \(p<.0001\)

Mileage: \(t=-3.16\), \(df=27\), \(p=.0039\)

A relatively simple measure of the “quality” of a regression model is \(R^2\). In simple linear regression, this was just the squared correlation. Now, it is the squared multiple correlation coefficient between \(Y\) and the set of predictors \(X_1, \cdots, X_k\). It can be computed from the sums of squares as:

\[R^2 = \frac{SSModel}{SSY} = 1 - \frac{SSE}{SSY}\]

Note that the multiple \(R^2\) for our two-variable regression model for predicting the price of a Honda Accord is \(R^2=0.8555\). Over 85% of the variation in the price of the used car is accounted for by this linear model that uses the car’s age and mileage.
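
As a check, we can reproduce this value directly from the sums-of-squares definition using the fitted honda.model:

# verify R^2 = 1 - SSE/SSY for the Accord model
SSE <- sum(residuals(honda.model)^2)
SSY <- sum((AccordPrice$Price - mean(AccordPrice$Price))^2)
1 - SSE/SSY # should match the 0.8555 reported by summary()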

The two simple linear regression models return \(r^2=0.8021\) (for Age) and \(r^2=0.7207\) (for Mileage). A flaw of the multiple \(R^2\) statistic is that it can never decrease (and in practice virtually always increases) when a new predictor \(X_j\) is added to the model, even if that predictor has virtually no correlation with \(Y\), such as the last digit of the car’s VIN.

A variation called Adjusted \(R^2\) is also computed by R with the formula

\[\text{Adjusted } R^2 = 1 - \frac{(1-R^2)(n-1)}{n-k-1}\]

This formula is a crude way to “penalize” a model for becoming more complicated with the addition of \(\beta_j\) terms (slopes for new variables). The Principle of Parsimony tells us that we want the simplest model that adequately explains a phenomenon. We’ll see better ways of selecting a model in future weeks.
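
For the Accord model, with \(n=30\) cars and \(k=2\) predictors, we can check R’s adjusted value by hand:

# adjusted R^2 by hand for the Accord model
n <- 30; k <- 2; R2 <- 0.8555
1 - (1 - R2)*(n - 1)/(n - k - 1) # matches the 0.8448 from summary()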

3.3 The airquality Example

Let’s look at the airquality example from base R. There are \(k=3\) predictors: the response is Ozone (the ozone level, measured in parts per billion), and the predictors are Solar.R (solar radiation in langleys), Wind (average wind speed in miles per hour), and Temp (maximum daily temperature in degrees Fahrenheit). I’ll ignore the Month and Day variables in my regression model. Notice that many data values are missing and denoted as NA in the data set.

ggpairs(airquality,columns=1:4,aes(color=factor(Month)))

From the scatterplot matrix, we see that the Ozone level is positively correlated with Solar.R and Temp but is negatively correlated with Wind. I’m going to fit the multiple regression model, provide the summary (which includes \(t\)-tests on each \(X_j\)), and show the ANOVA tables computed with both the anova and Anova functions (which will be different, unlike in simple linear regression!)

I’m going to create a data set with all of the observations containing missing values (NA) removed. Along the way, we will learn about Type I and Type II sums of squares and when you would choose one or the other.

# create a data set with all cases with missing data removed
aq.data <- airquality %>%
  drop_na()

summary(aq.data)
##      Ozone          Solar.R           Wind            Temp      
##  Min.   :  1.0   Min.   :  7.0   Min.   : 2.30   Min.   :57.00  
##  1st Qu.: 18.0   1st Qu.:113.5   1st Qu.: 7.40   1st Qu.:71.00  
##  Median : 31.0   Median :207.0   Median : 9.70   Median :79.00  
##  Mean   : 42.1   Mean   :184.8   Mean   : 9.94   Mean   :77.79  
##  3rd Qu.: 62.0   3rd Qu.:255.5   3rd Qu.:11.50   3rd Qu.:84.50  
##  Max.   :168.0   Max.   :334.0   Max.   :20.70   Max.   :97.00  
##      Month            Day       
##  Min.   :5.000   Min.   : 1.00  
##  1st Qu.:6.000   1st Qu.: 9.00  
##  Median :7.000   Median :16.00  
##  Mean   :7.216   Mean   :15.95  
##  3rd Qu.:9.000   3rd Qu.:22.50  
##  Max.   :9.000   Max.   :31.00
require(mosaic)
favstats(~Ozone,data=airquality) # original data
##  min Q1 median    Q3 max     mean       sd   n missing
##    1 18   31.5 63.25 168 42.12931 32.98788 116      37
favstats(~Ozone,data=aq.data) # 'cleaned' data
##  min Q1 median Q3 max    mean       sd   n missing
##    1 18     31 62 168 42.0991 33.27597 111       0
aq.model <- lm(Ozone ~ Solar.R + Temp + Wind, data=aq.data)
summary(aq.model)
## 
## Call:
## lm(formula = Ozone ~ Solar.R + Temp + Wind, data = aq.data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -40.485 -14.219  -3.551  10.097  95.619 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -64.34208   23.05472  -2.791  0.00623 ** 
## Solar.R       0.05982    0.02319   2.580  0.01124 *  
## Temp          1.65209    0.25353   6.516 2.42e-09 ***
## Wind         -3.33359    0.65441  -5.094 1.52e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.18 on 107 degrees of freedom
## Multiple R-squared:  0.6059, Adjusted R-squared:  0.5948 
## F-statistic: 54.83 on 3 and 107 DF,  p-value: < 2.2e-16
anova(aq.model)
## Analysis of Variance Table
## 
## Response: Ozone
##            Df Sum Sq Mean Sq F value    Pr(>F)    
## Solar.R     1  14780   14780  32.944 8.946e-08 ***
## Temp        1  47378   47378 105.607 < 2.2e-16 ***
## Wind        1  11642   11642  25.950 1.516e-06 ***
## Residuals 107  48003     449                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
require(car)
Anova(aq.model)
## Anova Table (Type II tests)
## 
## Response: Ozone
##           Sum Sq  Df F value    Pr(>F)    
## Solar.R     2986   1  6.6563   0.01124 *  
## Temp       19050   1 42.4630 2.424e-09 ***
## Wind       11642   1 25.9495 1.516e-06 ***
## Residuals  48003 107                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Recall we compute sums of squares with:

\[SSY = \sum_{i=1}^n (Y_i-\bar{Y})^2\]

\[SSModel = \sum_{i=1}^n (\hat{Y}_i-\bar{Y})^2\] \[SSE = \sum_{i=1}^n (Y_i-\hat{Y}_i)^2\]

The anova function with the lower case a computes what are known as Type I SS, also called “variables-added-in-order” or sequential sums of squares. These are used when the sequence of adding the \(X\) variables into the model is important to us.

Both the summary and Anova functions (upper case A) compute what are known as Type II SS, also called “variables-added-last” or unique sums of squares. These are used when the sequence of adding the \(X\) variables into the model is arbitrary, and our interest instead is in whether a particular \(X\) variable contributes to explaining the variability of \(Y\) above and beyond all of the other \(X\) variables. The summary table presents them as partial \(t\)-statistics, while the Anova table presents them as partial \(F\)-statistics. Notice the \(p\)-values are identical and \(t^2=F\) for a particular \(X\) variable, where the \(t\)-test has \(df=n-k-1\) and the \(F\)-test has \(df=1,n-k-1\).
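
To see where a Type II SS comes from, drop one predictor and compare the reduced model to the full model; the increase in SSE is that variable’s “variables-added-last” sum of squares. For Solar.R:

# Type II SS for Solar.R: the increase in SSE when Solar.R is dropped
no.solar <- lm(Ozone ~ Temp + Wind, data=aq.data)
anova(no.solar, aq.model) # the SS difference and partial F match the Anova() row for Solar.R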

# let's see what happens if I enter the 3 X variables in a different order
aq.mixed <- lm(Ozone ~ Wind + Solar.R + Temp,data=aq.data)
summary(aq.mixed) # unchanged
## 
## Call:
## lm(formula = Ozone ~ Wind + Solar.R + Temp, data = aq.data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -40.485 -14.219  -3.551  10.097  95.619 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -64.34208   23.05472  -2.791  0.00623 ** 
## Wind         -3.33359    0.65441  -5.094 1.52e-06 ***
## Solar.R       0.05982    0.02319   2.580  0.01124 *  
## Temp          1.65209    0.25353   6.516 2.42e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.18 on 107 degrees of freedom
## Multiple R-squared:  0.6059, Adjusted R-squared:  0.5948 
## F-statistic: 54.83 on 3 and 107 DF,  p-value: < 2.2e-16
anova(aq.mixed)  # the Type I SS and partial F tests change with a different sequence
## Analysis of Variance Table
## 
## Response: Ozone
##            Df Sum Sq Mean Sq F value    Pr(>F)    
## Wind        1  45694   45694 101.854 < 2.2e-16 ***
## Solar.R     1   9055    9055  20.184 1.786e-05 ***
## Temp        1  19050   19050  42.463 2.424e-09 ***
## Residuals 107  48003     449                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Anova(aq.mixed) # the type II SS and partial F tests do NOT change
## Anova Table (Type II tests)
## 
## Response: Ozone
##           Sum Sq  Df F value    Pr(>F)    
## Wind       11642   1 25.9495 1.516e-06 ***
## Solar.R     2986   1  6.6563   0.01124 *  
## Temp       19050   1 42.4630 2.424e-09 ***
## Residuals  48003 107                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

3.4 Multiple Regression Extras

This document includes material on:

  1. regression diagnostics

  2. standardized beta coefficients (not in textbook)

  3. effect sizes (not in textbook, information given for t-tests and regression)

3.5 Regression Diagnostics

A residual plot is a graph used to detect violations of regression assumptions such as linearity, normality, and equal variances (homoscedasticity). It is typically formed by graphing the predicted values \(\hat{y}_i\) (i.e. the fitted values) on the \(x\)-axis and the residuals \(e_i\) on the \(y\)-axis.

An improved version of this is to use the studentized residuals \(r_{(-i)}\) (also known as jackknife residuals or rstudent residuals), which have been adjusted to have an approximate \(t\)-distribution with \(n-k-2\) df. These are the residuals obtained by leaving each observation out in turn (this is the “jackknife” principle), refitting the model, and then standardizing the residual from predicting the removed data value. This helps “unmask” outliers. You can get these in base R with the rstudent function.

A “rule of thumb” is to “flag” points for further scrutiny when the absolute value of the studentized residual exceeds 3, i.e. \[|r_{(-i)}| > 3\] There are also elaborate tables available with more exact critical values based on Bonferroni-corrected \(p\)-values of the \(t(n-k-2)\) distribution.
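
For example, applying this rule to the aq.model fit from earlier:

# flag cases whose studentized residual exceeds 3 in absolute value
which(abs(rstudent(aq.model)) > 3)

More generally, each common assumption violation has an associated formal check: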

  1. Heteroscedasticity (i.e. violation of equal variances): Checked with the Breusch-Pagan test or Levene’s test. A small \(p\)-value indicates a violation, as does a funnel or horn-shaped pattern in a residual plot. Deleting outliers (if justifiable) or a transformation can help.

  2. Non-normal residuals: Checked with the Shapiro-Wilk test. A small \(p\)-value indicates a violation, as does a non-linear pattern in the normal probability plot (i.e. qq-plot). Deleting outliers (if justifiable) or a transformation can help.

  3. Autocorrelated residuals: Checked with the Durbin-Watson test. A small \(p\)-value indicates a violation. This often happens when the predictor variable is time, and methods studied in time series analysis can help.

  4. Multicollinearity: Checked with the largest VIF (variance inflation factor). \[VIF_j=\frac{1}{1-R_j^2}\] where \(R_j^2\) is the \(R^2\) value that would be obtained from a regression with variable \(X_j\) as the response and all of the other \(X\) variables as the predictors. A \(VIF>10\) indicates multicollinearity (i.e. predictors that are highly correlated with each other). This can result in inflated standard errors (which will distort hypothesis tests, \(p\)-values, and confidence intervals) and inaccurate parameter estimates, including the “wrong signs” problem.

  5. Outliers: Checked with Cook’s Distance (a measure of the influence of a point, briefly mentioned on page 173 of the text). We’ll use the “rule of thumb” where Cook’s \(D>0.5\) indicates an outlier. A code sketch demonstrating all five checks follows this list.
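
Here is a minimal sketch of these five checks applied to aq.model. I am assuming the lmtest package is installed for the Breusch-Pagan test (the ncvTest function in car is an alternative); the other functions come from car or base R.

require(lmtest) # for bptest
require(car)    # for durbinWatsonTest and vif

bptest(aq.model)                      # Breusch-Pagan: small p-value suggests heteroscedasticity
shapiro.test(residuals(aq.model))     # Shapiro-Wilk: small p-value suggests non-normal residuals
durbinWatsonTest(aq.model)            # Durbin-Watson: small p-value suggests autocorrelation
vif(aq.model)                         # VIF > 10 suggests multicollinearity
which(cooks.distance(aq.model) > 0.5) # Cook's D > 0.5 flags influential points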

3.6 Standardized Beta Coefficients

Some disciplines (particularly in the social sciences) want to standardize the regression coefficients (i.e. the betas). This is done especially when the various \(X\) variables are measured on very different scales, making it difficult to judge the various slopes solely on their magnitude.

The idea is that a one standard deviation increase in an \(X\) variable will have an impact of a certain number of standard deviations on the \(Y\) variable, given by the standardized beta coefficient, which I will call \(\hat{\beta}_j^*\).

It can be computed as: \[\hat{\beta}_j^* = \hat{\beta}_j \frac{s_{X_j}}{s_Y}\] where \(s_{X_j}\) and \(s_Y\) are the sample standard deviations of \(X_j\) and \(Y\).
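
For instance, we can hand-compute the standardized beta for Age in the Accord model from Section 3.2:

# standardized beta for Age: the slope times sd(X)/sd(Y)
coef(honda.model)["Age"] * sd(AccordPrice$Age) / sd(AccordPrice$Price)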

Note that standardizing will NOT change the test statistics or \(p\)-values, whether they come from the Type I (sequential) tests given by the anova command or the Type II (unique) tests given by summary or Anova.

Many people think standardized betas are a bit silly and not worth the trouble, while others think they are absolutely necessary in any multiple regression analysis.

Below, I demonstrate getting the standardized betas in two different ways:

  1. standardizing all of the variables (i.e. centering by subtracting the mean and then dividing by the standard deviation) manually, except for categorical predictors

  2. using a function lm.beta from the lm.beta package to do this for you

The noted statistician Andrew Gelman believes that one should divide by 2 standard deviations, rather than 1, when standardizing, in order to make the standardized beta coefficients of the continuous predictors comparable with the beta coefficients for binary (0/1) categorical predictors.

Read more about that here: stat.columbia.edu/~gelman/research/published/standardizing7.pdf

3.7 Effect Sizes

Hypothesis testing has been controversial for decades. Back when I was in graduate school, a book called “What If There Were No Significance Tests?” was published, and similar papers were written before I was born. Recently, a journal in social psychology banned \(p\)-values from manuscripts submitted for publication there, and the American Statistical Association released a statement urging us to move beyond just looking at whether \(p < 0.05\).

https://amstat.tandfonline.com/doi/full/10.1080/00031305.2016.1154108

https://www.tandfonline.com/doi/full/10.1080/00031305.2019.1583913

In the social sciences, effect sizes have been popular for years as a measure of the size of an effect (the strength of a relationship), used to either accompany or replace a probabilistic measure such as a \(p\)-value.

I’ll first demonstrate the idea of the effect size using a \(t\)-test, where the most common effect size measure is Cohen’s \(d\) (Jacob Cohen was a well-known quantitative psychologist).

\[d=\frac{\bar{x}_1-\bar{x}_2}{s_{pooled}}\]

This is just the standardized difference between the means. A rule of thumb is to classify this as a “small effect” at \(d=0.2\), a “medium effect” at \(d=0.5\), and a “large effect” at \(d=0.8\). Similar measures such as Glass’s \(\Delta\) and Hedges’s \(g\) exist.
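
As a quick illustration (using the built-in mtcars data rather than our examples), the effectsize package, which we load later in this chapter, can compute Cohen’s \(d\) directly:

require(effectsize)
# Cohen's d for mpg between automatic and manual transmissions
mtcars2 <- mtcars
mtcars2$am <- factor(mtcars2$am) # cohens_d wants a two-level grouping variable
cohens_d(mpg ~ am, data = mtcars2)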

In multiple regression and ANOVA models, there are several choices. We could just use the \(R^2\) or adjusted \(R^2\), or other variations including Cohen’s \(f^2\), \(\eta^2\), or \(\omega^2\). \(R^2\) is a measure for the entire model, while the other statistics can be computed for a particular predictor variable \(X\). When used for a single predictor, \(f^2\) is based on the squared partial correlation between \(X\) and \(Y\), controlling for the other \(X\) variables.

\[f^2=\frac{r^2}{1-r^2}\]

\[R^2=\frac{SSModel}{SSTotal}\]

\[\eta^2=\frac{SSX}{SSTotal}\]
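
For example, sequential (Type I) \(\eta^2\) values can be read off an anova table by dividing each predictor’s SS by the total SS. A sketch using the aq.model fit:

# eta^2 for each row of the sequential ANOVA table
aov.tab <- anova(aq.model)
aov.tab[["Sum Sq"]] / sum(aov.tab[["Sum Sq"]]) # the last value is SSE/SSTotal, i.e. 1 - R^2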

For convenience, here are all conventional effect sizes for all tests in the pwr package:

Test                          small   medium   large
tests for proportions (p)     0.2     0.5      0.8
tests for means (t)           0.2     0.5      0.8
chi-square tests (chisq)      0.1     0.3      0.5
correlation test (r)          0.1     0.3      0.5
anova (anov)                  0.1     0.25     0.4
general linear model (f2)     0.02    0.15     0.35

http://psychology.okstate.edu/faculty/jgrice/psyc3214/Ferguson_EffectSizes_2009.pdf (Christopher J. Ferguson’s article on effect sizes)

3.8 Home Prices Data

The main data set here is GrinnellHouses from the Stat2Data package. It has data on \(n=929\) homes and 15 variables.

The pertinent variables are:

SalePrice: Sale price of the house (dollars)

LotSize: Lot size (in acres)

SquareFeet: The square footage of the home’s living space.

Bedrooms: Number of bedrooms

Baths: Number of bathrooms

I created a binary version of YearBuilt called Old, based on being built before 1970.

data(GrinnellHouses)

Houses <- GrinnellHouses %>%
  dplyr::select(SalePrice,ListPrice,SPLPPct,SquareFeet,LotSize,Bedrooms,Baths,YearBuilt) %>%
  mutate(Old=(YearBuilt<1970)) %>%
  na.omit() %>%
  as_tibble()

print(Houses,width=Inf)
## # A tibble: 732 × 9
##    SalePrice ListPrice SPLPPct SquareFeet LotSize Bedrooms Baths YearBuilt Old  
##        <int>     <int>   <dbl>      <int>   <dbl>    <int> <dbl>     <int> <lgl>
##  1     27000     35000    77.1       1224   0.172        3  1         1900 TRUE 
##  2     30750     35000    87.9       1277   0.207        3  1         1900 TRUE 
##  3     42000     45900    91.5       1079   0.199        3  1         1900 TRUE 
##  4     50000     52500    95.2        912   0.218        3  2         1900 TRUE 
##  5     50000     50000   100         1488   0.17         3  2         1900 TRUE 
##  6     54000     71500    75.5       2160   0.313        4  1.75      1880 TRUE 
##  7     54000     58900    91.7       1034   0.284        3  1.5       1900 TRUE 
##  8     59000     62500    94.4       1008   0.284        2  1         1900 TRUE 
##  9     60000     63000    95.2       1440   0.258        4  0.75      1954 TRUE 
## 10     60500     69000    87.7       2105   0.230        4  2.75      1900 TRUE 
## # ℹ 722 more rows
cor(Houses)
##             SalePrice  ListPrice     SPLPPct  SquareFeet     LotSize
## SalePrice   1.0000000  0.9946302  0.20507667  0.65240470  0.31779326
## ListPrice   0.9946302  1.0000000  0.12653092  0.67015911  0.34455439
## SPLPPct     0.2050767  0.1265309  1.00000000 -0.03760488 -0.06842539
## SquareFeet  0.6524047  0.6701591 -0.03760488  1.00000000  0.17911246
## LotSize     0.3177933  0.3445544 -0.06842539  0.17911246  1.00000000
## Bedrooms    0.4197103  0.4226686  0.04399417  0.53986039  0.08544781
## Baths       0.7030635  0.7135537  0.03446345  0.64260447  0.22365336
## YearBuilt   0.5010841  0.4925000  0.15088770  0.02393692  0.08690342
## Old        -0.4616856 -0.4549292 -0.11609707 -0.14147749 -0.11865833
##               Bedrooms       Baths   YearBuilt        Old
## SalePrice   0.41971034  0.70306353  0.50108412 -0.4616856
## ListPrice   0.42266861  0.71355374  0.49250000 -0.4549292
## SPLPPct     0.04399417  0.03446345  0.15088770 -0.1160971
## SquareFeet  0.53986039  0.64260447  0.02393692 -0.1414775
## LotSize     0.08544781  0.22365336  0.08690342 -0.1186583
## Bedrooms    1.00000000  0.50794001  0.03367967 -0.1243224
## Baths       0.50794001  1.00000000  0.37305341 -0.4151798
## YearBuilt   0.03367967  0.37305341  1.00000000 -0.7259687
## Old        -0.12432241 -0.41517976 -0.72596869  1.0000000
favstats(SalePrice~Old,data=Houses) #from mosaic
##     Old   min     Q1 median     Q3    max     mean       sd   n missing
## 1 FALSE 27100 125000 170000 245000 606000 193827.1 96712.80 205       0
## 2  TRUE 10900  75700 108000 137950 452000 112355.5 56987.11 527       0
t.test(SalePrice~Old,data=Houses,var.equal=TRUE)
## 
##  Two Sample t-test
## 
## data:  SalePrice by Old
## t = 14.063, df = 730, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group FALSE and group TRUE is not equal to 0
## 95 percent confidence interval:
##  70097.60 92845.55
## sample estimates:
## mean in group FALSE  mean in group TRUE 
##            193827.1            112355.5
model0 <- lm(SalePrice~Old,data=Houses)
summary(model0)
## 
## Call:
## lm(formula = SalePrice ~ Old, data = Houses)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -166727  -43980  -10591   29770  412173 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   193827       4916   39.43   <2e-16 ***
## OldTRUE       -81472       5794  -14.06   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 70380 on 730 degrees of freedom
## Multiple R-squared:  0.2132, Adjusted R-squared:  0.2121 
## F-statistic: 197.8 on 1 and 730 DF,  p-value: < 2.2e-16
anova(model0)
## Analysis of Variance Table
## 
## Response: SalePrice
##            Df     Sum Sq    Mean Sq F value    Pr(>F)    
## Old         1 9.7964e+11 9.7964e+11  197.75 < 2.2e-16 ***
## Residuals 730 3.6163e+12 4.9538e+09                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
require(effectsize)
rsquared(model0) # small .1 medium .25 large .4
## [1] 0.2131536
effectsize::cohens_f(model0) # small .02 medium .15 large .35
## # Effect Size for ANOVA
## 
## Parameter | Cohen's f |      95% CI
## -----------------------------------
## Old       |      0.52 | [0.46, Inf]
## 
## - One-sided CIs: upper bound fixed at [Inf].
rsquared(model0)/(1-rsquared(model0)) # this is f^2; its square root (0.52) matches Cohen's f above
## [1] 0.270896
effectsize::eta_squared(model0) #  small .1 medium .25 large .4
## # Effect Size for ANOVA
## 
## Parameter | Eta2 |       95% CI
## -------------------------------
## Old       | 0.21 | [0.17, 1.00]
## 
## - One-sided CIs: upper bound fixed at [1.00].
# try with different variable besides SalePrice

model1 <- lm(SPLPPct ~ SquareFeet + LotSize + Bedrooms + Baths + factor(Old) ,
             data=Houses)
summary(model1)
## 
## Call:
## lm(formula = SPLPPct ~ SquareFeet + LotSize + Bedrooms + Baths + 
##     factor(Old), data = Houses)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -43.404  -2.388   1.034   3.855  17.341 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     94.3463877  1.0863257  86.849  < 2e-16 ***
## SquareFeet      -0.0009551  0.0004932  -1.937  0.05318 .  
## LotSize         -0.1826619  0.0888763  -2.055  0.04021 *  
## Bedrooms         0.5320880  0.3191707   1.667  0.09593 .  
## Baths            0.2446987  0.4758423   0.514  0.60724    
## factor(Old)TRUE -1.7019116  0.5901119  -2.884  0.00404 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.402 on 726 degrees of freedom
## Multiple R-squared:  0.02703,    Adjusted R-squared:  0.02033 
## F-statistic: 4.034 on 5 and 726 DF,  p-value: 0.001284
vif(model1) 
##  SquareFeet     LotSize    Bedrooms       Baths factor(Old) 
##    1.956708    1.059185    1.513656    2.256853    1.254172
# standardize data 

require(lm.beta)
options(scipen=9)
summary(lm.beta(model1))
## 
## Call:
## lm(formula = SPLPPct ~ SquareFeet + LotSize + Bedrooms + Baths + 
##     factor(Old), data = Houses)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -43.404  -2.388   1.034   3.855  17.341 
## 
## Coefficients:
##                   Estimate Standardized Std. Error t value Pr(>|t|)    
## (Intercept)     94.3463877           NA  1.0863257  86.849  < 2e-16 ***
## SquareFeet      -0.0009551   -0.0991731  0.0004932  -1.937  0.05318 .  
## LotSize         -0.1826619   -0.0774334  0.0888763  -2.055  0.04021 *  
## Bedrooms         0.5320880    0.0750853  0.3191707   1.667  0.09593 .  
## Baths            0.2446987    0.0282814  0.4758423   0.514  0.60724    
## factor(Old)TRUE -1.7019116   -0.1182393  0.5901119  -2.884  0.00404 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.402 on 726 degrees of freedom
## Multiple R-squared:  0.02703,    Adjusted R-squared:  0.02033 
## F-statistic: 4.034 on 5 and 726 DF,  p-value: 0.001284
# hand-check using x = SquareFeet, y = SPLPPct:
# the unstandardized coefficient for SquareFeet is beta.hat = -0.0009551
# beta.star = beta.hat * sd(Houses$SquareFeet) / sd(Houses$SPLPPct)
# this reproduces the -0.0991731 shown in the Standardized column above

# effect sizes
rsquared(model1) 
## [1] 0.02703303
anova(model1)
## Analysis of Variance Table
## 
## Response: SPLPPct
##              Df  Sum Sq Mean Sq F value   Pr(>F)   
## SquareFeet    1    43.2   43.24  1.0552 0.304659   
## LotSize       1   120.2  120.22  2.9338 0.087171 . 
## Bedrooms      1   174.5  174.48  4.2577 0.039429 * 
## Baths         1   147.8  147.81  3.6069 0.057936 . 
## factor(Old)   1   340.9  340.86  8.3177 0.004042 **
## Residuals   726 29751.0   40.98                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
effectsize::cohens_f(model1) 
## # Effect Size for ANOVA (Type I)
## 
## Parameter   | Cohen's f (partial) |      95% CI
## -----------------------------------------------
## SquareFeet  |                0.04 | [0.00, Inf]
## LotSize     |                0.06 | [0.00, Inf]
## Bedrooms    |                0.08 | [0.01, Inf]
## Baths       |                0.07 | [0.00, Inf]
## factor(Old) |                0.11 | [0.05, Inf]
## 
## - One-sided CIs: upper bound fixed at [Inf].
effectsize::eta_squared(model1) 
## # Effect Size for ANOVA (Type I)
## 
## Parameter   | Eta2 (partial) |       95% CI
## -------------------------------------------
## SquareFeet  |       1.45e-03 | [0.00, 1.00]
## LotSize     |       4.02e-03 | [0.00, 1.00]
## Bedrooms    |       5.83e-03 | [0.00, 1.00]
## Baths       |       4.94e-03 | [0.00, 1.00]
## factor(Old) |           0.01 | [0.00, 1.00]
## 
## - One-sided CIs: upper bound fixed at [1.00].
require(sjPlot)

plot_model(model1) # this is the simple forest plot

plot_model(model1,type="std") # same with standardized betas

plot_model(model1,type="std2") # same but dividing by 2 sds as per Gelman

# residual plot of fitted vs studentized residuals
require(broom)
augment(model1) %>%
  ggplot(aes(x=.fitted,y=rstudent(model1))) + geom_point(col="blue") +
  geom_hline(yintercept=c(-3,3),color="red",linetype="dashed")