5.2 OLS method in RStudio

  • Instead of performing matrix calculations by hand, it is easier to use the lm() command, which handles a wide range of linear model specifications and returns OLS estimates by default.

  • The lm() command has two main arguments: (1) formula and (2) data

  • The formula argument of the lm() command refers to the functional specification of the model and can be written in different ways

TABLE 5.5: Formula specifications in RStudio
Formula                Description
y~x                    \(y\) is regressed on \(x\)
y~x+z                  \(y\) is regressed on \(x\) and \(z\)
y~x+z+x:z              \(y\) is regressed on \(x\), \(z\) and the interaction term (\(x \cdot z\))
y~0+z                  \(y\) is regressed on \(z\) without constant term
y~1                    \(y\) is regressed on constant term only
y~log(x)               \(y\) is regressed on \(\log(x)\)
scale(y)~0+scale(x)    standardized regression without constant term
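
The formula variants in Table 5.5 can be tried on a small simulated data set. The sketch below is only an illustration: the data frame d and the variables x, z and y are assumptions made for this example, not data used elsewhere in the text. It also shows that y~x*z is a shorthand for y~x+z+x:z.

```r
# Simulated data; the names d, x, z and y are illustrative assumptions
set.seed(1)
d <- data.frame(x = runif(50, 1, 10), z = rnorm(50))
d$y <- 2 + 0.5 * d$x - 1.5 * d$z + rnorm(50)

m_main  <- lm(y ~ x + z, data = d)        # constant term plus two regressors
m_inter <- lm(y ~ x + z + x:z, data = d)  # adds the interaction term
m_star  <- lm(y ~ x * z, data = d)        # shorthand for x + z + x:z
m_noint <- lm(y ~ 0 + z, data = d)        # no constant term
m_const <- lm(y ~ 1, data = d)            # constant term only
m_log   <- lm(y ~ log(x), data = d)       # regressor transformed inside the formula

names(coef(m_main))    # "(Intercept)" "x" "z"
names(coef(m_noint))   # "z"
coef(m_const)          # the single coefficient equals mean(d$y)
```

With the constant-only specification the estimated coefficient is simply the sample mean of the dependent variable, which is a quick way to check that a formula does what is intended.
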

  • The second argument, data, specifies the data frame (an object previously saved within RStudio) that contains the variables used in the formula

  • The data argument can be omitted when the variables exist as standalone objects in the global workspace environment (i.e. they are not columns of any data frame)

  • Various results can be extracted from an estimated model by applying further commands to the stored model object

TABLE 5.6: Extracted results from linear model in RStudio
Command                      Description
coef(model)                  regression coefficients (estimated parameters)
confint(model,level=0.95)    confidence intervals of coefficients
fitted(model)                fitted (expected) values
resid(model)                 residuals
vcov(model)                  covariance matrix of coefficients
summary(model)               basic summary of the model
anova(model)                 analysis of variance (ANOVA)
AIC(model)                   Akaike information criterion
BIC(model)                   Bayesian information criterion
abline(model)                regression line on the scatter plot
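
As a minimal sketch of how the commands in Table 5.6 are applied, the example below fits a simple regression on simulated vectors and extracts several results from the stored object. The variables x and y are assumptions created only for this illustration; because they are standalone objects in the global environment, the data argument is omitted.

```r
# Simulated vectors; x and y are illustrative assumptions
set.seed(2)
x <- rnorm(40)
y <- 1 + 2 * x + rnorm(40)

model <- lm(y ~ x)   # data omitted: x and y are found in the global environment

coef(model)                     # estimated constant term and slope
confint(model, level = 0.95)    # 95% confidence intervals, one row per coefficient
vcov(model)                     # covariance matrix of the estimates
AIC(model); BIC(model)          # information criteria

# Fitted values plus residuals reproduce the dependent variable
all.equal(unname(fitted(model) + resid(model)), y)

# Standard errors are the square roots of the diagonal of vcov(model)
se <- sqrt(diag(vcov(model)))
se
```

After drawing a scatter plot with plot(x, y), calling abline(model) adds the fitted regression line to it.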

Exercise 23. Using the sample data in the newdata object (loaded earlier from the text file eu_countries.txt), estimate three models: (a) lin-lin, (b) log-log, and (c) second-order polynomial. Take gdp as the dependent variable \(y\) and population as the independent variable \(x\). Summarize the results of the three estimated models in a single table using the modelsummary() command.

library(modelsummary) # Loading the package that provides modelsummary() (it must be installed beforehand)
model1=lm(gdp~population,data=newdata) # Estimation of the lin-lin model as object "model1"
model2=lm(log(gdp)~log(population),data=newdata) # Estimation of the log-log model as object "model2"
model3=lm(gdp~population+I(population^2),data=newdata) # Estimation of the polynomial model as object "model3"
modelsummary(list(model1,model2,model3),stars=TRUE,fmt=4) # Summarizing results from the three models in a single table
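
The effect of wrapping the squared term in I() can be checked directly. The following sketch uses simulated data (the data frame d and the variables x and y are assumptions for illustration only) and compares a formula with and without I().

```r
# Simulated data; d, x and y are illustrative assumptions
set.seed(3)
d <- data.frame(x = runif(30, 0, 5))
d$y <- 1 + 2 * d$x - 0.3 * d$x^2 + rnorm(30, sd = 0.2)

m_plain <- lm(y ~ x + x^2, data = d)     # ^ acts as a formula operator: x^2 collapses to x
m_quad  <- lm(y ~ x + I(x^2), data = d)  # I() forces arithmetic squaring

names(coef(m_plain))   # "(Intercept)" "x"
names(coef(m_quad))    # "(Intercept)" "x" "I(x^2)"
```

Without I() the quadratic term is silently dropped and no error is raised, so it is worth inspecting the coefficient names after estimating a polynomial model.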

  • Note that the I() function indicates that the operation inside it should be treated “as-is”: within a formula the symbol ^ is otherwise read as a formula operator, so I(population^2) ensures that population^2 is explicitly interpreted as a quadratic term.

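  • By a common convention, an estimated parameter is marked with one star (\(*\)) if it is statistically significant at the 10% level (\(p<0.1\)), two stars (\(**\)) if it is statistically significant at the 5% level (\(p<0.05\)), or three stars (\(***\)) if it is statistically significant at the 1% level (\(p<0.01\)). Note that modelsummary() with stars=TRUE applies its own default thresholds (+ for \(p<0.1\), \(*\) for \(p<0.05\), \(**\) for \(p<0.01\), \(***\) for \(p<0.001\)), so the legend printed under the table should always be checked; a named vector such as stars=c("*"=0.1,"**"=0.05,"***"=0.01) reproduces the convention described here.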

  • Estimates without stars are not statistically significant at the 10% level against a two-sided alternative (this will be discussed in Section 6.1)
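
The star convention described above can be reproduced by hand from the two-sided p-values stored in the model summary. The sketch below is an illustration under stated assumptions: the data are simulated, and add_stars() is a hypothetical helper written for this example, not a function from base R or from the modelsummary package.

```r
# Simulated data; d, x, z and y are illustrative assumptions
set.seed(4)
d <- data.frame(x = rnorm(60), z = rnorm(60))
d$y <- 1 + 0.8 * d$x + rnorm(60)

model <- lm(y ~ x + z, data = d)
p <- summary(model)$coefficients[, "Pr(>|t|)"]   # two-sided p-values of the t-tests

# Hypothetical helper: maps p-values to stars using the 10%, 5% and 1% levels
add_stars <- function(p) {
  ifelse(p < 0.01, "***",
         ifelse(p < 0.05, "**",
                ifelse(p < 0.1, "*", "")))
}

data.frame(estimate = coef(model), p_value = round(p, 4), stars = add_stars(p))
```

Each p-value is compared with the thresholds from the strictest to the loosest, so a coefficient receives the largest number of stars it qualifies for.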