5.2 OLS method in RStudio
Instead of performing matrix calculations, it is easier to use the `lm()` command, which handles a wide range of linear model specifications and provides OLS estimates by default.
The `lm()` command has two main arguments: (1) `formula` and (2) `data`.
The `formula` argument gives the functional specification of the model and can be written in different ways:
Formula | Description |
---|---|
`y~x` | \(y\) is regressed on \(x\) |
`y~x+z` | \(y\) is regressed on \(x\) and \(z\) |
`y~x+z+x:z` | \(y\) is regressed on \(x\), \(z\) and interaction term (\(x \cdot z\)) |
`y~0+z` | \(y\) is regressed on \(z\) without constant term |
`y~1` | \(y\) is regressed on constant term only |
`y~log(x)` | \(y\) is regressed on \(\log(x)\) |
`scale(y)~0+scale(x)` | standardized regression without constant term |
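The formula variants in the table can be tried on a small simulated data set. This is only a sketch: the data frame `d`, the model names, and the numbers used to generate the data are assumptions made for illustration, not part of the exercise data.

```r
# Simulated data; d and all numeric values are hypothetical
set.seed(1)
n <- 100
x <- runif(n, 1, 10)
z <- rnorm(n)
y <- 2 + 0.5 * x - 1.5 * z + 0.3 * x * z + rnorm(n)
d <- data.frame(y, x, z)

m_int  <- lm(y ~ x + z + x:z, data = d)          # main effects plus interaction
m_star <- lm(y ~ x * z, data = d)                # shorthand for the same model
m_noc  <- lm(y ~ 0 + z, data = d)                # no constant term
m_con  <- lm(y ~ 1, data = d)                    # constant term only
m_std  <- lm(scale(y) ~ 0 + scale(x), data = d)  # standardized regression

names(coef(m_int))                          # "(Intercept)" "x" "z" "x:z"
names(coef(m_noc))                          # "z"
all.equal(coef(m_int), coef(m_star))        # TRUE
all.equal(unname(coef(m_con)), mean(d$y))   # TRUE: constant-only fit is the mean
all.equal(unname(coef(m_std)), cor(d$x, d$y))  # TRUE: slope equals the correlation
```

The last two checks illustrate what the special cases estimate: a constant-only model returns the sample mean of \(y\), and a standardized regression with one regressor returns the correlation coefficient.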
The second argument, `data`, specifies the data frame (an object previously saved within RStudio) that contains the variables used in the formula.
The `data` argument can sometimes be omitted, namely when the variables exist in the global workspace environment (i.e. they are not part of any data frame).
Various information can be extracted from an estimated model by applying further commands to the existing model object:
Command | Description |
---|---|
`coef(model)` | regression coefficients (estimated parameters) |
`confint(model,level=0.95)` | confidence intervals of coefficients |
`fitted(model)` | fitted (expected) values |
`resid(model)` | residuals |
`vcov(model)` | covariance matrix of coefficients |
`summary(model)` | basic summary of the model |
`anova(model)` | analysis of variance (ANOVA) |
`AIC(model)` | Akaike information criterion |
`BIC(model)` | Bayesian information criterion |
`abline(model)` | regression line on the scatter plot |
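As a rough illustration of these extractor commands, the sketch below fits a simple model on simulated data and compares the `lm()` output with the matrix formula \(b=(X'X)^{-1}X'y\). The data frame `d` and the object names are assumptions introduced only for this example.

```r
# Simulated data; d and the true parameter values are hypothetical
set.seed(42)
d <- data.frame(x = runif(50, 0, 10))
d$y <- 1 + 2 * d$x + rnorm(50)

model <- lm(y ~ x, data = d)

b  <- coef(model)                   # estimated parameters
ci <- confint(model, level = 0.95)  # 95% confidence intervals
V  <- vcov(model)                   # covariance matrix of the estimates

# OLS by matrix calculation: solve (X'X) b = X'y
X <- model.matrix(model)
b_manual <- solve(t(X) %*% X, t(X) %*% d$y)
all.equal(unname(b), as.vector(b_manual))             # TRUE

# Fitted values plus residuals reproduce the observed y
all.equal(unname(fitted(model) + resid(model)), d$y)  # TRUE

# Standard errors are the square roots of the diagonal of vcov(model)
se <- sqrt(diag(V))
all.equal(se, summary(model)$coefficients[, "Std. Error"])  # TRUE
```

Note that `abline(model)` only adds a line to an existing plot, so a scatter plot such as `plot(d$x, d$y)` has to be drawn first.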
Exercise 23. Using the sample data from the `newdata` object (the text file `eu_countries.txt` is already loaded), estimate three models: (a) lin-lin, (b) log-log, and (c) second-order polynomial. Assume that `gdp` is the dependent variable \(y\) and `population` is the independent variable \(x\). Summarize the results of the three estimated models in a single table using the `modelsummary()` command.
library(modelsummary) # Loading the package that provides modelsummary() (assumed to be installed)
model1=lm(gdp~population,data=newdata) # Estimation of lin-lin model as an object "model1"
model2=lm(log(gdp)~log(population),data=newdata) # Estimation of log-log model as an object "model2"
model3=lm(gdp~population+I(population^2),data=newdata) # Estimation of polynomial model as an object "model3"
modelsummary(list(model1,model2,model3),stars=TRUE,fmt=4) # Summarizing results from three models in a single table
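Note that the `I()` function indicates that the operations inside it should be treated “as-is”. Within a formula, `^` denotes crossing of terms rather than exponentiation, so `I()` ensures that `population^2` is explicitly interpreted as a quadratic term.
It is commonly accepted that an estimated parameter is marked with one star (\(*\)) if it is statistically significant at the 10% level (\(p<0.1\)), two stars (\(**\)) if it is statistically significant at the 5% level (\(p<0.05\)), or three stars (\(***\)) if it is statistically significant at the 1% level (\(p<0.01\)). With `stars=TRUE`, `modelsummary()` applies its own default thresholds, which are listed in the note below the table and may differ from this convention; a named vector such as `stars=c("*"=0.1,"**"=0.05,"***"=0.01)` sets them explicitly.
Estimates without stars are not statistically significant at any of these levels against a two-sided alternative (this will be discussed in section 6.1).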
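The role of `I()` and the star convention described above can be checked with a short sketch. The data are simulated, and the helper `stars()` is a hypothetical function written for this illustration; it is not part of `modelsummary` or base R.

```r
# Simulated stand-in for the exercise data; all values are hypothetical
set.seed(7)
d <- data.frame(population = runif(40, 1, 80))
d$gdp <- 5 + 3 * d$population + 0.1 * d$population^2 + rnorm(40, sd = 20)

no_I   <- lm(gdp ~ population + population^2, data = d)     # ^ read as formula operator
with_I <- lm(gdp ~ population + I(population^2), data = d)  # ^ read as arithmetic

names(coef(no_I))    # "(Intercept)" "population"
names(coef(with_I))  # "(Intercept)" "population" "I(population^2)"

# Hypothetical helper: stars under the 10% / 5% / 1% convention
stars <- function(p) {
  # findInterval() counts how many thresholds are less than or equal to p
  c("***", "**", "*", "")[findInterval(p, c(0.01, 0.05, 0.1)) + 1]
}
stars(c(0.003, 0.03, 0.08, 0.4))  # "***" "**" "*" ""

# Attach stars to the two-sided p-values of the polynomial model
p <- summary(with_I)$coefficients[, "Pr(>|t|)"]
data.frame(estimate = coef(with_I), p.value = p, stars = stars(p))
```

Without `I()`, the quadratic term silently drops out of the model, which is why the first fit has only two coefficients.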