6.5 Homework assignment I
From the World Bank Database load the cross-sectional data for selected \(20\) countries in year \(2021\) directly into RStudio with respect to \(3\) variables:
a) government expenditure on education (\(\%\) of GDP),
b) expenditure on health per capita (USD) and
c) poverty headcount ratio at national level (\(\%\) of population)
For that purpose use command
a) government expenditure on education (\(\%\) of GDP),
b) expenditure on health per capita (USD) and
c) poverty headcount ratio at national level (\(\%\) of population)
For that purpose use command
wb_data()
from the package wbstats
. Codes of variables and countries can be found on the web page World Development Indicators by checking metadata glossary.
Solution
Copy the code lines below to the clipboard, paste them into an R Script file opened in RStudio, and run them. The object namedmydataset
is a data frame where rows represent countries, and columns represent variables (cross-sectional data should always be organized in a wide format). In this example, data preparation involves the following steps: naming the rows, removing some redundant columns, renaming columns for easier usage later, checking the structure of the data and finally checking if data are complete (no missing values).
install.packages("wbstats") # installing the package for the first time
library(wbstats) # loading installed package "wbstats" which supports wb_data() command
=wb_data(indicator=c("SE.XPD.TOTL.GD.ZS","SH.XPD.CHEX.PC.CD","SI.POV.NAHC"), # variable codes
mydatasetcountry=c("AUT","BEL","BGR","CZE","HRV","DEU","EST","FRA","GRC","HUN","ITA","POL","PRT","ROU","SVK","SVN","ESP","TUR","CYP","SRB"), # country codes
start_date=2021, # annual data in one year are considered
end_date=2021, # the same as start date due to cross-sectional data
return_wide=TRUE) # if FALSE long format is returned which is not appropriate)
# Formating the object "mydataset" as data frame
=as.data.frame(mydataset)
mydataset# Naming the rows of a data frame by country names
row.names(mydataset)=c(mydataset$country)
# Removing the first 4 columns of a data frame as unnecessary
=mydataset[,-c(1,2,3,4)]
mydataset# Renaming the columns of data frame as "education", "health" and "poverty" (keep in mind ordering of the variables)
colnames(mydataset)=c("education","health","poverty")
# Checking the structure of object "mydataset"
str(mydataset)
# Checking if there are missing values within object "mydataset"
sum(is.na(mydataset)) # if sum returns zero data are complete
\(~~~\)
Provide basic descriptive statistics of all variables in a single table as well as correlation matrix by commands
datasummary()
and datasummary_correlation()
. Start econometric analysis with estimating a bivariate linear model, named first
, to find if education
reduces poverty
in given countries. Visualize a linear dependence by plot()
command to produce a scatter plot. Print the results of estimated model first
by coeftest()
command from the package lmtest
.
Solution
Copy the code lines below to the clipboard, paste them into an R Script file opened in RStudio, and run them.library(modelsummary) # loading already installed package from the library
datasummary(education+health+poverty~min+max+mean+sd+N,data=mydataset) # basic descriptive statistics
datasummary_correlation(mydataset,fmt=4) # reporting correlations (lower triangular part of the symmetric matrix)
cor(mydataset) # alternatively, correlation matrix can be printed in the console window
=lm(poverty~education,data=mydataset) # estimating the "first" model by lm() command
first
plot(mydataset$education,mydataset$poverty, # first variable is always displayed on x-axis, and second one on y-axis
xlab = "Expenditure on education", # labeling x-axis
ylab = "Poverty headcount ratio", # labeling y-axis
main = "Scatter plot", # title of the plot
col = "blue", # points color
pch = 19) # points shape
abline(first,col="red") # adding a bivariate linear model (colored in red) to the same plot
library(lmtest) # loading already installed package
coeftest(first) # printing the results of object "first" by coeftest() command
\(~~~\)
Estimate the
second
model which includes the health
as additional RHS variable. Estimate the third
model which considers the log transformation of the variable health
. Use coeftest()
command to check results of estimated models or use modelsummary()
command to compare the results of all models in a single table.
Solution
Copy the code lines below to the clipboard, paste them into an R Script file opened in RStudio, and run them. In the given example, the additional variablehealth
does not improve the first
model or its transformation, although the coefficient with respect to education
retains expected negative sign. Moreover, the coefficients in the second
and third
models are not statistically significant. One of the reason for this could be that education
and health
are highly correlated RHS variables, indicating a serious multicollinearity problem.
=lm(poverty~education+health,data=mydataset)
secondcoeftest(second)
=lm(poverty~education+log(health),data=mydataset)
thirdcoeftest(third)
modelsummary(list("First model"=first,"Second model"=second,"Third model"=third),stars=TRUE,fmt=3)
\(~~~\)
Using
vif()
command, from the car
package, check if multicollinearity problem really exist in both models (second
and third
). Perform diagnostic checking of all three models and conclude which model(s) pass all diagnostic tests (run BP test, BG test of order \(1\) and order \(2\), and JB test for each model separately).
Solution
Copy the code lines below to the clipboard, paste them into an R Script file opened in RStudio, and run them. From diagnostic checking you can conclude that thefirst
model passes all diagnostic tests, indicating that all assumptions considering error terms (residuals) are fulfilled in the first
model, i.e. error terms are independently and normally distributed with zero mean and constant variance. Rejection of the null hypothesis of BP, BG and JB test supports this conclusion. Although the first
model demonstrates the negative impact of education on poverty and passes diagnostic tests, it is a poor fit. This is most likely due to the selection of heterogeneous countries in the sample (the scatter plot shows that some countries are very far from the linear model). In such cases, it is advisable to increase the sample of countries or select countries that are more similar (homogeneous).
library(car)
vif(second)
vif(third)
# Diagnostic checking of all three models
# Running BP test for each model
bptest(first,studentize=FALSE)
bptest(second,studentize=FALSE)
bptest(third,studentize=FALSE)
# Running BG test for each model (with order 1 and order 2)
bgtest(first,order=1)
bgtest(first,order=2)
bgtest(second,order=1)
bgtest(second,order=2)
bgtest(third,order=1)
bgtest(third,order=2)
# Running JB test for each model (requires loading "tseries" package from the library)
library(tseries)
jarque.bera.test(resid(first))
jarque.bera.test(resid(second))
jarque.bera.test(resid(third))