6.5 Homework assignment

From the World Bank Database load the cross-sectional data for selected \(20\) countries in year \(2021\) directly into RStudio with respect to \(3\) variables:
a) government expenditure on education (\(\%\) of GDP),
b) expenditure on health per capita (USD) and
c) poverty headcount ratio at national level (\(\%\) of population)
For that purpose use command wb_data() from the package wbstats. Codes of variables and countries can be found on the web page World Development Indicators by checking metadata glossary.

Solution

Copy the code lines below to the clipboard, paste them into an R Script file opened in RStudio, and run them. The object named mydataset is a data frame where rows represent countries, and columns represent variables (cross-sectional data should always be organized in a wide format). In this example, data preparation involves the following steps: naming the rows, removing some redundant columns, renaming columns for easier usage later, checking the structure of the data and finally checking if data are complete (no missing values).

install.packages("wbstats") # installing the package for the first time
library(wbstats) # loading installed package "wbstats" which supports wb_data() command
mydataset=wb_data(indicator=c("SE.XPD.TOTL.GD.ZS","SH.XPD.CHEX.PC.CD","SI.POV.NAHC"), # variable codes
                  country=c("AUT","BEL","BGR","CZE","HRV","DEU","EST","FRA","GRC","HUN","ITA","POL","PRT","ROU","SVK","SVN","ESP","TUR","CYP","SRB"), # country codes
                  start_date=2021, #  annual data in one year are considered
                  end_date=2021, # the same as start date due to cross-sectional data
                  return_wide=TRUE) # if FALSE long format is returned which is not appropriate)

# Formating the object "mydataset" as data frame
mydataset=as.data.frame(mydataset)
# Naming the rows of a data frame by country names
row.names(mydataset)=c(mydataset$country)
# Removing the first 4 columns of a data frame as unnecessary
mydataset=mydataset[,-c(1,2,3,4)]
# Renaming the columns of data frame as "education", "health" and "poverty" (keep in mind ordering of the variables)
colnames(mydataset)=c("education","health","poverty")
# Checking the structure of object "mydataset"
str(mydataset)
# Checking if there are missing values within object "mydataset"
sum(is.na(mydataset)) # if sum returns zero data are complete

\(~~~\)

Provide basic descriptive statistics of all variables in a single table as well as correlation matrix by commands datasummary() and datasummary_correlation(). Start econometric analysis with estimating a bivariate linear model, named first, to find if education reduces poverty in given countries. Visualize a linear dependence by plot() command to produce a scatter plot. Print the results of estimated model first by coeftest() command from the package lmtest.

Solution

Copy the code lines below to the clipboard, paste them into an R Script file opened in RStudio, and run them.

library(modelsummary) # loading already installed package from the library
datasummary(education+health+poverty~min+max+mean+sd+N,data=mydataset) # basic descriptive statistics
datasummary_correlation(mydataset,fmt=4) # reporting correlations (lower triangular part of the symmetric matrix)
cor(mydataset) # alternatively, correlation matrix can be printed in the console window
first=lm(poverty~education,data=mydataset) # estimating the "first" model by lm() command

plot(mydataset$education,mydataset$poverty, # first variable is always displayed on x-axis, and second one on y-axis
     xlab = "Expenditure on education", # labeling x-axis
     ylab = "Poverty headcount ratio", # labeling y-axis
     main = "Scatter plot", # title of the plot
     col = "blue", # points color
     pch = 19)  # points shape
abline(first,col="red") # adding a bivariate linear model (colored in red) to the same plot

library(lmtest) # loading already installed package
coeftest(first) # printing the results of object "first" by coeftest() command

\(~~~\)

Estimate the second model which includes the health as additional RHS variable. Estimate the third model which considers the log transformation of the variable health. Use coeftest() command to check results of estimated models or use modelsummary() command to compare the results of all models in a single table.

Solution

Copy the code lines below to the clipboard, paste them into an R Script file opened in RStudio, and run them. In the given example, the additional variable health does not improve the first model or its transformation, although the coefficient with respect to education retains expected negative sign. Moreover, the coefficients in the second and third models are not statistically significant. One of the reason for this could be that education and health are highly correlated RHS variables, indicating a serious multicollinearity problem.

second=lm(poverty~education+health,data=mydataset)
coeftest(second)
third=lm(poverty~education+log(health),data=mydataset)
coeftest(third)
modelsummary(list("First model"=first,"Second model"=second,"Third model"=third),stars=TRUE,fmt=3)

\(~~~\)

Using vif() command, from the car package, check if multicollinearity problem really exist in both models (secondand third). Perform diagnostic checking of all three models and conclude which model(s) pass all diagnostic tests (run BP test, BG test of order \(1\) and order \(2\), and JB test for each model separately).

Solution

Copy the code lines below to the clipboard, paste them into an R Script file opened in RStudio, and run them. From diagnostic checking you can conclude that the first model passes all diagnostic tests, indicating that all assumptions considering error terms (residuals) are fulfilled in the first model, i.e. error terms are independently and normally distributed with zero mean and constant variance. Rejection of the null hypothesis of BP, BG and JB test supports this conclusion. Although the first model demonstrates the negative impact of education on poverty and passes diagnostic tests, it is a poor fit. This is most likely due to the selection of heterogeneous countries in the sample (the scatter plot shows that some countries are very far from the linear model). In such cases, it is advisable to increase the sample of countries or select countries that are more similar (homogeneous).

library(car)
vif(second)
vif(third)

# Diagnostic checking of all three models
# Running BP test for each model
bptest(first,studentize=FALSE) 
bptest(second,studentize=FALSE)
bptest(third,studentize=FALSE)

# Running BG test for each model (with order 1 and order 2)
bgtest(first,order=1)
bgtest(first,order=2)
bgtest(second,order=1)
bgtest(second,order=2)
bgtest(third,order=1)
bgtest(third,order=2)

# Running JB test for each model (requires loading "tseries" package from the library)
library(tseries)
jarque.bera.test(resid(first))
jarque.bera.test(resid(second))
jarque.bera.test(resid(third))