<- readxl::read_excel("infrastructure-data-may-2020.xlsx", sheet = "cu$inv",skip = 2) df_inv_raw
BEA Infrastructure Investment
Introduction
The measurement of infrastructure in the U.S. National Economic Accounts (NEAs) considers different metrics, provided are the resources devoted to different types of infrastructure each year and a useful overview of trends.
Classification of investment categories is provided by North American Industry Classification System (NAICS)
Types of infrastructure:
- Basic: transportation and utilities
- Social: pubblic safety, education and health
- Digital: it excludes servers owned by private firms outside of NAICS 518 and 519
the analysis spans through 70 Years of investments from 1947 to 2017.
Data
This analysis starts with analysing the classification of the investment categories by looking at:
Original data
The original datasets provided by the website are very useful to check whether the category and the meta category variables are homogeneous within the data sets:
1 - Investment data set:
Investment data set is extracted from the original .xlsx file and shows the basic structure of the group categories on which the analysis is based.
<- df_inv_raw %>%
df_inv_raw rename(group = ...1, category = ...2) %>%
filter(!is.na(category)) %>%
mutate(
meta_cat = if_else(!is.na(group), category, NA_character_),
group_num = group,
.after = "category"
)
::datatable(df_inv_raw %>% select(meta_cat,group,category)) DT
In addtion it contains the gross investment variable which is the variable that will be used in the analysis as one of the factors to construct the Implicit Price Deflators index.
names(investment)
2 - Chain investment data set:
It is extracted as the same as the investment set and it contains the same group of category variables except for one element in the category which is missing, it will be shown in more details further below in the analysis.
It doesn’t contains the gross investment variable but it the gross investment chain instead.
names(chain_investment)
3 - IPD: Implicit Price Deflators data set:
::datatable(df_IPD_inv_raw %>% select(meta_cat,group,category)) DT
<- df_IPD_inv_raw %>%
ipd fill(meta_cat, group_num) %>%
pivot_longer(names_to = "year", values_to = "gross_inv_ipd", cols = `1947`:`2017`,
names_transform = list(year = as.integer)) %>%
filter(is.na(group)) %>%
select(-group)
IPD set contains one more meta category as well as one more group number for the GDP element of the vector, which corresponds to group category number 0.
In fact the script for the IPD variable changes slightly with the addition of one more line of code.
It also contains the gross investment ipd variables whixh is the object of this study.
<- ipd %>%
ipd mutate(meta_cat = if_else(category == "GDP", "GDP", meta_cat))
names(ipd)
TidyTuesday data
Data can also be loaded via:
<- tidytuesdayR::tt_load(2021, week = 33)
tuesdata ::readme(tuesdata) tidytuesdayR
Meta Categories and sub-groups
The original data sets are very useful to check whether the investments classified within their categories are homogeneous within the other data sets.
The analysis starts with checking of the three data sets, to see what are the common categories and if there are any differences. Next step would be to unify the three data sets to make a model for the Implicit Price Deflators.
The first data set Investment is the one used for understanding the composition of the variables. The other two are quite similar except for the investment variable, and two other differences in group category.
Let’s start having a look at any missing values:
<- tuesdata$investment
investment ::profile_missing(investment) DataExplorer
No values are missing in investment which is made of 5 variables:
- category: Category of investment (60 categories)
- meta_cat: Group category of investment (16 meta categories)
- group_num: Group number of investment (1 to 20 number of sub-group of investment category)
- year: Year of investment (from 1947 to 2017)
- gross_inv: Gross investment in millions of USD (which ranges from -194 to 500 900 millions $)
The meta_cat
vector is the Group category of investment, and it has a sub-group numeric version named group_num
, and a sub-sub-group of more specific categories of investments named: category
.
The last variable is gross_inv
i.e. Gross investment which is the sum of gross private domestic investment, government gross investment, and balance on current account, national income and product accounts.
::datatable(head(investment,3)) DT
Group category of investment
16 meta_cat
:
::datatable(investment%>%count(meta_cat)) DT
<- investment %>%
investment mutate(meta_cat = case_when(meta_cat=="Total basic infrastructure" ~ "Basic",
TRUE~meta_cat ))
Group number of investment
20 sub-group of the meta_cat
corresponding to group_num
:
::datatable(investment %>% count(group_num)) DT
Category of investment
60 sub-sub-group category
vector:
::datatable(investment %>% count(category)) DT
The list of category
contains a specification of the classification method for some of the elements in the vector, such as:
- Private communications equipment in NAICS 515, 517, 518, and 519
- Private computers in NAICS 515, 517, 518, and 519
- Office buildings, NAICS 518 and 519
- Private software in NAICS 515, 517, 518, and 519
These elements are part of the Digital meta category of investment and the numbers at the ends are to specify that they are classified by the NAICS and only those classified in North America are accepted within the list of digital investments. So, other form of digital investments are not considered in this study.
<- investment %>%
investment mutate(category = case_when(
=="Private communications equipment in NAICS 515, 517, 518, and 519"~"Private communications equipment",
category=="Private computers in NAICS 515, 517, 518, and 519"~"Private computers",
category=="Office buildings, NAICS 518 and 519"~"Office buildings",
category=="Private software in NAICS 515, 517, 518, and 519"~"Private software",
categoryTRUE~category))
Exploratory analysis - Investment
Exploratory analysis on Gross investments shows an increase in investments in the private sector as the most flourishing category within the last 70 years, followed by basic, S&L and social investments in infrastructures. Digital infrastructure and transports are still below 200 000 millions $. To be noted is that the private category for gross investment is still divided by amount of investments, but as a whole it releases the stronger increase over time.
plot1
A second visualization of the gross investment shows the category with the highest level of gross investment within the last 70 years in millions $.
plot2
plot3
Before continuing with other visualizations of investment trends, let’s check the other data sets to see the differences and then decide if to make a unified dataframe to use for comparison of gross investments, chained investments with implicit price deflactors (IPDs).
The first 4 variables are in common within all datasets, the next step is to check whether there are differences within the first 4 variables.
The second set doesn’t contains differences in classification when compared with investment set.
<- tuesdata$chain_investment
chain_investment ::profile_missing(chain_investment) DataExplorer
<- chain_investment%>%
chain_investment mutate(meta_cat = case_when(meta_cat=="Total basic infrastructure" ~ "Basic",
TRUE~meta_cat ))
59 chain investment categories:
<- chain_investment %>%
chain_investment mutate(category=case_when(category=="Private communications equipment in NAICS 515, 517, 518, and 519"~"Private communications equipment",
=="Private computers in NAICS 515, 517, 518, and 519"~"Private computers",
category=="Office buildings, NAICS 518 and 519"~"Office buildings",
category=="Private software in NAICS 515, 517, 518, and 519"~"Private software",
categoryTRUE~category))
“Office buildings” is the only category in “investment” df which doesn’t appear in “chain_investment” df.
<- chain_investment %>% count(category) %>% select(-n) %>% unlist()
cat %>% filter(!category %in% cat) %>%
investment count(category)
Implicit Price Deflators (IPDs). An implicit price deflator is the ratio of the current-dollar value of a series, such as gross domestic product (GDP), to its corresponding chained-dollar value, multiplied by 100.
<- tuesdata$ipd
ipd ::profile_missing(ipd) DataExplorer
<- investment%>%count(meta_cat)%>%select(-n)%>%unlist()
meta_inv %>%filter(!meta_cat%in%meta_inv)%>%count(meta_cat) ipd
<- ipd%>%
ipd mutate(meta_cat = case_when(meta_cat=="Total basic infrastructure" ~ "Basic",
=="Infrastructure"~"Total infrastructure",
meta_cat=="Health care"~"Health",
meta_catTRUE~meta_cat ))
<- ipd %>%
ipd mutate(category=case_when(category=="Private communications equipment in NAICS 515, 517, 518, and 519"~"Private communications equipment",
=="Private computers in NAICS 515, 517, 518, and 519"~"Private computers",
category=="Office buildings, NAICS 518 and 519"~"Office buildings",
category=="Private software in NAICS 515, 517, 518, and 519"~"Private software",
categoryTRUE~category))
60 categories are in the “Implicit Price Deflators” df
<-ipd %>% count(category)%>%unlist() ipd_cats
comparing it with the “investment” and “chain investment” categories, 6 of those have a slightly different name, the best wat to handle this for obtaining a unified data set made of the three sets is to have homogeneous values for the category variable.
These are the 6 categories in ipd to be changed:
<- investment%>%count(category)%>%select(-n)%>%unlist()
cat_investment <- ipd%>%filter(!category%in%cat_investment)%>%count(category)%>%select(-n)%>%unlist()
ipd_cats_to_be_renamed
ipd_cats_to_be_renamed
These 6 categories are in the investment data set and we want to use these names for the category variable for all the sets:
<- investment %>% filter(!category %in% ipd_cats) %>% count(category,group_num) %>% arrange(group_num)
selected_categories
selected_categories
<- ipd %>%
ipd mutate(category = case_when(
== "Basic" ~ "Total basic infrastructure",
category == "Social" ~ "Total social infrastructure",
category == "Digital" ~ "Total digital infrastructure",
category == "Health care" ~ "Health",
category == "Communications structures" ~ "Private communications structures",
category TRUE ~ category
))
Let’s see how meta_cat
of these 6 categories perform:
plot4
<- ipd %>%
total_inv_df count(category,meta_cat,year,gross_inv_ipd) %>%
group_by(year,meta_cat) %>%
summarize(total=ifelse(!is.na(round(sum(gross_inv_ipd))),round(sum(gross_inv_ipd)), 0)) %>%
ungroup() %>%
filter(!total==0)
library(extrafont)
library(showtext)
showtext_auto()
showtext_opts(dpi = 320)
font_add_google("Roboto Condensed", "roboto condensed")
plot5
library(patchwork)
library(ggpubr)
library(cowplot)
<- png::readPNG('images.png')
img
<- ggarrange(plot5)
graphics
<- annotate_figure(graphics,
final_plot top = text_grob("IPDs Group investment categories\n",
color = c("#FFD700"), face = "bold", size = 34,
family="roboto condensed"),
bottom = text_grob("BEA: measurement of infrastructure in the U.S. National Economic Accounts (NEAs)\nInfographic: @fgazzelloni\n DataSource: TidyTuesday Week33: BEA Infrastructure Investment",
color = "#6C7B8B",family="roboto condensed",
hjust = 0.5, x = 0.5, face = "bold.italic", size = 10),
left = text_grob("", color = c("#778899"), rot = 90,size=10),
right = text_grob(bquote(""), color=c("#778899"),rot = 90,size=10),
fig.lab = "TidyTuesday week33", fig.lab.face = "bold.italic",fig.lab.size=8,
fig.lab.pos="bottom.right"
)
<- final_plot +
final_plot annotate(geom = "text", label="values of the implicit price deflator ratio",x = 0.25, y = 0.92,
colour = "black", face="bold",size = 7,family="roboto condensed") +
annotate(geom = "text", label="Digital just started in 1978 and followed \nwith a deep decline during the last 40 years",x = 0.70, y = 0.68, colour = "#FF7F00", size = 5,family="roboto condensed")
<- ggdraw() +
final draw_image(img, x = -0.4, y = -0.48, scale = .15) +
draw_plot(final_plot)
finalshowtext.auto(enable = FALSE)
::agg_png("w33_bea2.png",
raggres = 320, width = 14, height = 8, units = "in")
final
dev.off()
Investment on Electric power only appear starting from 1993 with a ratio of 336:
<-total_inv_df %>% filter(year==1992)%>%count(meta_cat)%>%select(-n)%>%unlist()
list%>% filter(year==1993,!meta_cat%in%list) total_inv_df
meta_cat are the Group category of investment, there are 4 values which are different within the sets:
- Health care = Health
- Infrastructure = Total infrastructure
- Basic = Total basic infrastructure
- GDP
We need to add GDP as further meta_cat for both investment and chain_investment sets, and chenge the other three values into common values as shown above.
<- investment %>% count(meta_cat) %>% select(-n) %>% unlist()
inv_meta_cats
%>% filter(!meta_cat %in% inv_meta_cats) %>% count(meta_cat) ipd
GDP is the only left meta_cat category, that it is not present in the other sets.
%>% filter(!meta_cat %in% inv_meta_cats) %>% count(meta_cat) ipd
What about category variable? What are the differences in ipd set when compared with the other two?
As seen investment and chain_investment sets have the same category variables except for one variable which is “Office buildings”, so next step is to find the differences of this vector in ipd set:
<- investment %>% count (category) %>% select(-n) %>% unlist
cat_inv %>% filter(!category %in% cat_inv) %>% count(category) ipd
we need to convert these values in ipd set:
- Basic = Total basic infrastructure
- Communications structures = Private communications structures
- Digital = Total digital infrastructure
- GDP
- Health care = Health
- Social = Total social infrastructure
Finally, GDP is the category group number 0 which is not in investment and chain investment sets, but only in IPD. While “Office buildings” is only in investment.
<- ipd %>%
ipd mutate(category = case_when(
=="Basic"~"Total basic infrastructure",
category=="Communications structures"~"Private communications structures",
category=="Digital"~"Total digital infrastructure",
category=="Health care"~"Health",
category=="Social"~"Total social infrastructure",
categoryTRUE~category))
%>% filter(!category %in% cat_inv) %>% count(category) ipd
The new data sets will contain the 4 common variables plus these other three:
- gross_inv: Gross investment in millions of USD
- gross_inv_chain: Gross investment (chained 2021 dollars) in millions of USD
- gross_inv_ipd: Implicit Price Deflators (IPDs)
The Implicit Price Deflators (IPDs) is obtained as an index result of:
[ (gross_inv *100) / (gross_inv_chain *100) ] *100
The ipd set contains some missing values, these values are the results of 0 investments for some of the selected categories within the years.
%>% filter(is.na(gross_inv_ipd)) %>% count(meta_cat,category) ipd
Let’s calculate the gross_inv_ipd vector with the formula, assigning its value to a different name vector and then check the missing values. To do this we need to join the three sets together:
<- investment%>%
investment_full full_join(chain_investment,by=c("category","meta_cat","group_num","year")) %>%
full_join(ipd,by=c("category","meta_cat","group_num","year")) %>%
#filter(meta_cat=="GDP"| category=="Office buildings") %>%
mutate(gross_inv_chain = if_else(category=="Office buildings",0,gross_inv_chain),
gross_inv_ipd = if_else(gross_inv<=0,0,gross_inv_ipd),
gross_inv_ipd2 = if_else(gross_inv_chain==0,0,((gross_inv/100)/(gross_inv_chain/100))*100),
gross_inv_ipd2 = round(gross_inv_ipd2,2))
#DataExplorer::profile_missing()