2.7 Packages

Everything which is done in R is done by functions. Commonly used functions are grouped in packages. Installing different packages expand the functionality of R.

To install a package, say dplyr, for the first time, type

install.packages("dplyr")

To load the package, type the package name without quotation

library(dplyr)

Note that we only need to install each package once but we need to load the package each time to access the functions inside the package.

2.7.1 dplyr package

It is very convenient data management package. It can filter data, sort data, select data and generate new columns.

The filter() function is similar to Excel filter function.

df <- data.frame(Price=c(1.2,2.4,3.6,4.8), 
                 month=c(1,2,1,2), 
                 day=c(1,2,3,4))
dplyr::filter(df, month ==1 , day ==3)

##   Price month day
## 1   3.6     1   3

Note that we have added dplyr::’’ is to avoid name conflict of filter

The arrange() is like Excel sorting function. The syntax is arrange(dataframe, variables):

df <- data.frame(school=c("NTU","SMU","NUS"), 
                 rank=c(2,1,3), 
                 size=c(1,3,2))
dplyr::arrange(df, rank, size)

##   school rank size
## 1    SMU    1    3
## 2    NTU    2    1
## 3    NUS    3    2

The default order is ascending. To sort by a descending order, one can simply use desc():

df <- data.frame(school=c("NTU","SMU","NUS"),
                 rank=c(2,1,3),
                 size=c(1,3,2))
dplyr::arrange(df, desc(rank), size)

##   school rank size
## 1    NUS    3    2
## 2    NTU    2    1
## 3    SMU    1    3

The select() is to select columns. It behaves as if Excel hide column function or delete column function.

df <- data.frame(x=c("NTU","SMU","NUS"),
                 rank=c(2,1,3),
                 size=c(1,3,2))
dplyr::select(df, x, rank)

##     x rank
## 1 NTU    2
## 2 SMU    1
## 3 NUS    3

The mutate() function add new columns in the dataframe. It works like *gen function in Stata.

df <- data.frame(product=c("chicken rice", "laska"),
                 revenue=c(10,5), 
                 cost=c(5,4))
mutate(df, profit = revenue-cost, 
                 profit_margin = profit/revenue)

##        product revenue cost profit profit_margin
## 1 chicken rice      10    5      5           0.5
## 2        laska       5    4      1           0.2

To create columns based on some simple condition, we can use ifelse()

df <- data.frame(ID=c("a","b","c","d","e","f"), 
                 x=c(1,2,3,4,5,6), 
                 y=c(1,2,1,2,1,2),
                 z=c(6,5,4,3,2,1))
mutate(df, a = ifelse(x>3, 1,0))

##   ID x y z a
## 1  a 1 1 6 0
## 2  b 2 2 5 0
## 3  c 3 1 4 0
## 4  d 4 2 3 1
## 5  e 5 1 2 1
## 6  f 6 2 1 1

To create columns based on some simple condition, we can use case_when.

df <- data.frame(ID=c("a","b","c","d","e","f"), 
                 x=c(1,2,3,4,5,6), 
                 y=c(1,2,1,2,1,2),
                 z=c(6,5,4,3,2,1))
mutate(df, a = case_when(x==1 | x==2 ~0,
                         x==3 | x==4 ~1,
                         x==5 | x==6 ~2)
       )

##   ID x y z a
## 1  a 1 1 6 0
## 2  b 2 2 5 0
## 3  c 3 1 4 1
## 4  d 4 2 3 1
## 5  e 5 1 2 2
## 6  f 6 2 1 2

To create summary by group, we first use group_by() to group the data, and then use summarize() to apply functions.

df <- data.frame(ID=c("a","b","a","b","a","b"), 
                 x=c(1,2,3,4,5,6), 
                 y=c(1,4,2,3,6,2))
newdf <- group_by(df, ID)
summarize(newdf, xbar=mean(x), yvar=var(y))

## # A tibble: 2 x 3
##   ID     xbar  yvar
##   <fct> <dbl> <dbl>
## 1 a         3     7
## 2 b         4     1

2.7.2 reshape package

Reshape package is useful to transform data from long format to wide format. The two main functions for reshape is melt and cast.

We first install and load the reshape package.

 install.packages("reshape")
 library(reshape)

The melt() function collapses all columns into rows

df1 <- data.frame(id=c(1,1,2,2),
                  time = c(1,2,1,2),
                  bid=c(1,2,3,4),
                  ask=c(5,6,7,8))
df2 <- melt(df1, id=c("id","time"))
df2

##   id time variable value
## 1  1    1      bid     1
## 2  1    2      bid     2
## 3  2    1      bid     3
## 4  2    2      bid     4
## 5  1    1      ask     5
## 6  1    2      ask     6
## 7  2    1      ask     7
## 8  2    2      ask     8

The cast() function rebuild new table after it is melt. The casting formula: row variable 1 + row variable 2 ~ column variable 1 + column variable 2.

The following code will return original table. It is because the original table has id and time as row variables while bid and ask are column variables.

df3 <- cast(df2, id+time~variable)
df3

##   id time bid ask
## 1  1    1   1   5
## 2  1    2   2   6
## 3  2    1   3   7
## 4  2    2   4   8

The following code will make id and variable as row variable and time as column variables.

df4 <- cast(df2, id+variable~time)
df4

##   id variable 1 2
## 1  1      bid 1 2
## 2  1      ask 5 6
## 3  2      bid 3 4
## 4  2      ask 7 8

The following code will make id as row variable, and time and variable as column variables.

df5 <- cast(df2, id~variable+time)
df5

##   id bid_1 bid_2 ask_1 ask_2
## 1  1     1     2     5     6
## 2  2     3     4     7     8

2.7.3 Application to World Bank Data

To access on World Bank data in R, we first install the wbstats packages:

install.packages("wbstats")
library(wbstats,warn.conflicts = FALSE)

The following code download data description from World bank database. Since it is too large, we just download the description first. Then we will download only the variables we need.

new_cache <- wbcache()

Since the database is huge, we need to search for relevant data. We will need to obtain GDP and population data.

gdp <- wbsearch("gdp.*capita.*US\\$")
pop <- wbsearch("population, total")

After checking the gdp and pop, we decide to download two data series: “SP.POP.TOTL” and “NY.GDP.PCAP.KD”.

download<- wb(country=c("US", "SG"),
              indicator = c("SP.POP.TOTL",
                            "NY.GDP.PCAP.KD"), 
              startdate = 2000, enddate = 2017)

We will use dplyr to clean up the data.

library(dplyr,warn.conflicts = FALSE)

Since the data contains a lot of redundant columns, we first clean the data.

long <- dplyr::select(download, date, 
                      indicator, country,value)
head(long,3)

##   date         indicator   country   value
## 1 2017 Population, total Singapore 5612253
## 2 2016 Population, total Singapore 5607283
## 3 2015 Population, total Singapore 5535002

The data is in long format. We need to use reshape package to change it to wide format.

library(reshape,warn.conflicts = FALSE)

To start, we first melt it and then cast it to the desired format.

temp <- melt(long, id=c("date","indicator","country"))
data <- cast(temp, country + date~indicator)

Finally, we make some decoration.

colnames(data) <- c("country", "year", "gdp", "pop")
data$year <- as.numeric(data$year)
head(data,3)

##     country year      gdp     pop
## 1 Singapore 2000 33390.06 4027887
## 2 Singapore 2001 32191.94 4138012
## 3 Singapore 2002 33242.99 4175950