2.8 Packages

Everything which is done in R is done by functions. Commonly used functions are grouped in packages. Installing different packages expand the functionality of R.

To install a package, say dplyr, for the first time, type

install.packages("dplyr")

To load the package, type the package name without quotation

library(dplyr)

Note that we only need to install each package once but we need to load the package each time to access the functions inside the package.

2.8.1 dplyr package

It is very convenient data management package. It can filter data, sort data, select data and generate new columns.

The filter() function is similar to Excel filter function.

df <- data.frame(Price=c(1.2,2.4,3.6,4.8), 
                 month=c(1,2,1,2), 
                 day=c(1,2,3,4))
dplyr::filter(df, month==1 , day==3)

##   Price month day
## 1   3.6     1   3

Note that we have added ``dplyr::’’ is to avoid name conflict of filter

The arrange() is like Excel sorting function. The syntax is arrange(dataframe, variables):

df <- data.frame(school=c("NTU","SMU","NUS"), 
                 rank=c(2,1,3), 
                 size=c(1,3,2))
dplyr::arrange(df, rank, size)

##   school rank size
## 1    SMU    1    3
## 2    NTU    2    1
## 3    NUS    3    2

The default order is ascending. To sort by a descending order, one can simply use desc():

df <- data.frame(school=c("NTU","SMU","NUS"),
                 rank=c(2,1,3),
                 size=c(1,3,2))
dplyr::arrange(df, desc(rank), size)

##   school rank size
## 1    NUS    3    2
## 2    NTU    2    1
## 3    SMU    1    3

The select() is to select columns. It behaves as if Excel hide column function or delete column function.

df <- data.frame(x=c("NTU","SMU","NUS"),
                 rank=c(2,1,3),
                 size=c(1,3,2))
dplyr::select(df, x, rank)

##     x rank
## 1 NTU    2
## 2 SMU    1
## 3 NUS    3

The mutate() function add new columns in the dataframe. It works like *``gen function in Stata.

df <- data.frame(product=c("chicken rice", "laska"),
                 revenue=c(10,5), 
                 cost=c(5,4))
mutate(df, profit = revenue-cost, 
                 profit_margin = profit/revenue)

##        product revenue cost profit profit_margin
## 1 chicken rice      10    5      5           0.5
## 2        laska       5    4      1           0.2

To create columns based on some simple condition, we can use ifelse()

df <- data.frame(ID=c("a","b","c","d","e","f"), 
                 x=c(1,2,3,4,5,6), 
                 y=c(1,2,1,2,1,2),
                 z=c(6,5,4,3,2,1))
mutate(df, a = ifelse(x>3, 1,0))

##   ID x y z a
## 1  a 1 1 6 0
## 2  b 2 2 5 0
## 3  c 3 1 4 0
## 4  d 4 2 3 1
## 5  e 5 1 2 1
## 6  f 6 2 1 1

To create columns based on some simple condition, we can use case_when().

df <- data.frame(ID=c("a","b","c","d","e","f"), 
                 x=c(1,2,3,4,5,6), 
                 y=c(1,2,1,2,1,2),
                 z=c(6,5,4,3,2,1))
mutate(df, a = case_when(x==1 | x==2 ~0,
                         x==3 | x==4 ~1,
                         x==5 | x==6 ~2)
       )

##   ID x y z a
## 1  a 1 1 6 0
## 2  b 2 2 5 0
## 3  c 3 1 4 1
## 4  d 4 2 3 1
## 5  e 5 1 2 2
## 6  f 6 2 1 2

To create summary by group, we first use group_by() to group the data, and then use summarize() to apply functions.

df <- data.frame(ID=c("a","b","a","b","a","b"), 
                 x=c(1,2,3,4,5,6), 
                 y=c(1,4,2,3,6,2))
newdf <- group_by(df, ID)
summarize(newdf, xbar=mean(x), yvar=var(y))

## # A tibble: 2 x 3
##   ID     xbar  yvar
##   <fct> <dbl> <dbl>
## 1 a         3     7
## 2 b         4     1

Note that the output is a tibble instead of dataframe. This is almost the same as dataframe and the default output for the package dplyr.

2.8.2 Tidyr package

Tidyr package is useful to transform data from long format to wide format. The two main functions are gather() and spread():

We first install and load the reshape package.

install.packages("tidyr")
library(tidyr)

The spread(data,key,value) function to split value according to value:

df1 <- data.frame(
  time = c(1,2,1,2),
  key = c("bid", "bid", "ask", "ask"),
  price = c(1,2,3,4)
  )
df2 <- spread(df1,key,price)
df2

##   time ask bid
## 1    1   3   1
## 2    2   4   2

The gather(data,key,value,selection) is reverse of spread(). Note that key and value are strings for the name of new column. Selection is set of columns remain untounched. Use -y to exclude y. Use -y, -z to exclude more variables.

df1 <- data.frame(
  time = c(1,2),
  ask = c(3,4),
  bid = c(1,2)
  )
df2 <- gather(df1, "key", "price",-time)
df2

##   time key price
## 1    1 ask     3
## 2    2 ask     4
## 3    1 bid     1
## 4    2 bid     2

2.8.3 reshape package

Reshape package is an alternative to tidyr package that is also designed to transform data from long format to wide format. The two main functions for reshape are melt() and cast().

We first install and load the reshape package.

 install.packages("reshape")
 library(reshape)

The melt() function collapses all columns into rows

df1 <- data.frame(id=c(1,1,2,2),
                  time=c(1,2,1,2),
                  bid=c(1,2,3,4),
                  ask=c(5,6,7,8))
df2 <- melt(df1, id=c("id","time"))
df2

##   id time variable value
## 1  1    1      bid     1
## 2  1    2      bid     2
## 3  2    1      bid     3
## 4  2    2      bid     4
## 5  1    1      ask     5
## 6  1    2      ask     6
## 7  2    1      ask     7
## 8  2    2      ask     8

The cast() function rebuild new table after it is melt. The casting formula: row variable 1 + row variable 2 ~ column variable 1 + column variable 2.

The following code will return original table. It is because the original table has id and time as row variables while bid and ask are column variables.

df3 <- cast(df2, id+time~variable)
df3

##   id time bid ask
## 1  1    1   1   5
## 2  1    2   2   6
## 3  2    1   3   7
## 4  2    2   4   8

The following code will make id and variable as row variable and time as column variables.

df4 <- cast(df2, id+variable~time)
df4

##   id variable 1 2
## 1  1      bid 1 2
## 2  1      ask 5 6
## 3  2      bid 3 4
## 4  2      ask 7 8

The following code will make id as row variable, and time and variable as column variables.

df5 <- cast(df2, id~variable+time)
df5

##   id bid_1 bid_2 ask_1 ask_2
## 1  1     1     2     5     6
## 2  2     3     4     7     8

2.8.4 Application to World Bank Data

To access on World Bank data in R, we first install the wbstats packages:

install.packages("wbstats")
library(wbstats,warn.conflicts = FALSE)

The following code download data description from World bank database. Since it is too large, we just download the description first. Then we will download only the variables we need.

new_cache <- wbcache()

Since the database is huge, we need to search for relevant data. We will need to obtain GDP per capita in US dollar and population data.

gdp <- wbsearch("gdp.*capita.*US\\$")
pop <- wbsearch("population, total")

After checking the gdp and pop, we decide to download two data series: “SP.POP.TOTL” and “NY.GDP.PCAP.KD”.

download<- wb(country=c("US", "SG"),
              indicator = c("SP.POP.TOTL",
                            "NY.GDP.PCAP.KD"), 
              startdate = 2000, enddate = 2017)

We will use dplyr to clean up the data.

library(dplyr,warn.conflicts = FALSE)

Since the data contains a lot of redundant columns, we first clean the data.

long <- dplyr::select(download, date, 
                      indicator, country, value)
head(long,3)

##   date         indicator   country   value
## 1 2017 Population, total Singapore 5612253
## 2 2016 Population, total Singapore 5607283
## 3 2015 Population, total Singapore 5535002

The data is in long format. We can use in tidyr package to change it to wide format.

library(tidyr)

To start, we use function spread():

data <- spread(long,indicator,value)
head(data,3)

##   date       country GDP per capita (constant 2010 US$) Population, total
## 1 2000     Singapore                           33850.76           4027887
## 2 2000 United States                           44726.97         282162411
## 3 2001     Singapore                           32597.62           4138012

Finally, we make some decoration.

colnames(data) <- c( "year","country","gdp", "pop")
head(data,3)

##   year       country      gdp       pop
## 1 2000     Singapore 33850.76   4027887
## 2 2000 United States 44726.97 282162411
## 3 2001     Singapore 32597.62   4138012

Alternative, we can use reshape package to change it to wide format.

library(reshape,warn.conflicts = FALSE)

To start, we first melt() it and then cast() it to the desired format.

temp <- melt(long, id=c("date","indicator","country"))
data <- cast(temp, country + date~indicator)

Finally, we make some decoration.

colnames(data) <- c("country", "year", "gdp", "pop")
head(data,3)

##     country year      gdp     pop
## 1 Singapore 2000 33850.76 4027887
## 2 Singapore 2001 32597.62 4138012
## 3 Singapore 2002 33565.97 4175950