2.8 Packages
Everything which is done in R is done by functions. Commonly used functions are grouped in packages. Installing different packages expand the functionality of R.
To install a package, say dplyr, for the first time, type
To load the package, type the package name without quotation
Note that we only need to install each package once but we need to load the package each time to access the functions inside the package.
2.8.1 dplyr package
It is very convenient data management package. It can filter data, sort data, select data and generate new columns.
The filter() function is similar to Excel filter function.
df <- data.frame(Price=c(1.2,2.4,3.6,4.8),
month=c(1,2,1,2),
day=c(1,2,3,4))
dplyr::filter(df, month==1 , day==3)
## Price month day
## 1 3.6 1 3
Note that we have added ``dplyr::’’ is to avoid name conflict of filter
The arrange() is like Excel sorting function. The syntax is arrange(dataframe, variables):
df <- data.frame(school=c("NTU","SMU","NUS"),
rank=c(2,1,3),
size=c(1,3,2))
dplyr::arrange(df, rank, size)
## school rank size
## 1 SMU 1 3
## 2 NTU 2 1
## 3 NUS 3 2
The default order is ascending. To sort by a descending order, one can simply use desc():
df <- data.frame(school=c("NTU","SMU","NUS"),
rank=c(2,1,3),
size=c(1,3,2))
dplyr::arrange(df, desc(rank), size)
## school rank size
## 1 NUS 3 2
## 2 NTU 2 1
## 3 SMU 1 3
The select() is to select columns. It behaves as if Excel hide column function or delete column function.
## x rank
## 1 NTU 2
## 2 SMU 1
## 3 NUS 3
The mutate() function add new columns in the dataframe. It works like *``gen function in Stata.
df <- data.frame(product=c("chicken rice", "laska"),
revenue=c(10,5),
cost=c(5,4))
mutate(df, profit = revenue-cost,
profit_margin = profit/revenue)
## product revenue cost profit profit_margin
## 1 chicken rice 10 5 5 0.5
## 2 laska 5 4 1 0.2
To create columns based on some simple condition, we can use ifelse()
df <- data.frame(ID=c("a","b","c","d","e","f"),
x=c(1,2,3,4,5,6),
y=c(1,2,1,2,1,2),
z=c(6,5,4,3,2,1))
mutate(df, a = ifelse(x>3, 1,0))
## ID x y z a
## 1 a 1 1 6 0
## 2 b 2 2 5 0
## 3 c 3 1 4 0
## 4 d 4 2 3 1
## 5 e 5 1 2 1
## 6 f 6 2 1 1
To create columns based on some simple condition, we can use case_when().
df <- data.frame(ID=c("a","b","c","d","e","f"),
x=c(1,2,3,4,5,6),
y=c(1,2,1,2,1,2),
z=c(6,5,4,3,2,1))
mutate(df, a = case_when(x==1 | x==2 ~0,
x==3 | x==4 ~1,
x==5 | x==6 ~2)
)
## ID x y z a
## 1 a 1 1 6 0
## 2 b 2 2 5 0
## 3 c 3 1 4 1
## 4 d 4 2 3 1
## 5 e 5 1 2 2
## 6 f 6 2 1 2
To create summary by group, we first use group_by() to group the data, and then use summarize() to apply functions.
df <- data.frame(ID=c("a","b","a","b","a","b"),
x=c(1,2,3,4,5,6),
y=c(1,4,2,3,6,2))
newdf <- group_by(df, ID)
summarize(newdf, xbar=mean(x), yvar=var(y))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 3
## ID xbar yvar
## <chr> <dbl> <dbl>
## 1 a 3 7
## 2 b 4 1
Note that the output is a tibble instead of dataframe. This is almost the same as dataframe and the default output for the package dplyr.
2.8.2 Tidyr package
Tidyr package is useful to transform data from long format to wide format. The two main functions are gather() and spread():
We first install and load the reshape package.
The spread(data,key,value) function to split value according to value:
df1 <- data.frame(
time = c(1,2,1,2),
key = c("bid", "bid", "ask", "ask"),
price = c(1,2,3,4)
)
df2 <- spread(df1,key,price)
df2
## time ask bid
## 1 1 3 1
## 2 2 4 2
The gather(data,key,value,selection) is reverse of spread(). Note that key and value are strings for the name of new column. Selection is set of columns remain untounched. Use -y to exclude y. Use -y, -z to exclude more variables.
df1 <- data.frame(
time = c(1,2),
ask = c(3,4),
bid = c(1,2)
)
df2 <- gather(df1, "key", "price",-time)
df2
## time key price
## 1 1 ask 3
## 2 2 ask 4
## 3 1 bid 1
## 4 2 bid 2
2.8.3 reshape package
Reshape package is an alternative to tidyr package that is also designed to transform data from long format to wide format. The two main functions for reshape are melt() and cast().
We first install and load the reshape package.
The melt() function collapses all columns into rows
df1 <- data.frame(id=c(1,1,2,2),
time=c(1,2,1,2),
bid=c(1,2,3,4),
ask=c(5,6,7,8))
df2 <- melt(df1, id=c("id","time"))
df2
## id time variable value
## 1 1 1 bid 1
## 2 1 2 bid 2
## 3 2 1 bid 3
## 4 2 2 bid 4
## 5 1 1 ask 5
## 6 1 2 ask 6
## 7 2 1 ask 7
## 8 2 2 ask 8
The cast() function rebuild new table after it is melt. The casting formula: row variable 1 + row variable 2 ~ column variable 1 + column variable 2.
The following code will return original table. It is because the original table has id and time as row variables while bid and ask are column variables.
## id time bid ask
## 1 1 1 1 5
## 2 1 2 2 6
## 3 2 1 3 7
## 4 2 2 4 8
The following code will make id and variable as row variable and time as column variables.
## id variable 1 2
## 1 1 bid 1 2
## 2 1 ask 5 6
## 3 2 bid 3 4
## 4 2 ask 7 8
The following code will make id as row variable, and time and variable as column variables.
## id bid_1 bid_2 ask_1 ask_2
## 1 1 1 2 5 6
## 2 2 3 4 7 8
2.8.4 Application to World Bank Data
To access on World Bank data in R, we first install the wbstats packages:
The following code download data description from World bank database. Since it is too large, we just download the description first. Then we will download only the variables we need.
Since the database is huge, we need to search for relevant data. We will need to obtain GDP per capita in US dollar and population data.
After checking the gdp and pop, we decide to download two data series: “SP.POP.TOTL” and “NY.GDP.PCAP.KD”.
download<- wb(country=c("US", "SG"),
indicator = c("SP.POP.TOTL",
"NY.GDP.PCAP.KD"),
startdate = 2000, enddate = 2017)
We will use dplyr to clean up the data.
Since the data contains a lot of redundant columns, we first clean the data.
## date indicator country value
## 1 2017 Population, total Singapore 5612253
## 2 2016 Population, total Singapore 5607283
## 3 2015 Population, total Singapore 5535002
The data is in long format. We can use in tidyr package to change it to wide format.
To start, we use function spread():
## date country GDP per capita (constant 2010 US$) Population, total
## 1 2000 Singapore 33850.75 4027887
## 2 2000 United States 44726.97 282162411
## 3 2001 Singapore 32597.64 4138012
Finally, we make some decoration.
## year country gdp pop
## 1 2000 Singapore 33850.75 4027887
## 2 2000 United States 44726.97 282162411
## 3 2001 Singapore 32597.64 4138012
Alternative, we can use reshape package to change it to wide format.
To start, we first melt() it and then cast() it to the desired format.
Finally, we make some decoration.
## country year gdp pop
## 1 Singapore 2000 33850.75 4027887
## 2 Singapore 2001 32597.64 4138012
## 3 Singapore 2002 33565.97 4175950