2.7 Data Management
We first cover data format conversion since sometimes the data is not in the right format. Second, we will learn how to load a built-in dataset to illustrate how to take a snapshot of the data, sorting the data, selecting observation, selecting variable, subsetting the dataset, and merging dataframes. Then we will learn two useful packages: dplyr and reshape. Finally, we apply it to world bank data.
2.7.1 Loading built-in dataset
Load a standard dataset and store it as a dataframe
df <- data.frame(mtcars)
2.7.2 Peeping Data
We may use the head() to checking the top rows and tail() to check bottom rows.
Here is the example data:
x <- 1:5
y <- seq(5,1,-1)
z <- c(1,1,2,2,3)
df <- data.frame(x,y,z)
Here we look at the first three row:
head(df, 3)
## x y z
## 1 1 5 1
## 2 2 4 1
## 3 3 3 2
Here we look at the last three row:
tail(df, 3)
## x y z
## 3 3 3 2
## 4 4 2 2
## 5 5 1 3
2.7.3 Sorting Data
Sorting data by giving multiple criteria using order.
The following code first sort the data by y (in an ascending order) and then break tie using z if they have the same y.
newdf <- df[order(df$y,df$z),]
newdf
## x y z
## 5 5 1 3
## 4 4 2 2
## 3 3 3 2
## 2 2 4 1
## 1 1 5 1
2.7.4 Joining dataframes
To merge two dataframes horizontally by joining through an unique identifier, one may use merge().
df1<-data.frame(ID=c("a","b"), x=c(1,2))
df2<-data.frame(ID=c("b","a"), y=c(3,4))
df3 <-merge(df1,df2,by="ID")
df3
## ID x y
## 1 a 1 4
## 2 b 2 3
If two dataframes are just to join horizontally without an unique identifier, then use cbind().
df1<-data.frame(ID=c("a","b"), x=c(1,2))
df2<-data.frame(ID=c("b","a"), y=c(3,4))
df3 <-cbind(df1,df2)
df3
## ID x ID y
## 1 a 1 b 3
## 2 b 2 a 4
2.7.5 Stacking Dataframe
If dataframes are just to join vertically, then use rbind(). Note that rbind requires dataframes have the same columns names.
df1<-data.frame(ID=c("a","b"),
x=c(1,2), y=c(1,2))
df2<-data.frame(ID=c("c","d"),
x=c(3,4), y=c(1,2))
df3<-data.frame(ID=c("e","f"),
x=c(5,6), y=c(1,2))
df4 <-rbind(df1,df2,df3)
df4
## ID x y
## 1 a 1 1
## 2 b 2 2
## 3 c 3 1
## 4 d 4 2
## 5 e 5 1
## 6 f 6 2
2.7.6 Selecting Columns (Variables)
The following code selecting columns 1 to 3.
df <- data.frame(ID=c("a","b","c","d","e","f"),
x=c(1,2,3,4,5,6),
y=c(1,2,1,2,1,2),
z=c(6,5,4,3,2,1))
newdf <- df[,c(1:3)]
head(newdf,3)
## ID x y
## 1 a 1 1
## 2 b 2 2
## 3 c 3 1
The following code drops columns 1 to 2.
df <- data.frame(ID=c("a","b","c","d","e","f"),
x=c(1,2,3,4,5,6),
y=c(1,2,1,2,1,2),
z=c(6,5,4,3,2,1))
newdf <- df[,-c(1:2)]
head(newdf,3)
## y z
## 1 1 6
## 2 2 5
## 3 1 4
We can delete columns by name.
df <- data.frame(ID=c("a","b","c","d","e","f"),
x=c(1,2,3,4,5,6),
y=c(1,2,1,2,1,2),
z=c(6,5,4,3,2,1))
df$z<- NULL
head(df,3)
## ID x y
## 1 a 1 1
## 2 b 2 2
## 3 c 3 1
2.7.7 Selecting Rows (observation) based on row number
The functions head() and tail() allow as to directly obtain rows from the top and bottom.
The following code selects the top 3 observations.
df <- data.frame(ID=c("a","b","c","d","e","f"),
x=c(1,2,3,4,5,6),
y=c(1,2,1,2,1,2),
z=c(6,5,4,3,2,1))
newdf<-head(df,3)
newdf
## ID x y z
## 1 a 1 1 6
## 2 b 2 2 5
## 3 c 3 1 4
The following code drops the last 2 observations.
df <- data.frame(ID=c("a","b","c","d","e","f"),
x=c(1,2,3,4,5,6),
y=c(1,2,1,2,1,2),
z=c(6,5,4,3,2,1))
newdf<-head(df,-2)
newdf
## ID x y z
## 1 a 1 1 6
## 2 b 2 2 5
## 3 c 3 1 4
## 4 d 4 2 3
The following code selects the bottom 3 observations.
df <- data.frame(ID=c("a","b","c","d","e","f"),
x=c(1,2,3,4,5,6),
y=c(1,2,1,2,1,2),
z=c(6,5,4,3,2,1))
newdf <- tail(df,3)
newdf
## ID x y z
## 4 d 4 2 3
## 5 e 5 1 2
## 6 f 6 2 1
The following code drops the top 2 observations.
df <- data.frame(ID=c("a","b","c","d","e","f"),
x=c(1,2,3,4,5,6),
y=c(1,2,1,2,1,2),
z=c(6,5,4,3,2,1))
newdf <- tail(df,-2)
newdf
## ID x y z
## 3 c 3 1 4
## 4 d 4 2 3
## 5 e 5 1 2
## 6 f 6 2 1
The following code selects observations from row 2 to row 4.
df <- data.frame(ID=c("a","b","c","d","e","f"),
x=c(1,2,3,4,5,6),
y=c(1,2,1,2,1,2),
z=c(6,5,4,3,2,1))
newdf <- df[c(2:4),]
head(newdf,3)
## ID x y z
## 2 b 2 2 5
## 3 c 3 1 4
## 4 d 4 2 3
The following code excludes observations from row 2 to row 4.
df <- data.frame(ID=c("a","b","c","d","e","f"),
x=c(1,2,3,4,5,6),
y=c(1,2,1,2,1,2),
z=c(6,5,4,3,2,1))
newdf <- df[-c(2:4),]
head(newdf,3)
## ID x y z
## 1 a 1 1 6
## 5 e 5 1 2
## 6 f 6 2 1
2.7.8 Selecting Rows (observation) based on condition
To select observations based on conditions, we may use which. The following code selects observations such that \(y=1\).
df <- data.frame(ID=c("a","b","c","d","e","f"),
x=c(1,2,3,4,5,6),
y=c(1,2,1,2,1,2),
z=c(6,5,4,3,2,1))
newdf <- df[which(df$y==1), ]
newdf
## ID x y z
## 1 a 1 1 6
## 3 c 3 1 4
## 5 e 5 1 2
Here, which(df$y==1) returns a vector of row numbers such that \(y=1\).
The following code chooses observation that \(y=1\) and \(x>1\).
df <- data.frame(ID=c("a","b","c","d","e","f"),
x=c(1,2,3,4,5,6),
y=c(1,2,1,2,1,2),
z=c(6,5,4,3,2,1))
newdf <- df[which(df$y==1 & df$x>1), ]
newdf
## ID x y z
## 3 c 3 1 4
## 5 e 5 1 2
Here, which(df$y==1 & df$x>1) returns a vector of row numbers such that both \(y=1\) and \(x>1\).
2.7.9 Selecting rows (observations) and columns (variables)
To select observations based on conditions restricting to some columns, we use subset. To choose which column to include, we use ``select’’.
The following code selects columns of y and z when \(y=1\) and \(x>1\).
df <- data.frame(ID=c("a","b","c","d","e","f"),
x=c(1,2,3,4,5,6),
y=c(1,2,1,2,1,2),
z=c(6,5,4,3,2,1))
newdf <- subset(df, y== 2 & x>1, select= c(y,z))
newdf
## y z
## 2 2 5
## 4 2 3
## 6 2 1
2.7.10 Create New Column
If a new column is simple transformation of existing column, then we can just write the expression directly because R is vectorized.
df <- data.frame(ID=c("a","b","c","d","e","f"),
x=c(1,2,3,4,5,6),
y=c(1,2,1,2,1,2),
z=c(6,5,4,3,2,1))
df$a <- 2*df$x + 3*df$y -df$z
df
## ID x y z a
## 1 a 1 1 6 -1
## 2 b 2 2 5 5
## 3 c 3 1 4 5
## 4 d 4 2 3 11
## 5 e 5 1 2 11
## 6 f 6 2 1 17
df
## ID x y z a
## 1 a 1 1 6 -1
## 2 b 2 2 5 5
## 3 c 3 1 4 5
## 4 d 4 2 3 11
## 5 e 5 1 2 11
## 6 f 6 2 1 17
When we want to create a new column where each row depends on values of existing column, we can use ifelse().
df <- data.frame(ID=c("a","b","c","d","e","f"),
x=c(1,2,3,4,5,6),
y=c(1,2,1,2,1,2),
z=c(6,5,4,3,2,1))
df$a <- ifelse(df$x>3, 1,0)
df
## ID x y z a
## 1 a 1 1 6 0
## 2 b 2 2 5 0
## 3 c 3 1 4 0
## 4 d 4 2 3 1
## 5 e 5 1 2 1
## 6 f 6 2 1 1
2.7.11 Remove Duplicated Observation
To remove duplicate, the function is unique().
df <- data.frame(ID=c("a","b","a","b","a","b"),
x=c(1,2,3,4,1,2),
y=c(1,2,1,2,1,2))
newdf <- unique(df)
newdf
## ID x y
## 1 a 1 1
## 2 b 2 2
## 3 a 3 1
## 4 b 4 2
2.7.12 Collapse Data by Group
There are two ways to collapse data by applying function to group: (1) aggregate and (2) by.
The following calculate the mean of data by group.
df <- data.frame(ID=c("a","b","a","b","a","b"),
x=c(1,2,3,4,1,2),
y=c(1,2,1,2,1,2))
newdf <- aggregate(df$x, by=data.frame(df$ID), mean)
newdf
## df.ID x
## 1 a 1.666667
## 2 b 2.666667
The following is similar by use by(). H
df <- data.frame(ID=c("a","b","a","b","a","b"),
x=c(1,2,3,4,1,2),
y=c(1,2,1,2,1,2))
newdf <-by(df$x, df$ID, mean)
newdf
## df$ID: a
## [1] 1.666667
## --------------------------------------------------------
## df$ID: b
## [1] 2.666667
However, it essentially gives us a list object.
We can use cbind() to convert into a column vector:
cdf <-cbind(newdf)
cdf
## newdf
## a 1.666667
## b 2.666667
Or we can use rbind() to convert into a row vector:
rdf <-rbind(newdf)
rdf
## a b
## newdf 1.666667 2.666667