2.7 Data Management
We first cover data format conversion since sometimes the data is not in the right format. Second, we will learn how to load a built-in dataset to illustrate how to take a snapshot of the data, sorting the data, selecting observation, selecting variable, subsetting the dataset, and merging dataframes. Then we will learn two useful packages: dplyr and reshape. Finally, we apply it to world bank data.
2.7.1 Loading built-in dataset
Load a standard dataset and store it as a dataframe
df <- data.frame(mtcars)
2.7.2 Peeping Data
We may use the head() to checking the top rows and tail() to check bottom rows.
Here is the example data:
x <- 1:5
y <- seq(5,1,-1)
z <- c(1,1,2,2,3)
df <- data.frame(x,y,z)
Here we look at the first three row:
head(df, 3)
## x y z
## 1 1 5 1
## 2 2 4 1
## 3 3 3 2
Here we look at the last three row:
tail(df, 3)
## x y z
## 3 3 3 2
## 4 4 2 2
## 5 5 1 3
2.7.3 Sorting Data
Sorting data by giving multiple criteria using order.
The following code first sort the data by y (in an ascending order) and then break tie using z if they have the same y.
newdf <- df[order(df$y,df$z),]
newdf
## x y z
## 5 5 1 3
## 4 4 2 2
## 3 3 3 2
## 2 2 4 1
## 1 1 5 1
2.7.4 Joining dataframes
To merge two dataframes horizontally by joining through an unique identifier, one may use merge().
df1<-data.frame(ID=c("a","b"), x=c(1,2))
df2<-data.frame(ID=c("b","a"), y=c(3,4))
df3 <-merge(df1,df2,by="ID")
df3
## ID x y
## 1 a 1 4
## 2 b 2 3
If two dataframes are just to join horizontally without an unique identifier, then use cbind().
df1<-data.frame(ID=c("a","b"), x=c(1,2))
df2<-data.frame(ID=c("b","a"), y=c(3,4))
df3 <-cbind(df1,df2)
df3
## ID x ID y
## 1 a 1 b 3
## 2 b 2 a 4
2.7.5 Stacking Dataframe
If dataframes are just to join vertically, then use rbind(). Note that rbind requires dataframes have the same columns names.
df1<-data.frame(ID=c("a","b"),
x=c(1,2), y=c(1,2))
df2<-data.frame(ID=c("c","d"),
x=c(3,4), y=c(1,2))
df3<-data.frame(ID=c("e","f"),
x=c(5,6), y=c(1,2))
df4 <-rbind(df1,df2,df3)
df4
## ID x y
## 1 a 1 1
## 2 b 2 2
## 3 c 3 1
## 4 d 4 2
## 5 e 5 1
## 6 f 6 2
2.7.6 Selecting Columns (Variables)
The following code selecting columns 1 to 3.
df <- data.frame(ID=c("a","b","c","d","e","f"),
x=c(1,2,3,4,5,6),
y=c(1,2,1,2,1,2),
z=c(6,5,4,3,2,1))
newdf <- df[,c(1:3)]
head(newdf,3)
## ID x y
## 1 a 1 1
## 2 b 2 2
## 3 c 3 1
The following code drops columns 1 to 2.
df <- data.frame(ID=c("a","b","c","d","e","f"),
x=c(1,2,3,4,5,6),
y=c(1,2,1,2,1,2),
z=c(6,5,4,3,2,1))
newdf <- df[,-c(1:2)]
head(newdf,3)
## y z
## 1 1 6
## 2 2 5
## 3 1 4
We can delete columns by name.
df <- data.frame(ID=c("a","b","c","d","e","f"),
x=c(1,2,3,4,5,6),
y=c(1,2,1,2,1,2),
z=c(6,5,4,3,2,1))
df$z<- NULL
head(df,3)
## ID x y
## 1 a 1 1
## 2 b 2 2
## 3 c 3 1
2.7.7 Selecting Rows (observation) based on row number
The functions head() and tail() allow as to directly obtain rows from the top and bottom.
The following code selects the top 3 observations.
df <- data.frame(ID=c("a","b","c","d","e","f"),
x=c(1,2,3,4,5,6),
y=c(1,2,1,2,1,2),
z=c(6,5,4,3,2,1))
newdf<-head(df,3)
newdf
## ID x y z
## 1 a 1 1 6
## 2 b 2 2 5
## 3 c 3 1 4
The following code drops the last 2 observations.
df <- data.frame(ID=c("a","b","c","d","e","f"),
x=c(1,2,3,4,5,6),
y=c(1,2,1,2,1,2),
z=c(6,5,4,3,2,1))
newdf<-head(df,-2)
newdf
## ID x y z
## 1 a 1 1 6
## 2 b 2 2 5
## 3 c 3 1 4
## 4 d 4 2 3
The following code selects the bottom 3 observations.
df <- data.frame(ID=c("a","b","c","d","e","f"),
x=c(1,2,3,4,5,6),
y=c(1,2,1,2,1,2),
z=c(6,5,4,3,2,1))
newdf <- tail(df,3)
newdf
## ID x y z
## 4 d 4 2 3
## 5 e 5 1 2
## 6 f 6 2 1
The following code drops the top 2 observations.
df <- data.frame(ID=c("a","b","c","d","e","f"),
x=c(1,2,3,4,5,6),
y=c(1,2,1,2,1,2),
z=c(6,5,4,3,2,1))
newdf <- tail(df,-2)
newdf
## ID x y z
## 3 c 3 1 4
## 4 d 4 2 3
## 5 e 5 1 2
## 6 f 6 2 1
The following code selects observations from row 2 to row 4.
df <- data.frame(ID=c("a","b","c","d","e","f"),
x=c(1,2,3,4,5,6),
y=c(1,2,1,2,1,2),
z=c(6,5,4,3,2,1))
newdf <- df[c(2:4),]
head(newdf,3)
## ID x y z
## 2 b 2 2 5
## 3 c 3 1 4
## 4 d 4 2 3
The following code excludes observations from row 2 to row 4.
df <- data.frame(ID=c("a","b","c","d","e","f"),
x=c(1,2,3,4,5,6),
y=c(1,2,1,2,1,2),
z=c(6,5,4,3,2,1))
newdf <- df[-c(2:4),]
head(newdf,3)
## ID x y z
## 1 a 1 1 6
## 5 e 5 1 2
## 6 f 6 2 1
2.7.8 Selecting Rows (observation) based on condition
To select observations based on conditions, we may use which. The following code selects observations such that y=1.
df <- data.frame(ID=c("a","b","c","d","e","f"),
x=c(1,2,3,4,5,6),
y=c(1,2,1,2,1,2),
z=c(6,5,4,3,2,1))
newdf <- df[which(df$y==1), ]
newdf
## ID x y z
## 1 a 1 1 6
## 3 c 3 1 4
## 5 e 5 1 2
Here, which(df$y==1) returns a vector of row numbers such that y=1.
The following code chooses observation that y=1 and x>1.
df <- data.frame(ID=c("a","b","c","d","e","f"),
x=c(1,2,3,4,5,6),
y=c(1,2,1,2,1,2),
z=c(6,5,4,3,2,1))
newdf <- df[which(df$y==1 & df$x>1), ]
newdf
## ID x y z
## 3 c 3 1 4
## 5 e 5 1 2
Here, which(df$y==1 & df$x>1) returns a vector of row numbers such that both y=1 and x>1.
2.7.9 Selecting rows (observations) and columns (variables)
To select observations based on conditions restricting to some columns, we use subset. To choose which column to include, we use select’’.
The following code selects columns of y and z when y=1 and x>1.
df <- data.frame(ID=c("a","b","c","d","e","f"),
x=c(1,2,3,4,5,6),
y=c(1,2,1,2,1,2),
z=c(6,5,4,3,2,1))
newdf <- subset(df, y== 2 & x>1, select= c(y,z))
newdf
## y z
## 2 2 5
## 4 2 3
## 6 2 1
2.7.10 Create New Column
If a new column is simple transformation of existing column, then we can just write the expression directly because R is vectorized.
df <- data.frame(ID=c("a","b","c","d","e","f"),
x=c(1,2,3,4,5,6),
y=c(1,2,1,2,1,2),
z=c(6,5,4,3,2,1))
df$a <- 2*df$x + 3*df$y -df$z
df
## ID x y z a
## 1 a 1 1 6 -1
## 2 b 2 2 5 5
## 3 c 3 1 4 5
## 4 d 4 2 3 11
## 5 e 5 1 2 11
## 6 f 6 2 1 17
df
## ID x y z a
## 1 a 1 1 6 -1
## 2 b 2 2 5 5
## 3 c 3 1 4 5
## 4 d 4 2 3 11
## 5 e 5 1 2 11
## 6 f 6 2 1 17
When we want to create a new column where each row depends on values of existing column, we can use ifelse().
df <- data.frame(ID=c("a","b","c","d","e","f"),
x=c(1,2,3,4,5,6),
y=c(1,2,1,2,1,2),
z=c(6,5,4,3,2,1))
df$a <- ifelse(df$x>3, 1,0)
df
## ID x y z a
## 1 a 1 1 6 0
## 2 b 2 2 5 0
## 3 c 3 1 4 0
## 4 d 4 2 3 1
## 5 e 5 1 2 1
## 6 f 6 2 1 1
2.7.11 Remove Duplicated Observation
To remove duplicate, the function is unique().
df <- data.frame(ID=c("a","b","a","b","a","b"),
x=c(1,2,3,4,1,2),
y=c(1,2,1,2,1,2))
newdf <- unique(df)
newdf
## ID x y
## 1 a 1 1
## 2 b 2 2
## 3 a 3 1
## 4 b 4 2
2.7.12 Collapse Data by Group
There are two ways to collapse data by applying function to group: (1) aggregate and (2) by.
The following calculate the mean of data by group.
df <- data.frame(ID=c("a","b","a","b","a","b"),
x=c(1,2,3,4,1,2),
y=c(1,2,1,2,1,2))
newdf <- aggregate(df$x, by=data.frame(df$ID), mean)
newdf
## df.ID x
## 1 a 1.666667
## 2 b 2.666667
The following is similar by use by(). H
df <- data.frame(ID=c("a","b","a","b","a","b"),
x=c(1,2,3,4,1,2),
y=c(1,2,1,2,1,2))
newdf <-by(df$x, df$ID, mean)
newdf
## df$ID: a
## [1] 1.666667
## --------------------------------------------------------
## df$ID: b
## [1] 2.666667
However, it essentially gives us a list object.
We can use cbind() to convert into a column vector:
cdf <-cbind(newdf)
cdf
## newdf
## a 1.666667
## b 2.666667
Or we can use rbind() to convert into a row vector:
rdf <-rbind(newdf)
rdf
## a b
## newdf 1.666667 2.666667