2.6 Data Management
We first cover data format conversion since sometimes the data is not in the right format. Second, we will learn how to load a built-in dataset to illustrate how to take a snapshot of the data, sorting the data, selecting observation, selecting variable, subseting the dataset, and merging dataframes. Then we will learn two useful packages: dplyr and reshape. Finally, we apply it to world bank data.
2.6.1 Loading built-in dataset
Load a standard dataset and store it as a dataframe
df <- data.frame(mtcars)
2.6.2 Peeping Data
We may use the head to checking the top rows and tail to check bottom rows.
x <- 1:5
y <- seq(5,1,-1)
z <- c(1,1,2,2,3)
df <- data.frame(x,y,z)
head(df, 3) #first 3 row
## x y z
## 1 1 5 1
## 2 2 4 1
## 3 3 3 2
tail(df, 3) #last 3 row
## x y z
## 3 3 3 2
## 4 4 2 2
## 5 5 1 3
2.6.3 Sorting Data
Sorting data by giving multiple criteria using order.
The following code first sort the data by y (in an ascending order) and then break tie using disp if they have the same z.
newdf <- df[order(df$y,df$z),]
newdf
## x y z
## 5 5 1 3
## 4 4 2 2
## 3 3 3 2
## 2 2 4 1
## 1 1 5 1
2.6.4 Joining dataframes
To merge two dataframes horizontally by joining through an unique identifier, one may use merge().
df1<-data.frame(ID=c("a","b"), x=c(1,2))
df2<-data.frame(ID=c("a","b"), y=c(3,4))
df3 <-merge(df1,df2,by="ID")
df3
## ID x y
## 1 a 1 3
## 2 b 2 4
If two dataframes are just to join horizontally without an unique identifier, then use cbind().
df1<-data.frame(ID=c("a","b"), x=c(1,2))
df2<-data.frame(ID=c("a","b"), y=c(3,4))
df3 <-cbind(df1,df2)
df3
## ID x ID y
## 1 a 1 a 3
## 2 b 2 b 4
2.6.5 Stacking Dataframe
If dataframes are just to join vertically, then use rbind(). Note that rbind requires dataframes have the same columns names.
df1<-data.frame(ID=c("a","b"),
x=c(1,2), y=c(1,2))
df2<-data.frame(ID=c("c","d"),
x=c(3,4), y=c(1,2))
df3<-data.frame(ID=c("e","f"),
x=c(5,6), y=c(1,2))
df4 <-rbind(df1,df2,df3)
df4
## ID x y
## 1 a 1 1
## 2 b 2 2
## 3 c 3 1
## 4 d 4 2
## 5 e 5 1
## 6 f 6 2
2.6.6 Selecting Variables
The following code selecting columns 1 to 3.
df <- data.frame(ID=c("a","b","c","d","e","f"),
x=c(1,2,3,4,5,6),
y=c(1,2,1,2,1,2),
z=c(6,5,4,3,2,1))
newdf <- df[,c(1:3)]
head(newdf,3)
## ID x y
## 1 a 1 1
## 2 b 2 2
## 3 c 3 1
The following code drops columns 1 to 2.
df <- data.frame(ID=c("a","b","c","d","e","f"),
x=c(1,2,3,4,5,6),
y=c(1,2,1,2,1,2),
z=c(6,5,4,3,2,1))
newdf <- df[,-c(1:2)]
head(newdf,3)
## y z
## 1 1 6
## 2 2 5
## 3 1 4
We can delete columns by name.
df <- data.frame(ID=c("a","b","c","d","e","f"),
x=c(1,2,3,4,5,6),
y=c(1,2,1,2,1,2),
z=c(6,5,4,3,2,1))
df$z<- NULL
head(df,3)
## ID x y
## 1 a 1 1
## 2 b 2 2
## 3 c 3 1
2.6.7 Selecting Observation based on row number
The functions head and tail allows as to directly obtain rows from the top and bottom.
The following code selects the top 3 observations.
df <- data.frame(ID=c("a","b","c","d","e","f"),
x=c(1,2,3,4,5,6),
y=c(1,2,1,2,1,2),
z=c(6,5,4,3,2,1))
newdf<-head(df,3)
newdf
## ID x y z
## 1 a 1 1 6
## 2 b 2 2 5
## 3 c 3 1 4
The following code drops the last 2 observations.
df <- data.frame(ID=c("a","b","c","d","e","f"),
x=c(1,2,3,4,5,6),
y=c(1,2,1,2,1,2),
z=c(6,5,4,3,2,1))
newdf<-head(df,-2)
newdf
## ID x y z
## 1 a 1 1 6
## 2 b 2 2 5
## 3 c 3 1 4
## 4 d 4 2 3
The following code selects the bottom 3 observations.
df <- data.frame(ID=c("a","b","c","d","e","f"),
x=c(1,2,3,4,5,6),
y=c(1,2,1,2,1,2),
z=c(6,5,4,3,2,1))
newdf <- tail(df,3)
newdf
## ID x y z
## 4 d 4 2 3
## 5 e 5 1 2
## 6 f 6 2 1
The following code drops the top 2 observations.
df <- data.frame(ID=c("a","b","c","d","e","f"),
x=c(1,2,3,4,5,6),
y=c(1,2,1,2,1,2),
z=c(6,5,4,3,2,1))
newdf <- tail(df,-2)
newdf
## ID x y z
## 3 c 3 1 4
## 4 d 4 2 3
## 5 e 5 1 2
## 6 f 6 2 1
The following code selects observations from row 2 to row 4.
df <- data.frame(ID=c("a","b","c","d","e","f"),
x=c(1,2,3,4,5,6),
y=c(1,2,1,2,1,2),
z=c(6,5,4,3,2,1))
newdf <- df[c(2:4),]
head(newdf,3)
## ID x y z
## 2 b 2 2 5
## 3 c 3 1 4
## 4 d 4 2 3
The following code excludes observations from row 2 to row 4.
df <- data.frame(ID=c("a","b","c","d","e","f"),
x=c(1,2,3,4,5,6),
y=c(1,2,1,2,1,2),
z=c(6,5,4,3,2,1))
newdf <- df[-c(2:4),]
head(newdf,3)
## ID x y z
## 1 a 1 1 6
## 5 e 5 1 2
## 6 f 6 2 1
2.6.8 Selecting Observation based on condition
To select observations based on conditions, we may use ``which’’. The following code selects observations such that \(y=1\).
df <- data.frame(ID=c("a","b","c","d","e","f"),
x=c(1,2,3,4,5,6),
y=c(1,2,1,2,1,2),
z=c(6,5,4,3,2,1))
newdf <- df[which(df$y==1), ]
newdf
## ID x y z
## 1 a 1 1 6
## 3 c 3 1 4
## 5 e 5 1 2
The following code chooses observation that \(y=1\) and \(x>1\).
df <- data.frame(ID=c("a","b","c","d","e","f"),
x=c(1,2,3,4,5,6),
y=c(1,2,1,2,1,2),
z=c(6,5,4,3,2,1))
newdf <- df[which(df$y==1 & df$x>1), ]
newdf
## ID x y z
## 3 c 3 1 4
## 5 e 5 1 2
2.6.9 Selecting Observation and Columns
To select observations based on conditions restricting to some columns, we use subset. To choose which column to include, we use ``select’’.
The following code selects columns of y and z when \(y=1\) and \(x>1\).
df <- data.frame(ID=c("a","b","c","d","e","f"),
x=c(1,2,3,4,5,6),
y=c(1,2,1,2,1,2),
z=c(6,5,4,3,2,1))
newdf <- subset(df, y== 2 & x>1, select= c(y,z))
newdf
## y z
## 2 2 5
## 4 2 3
## 6 2 1
2.6.10 Create New Column
If a new column is simple transformation of existing column, then we can just write the expression directly because R is vectorized.
df <- data.frame(ID=c("a","b","c","d","e","f"),
x=c(1,2,3,4,5,6),
y=c(1,2,1,2,1,2),
z=c(6,5,4,3,2,1))
df$a <- 2*df$x + 3*df$y -df$z
df
## ID x y z a
## 1 a 1 1 6 -1
## 2 b 2 2 5 5
## 3 c 3 1 4 5
## 4 d 4 2 3 11
## 5 e 5 1 2 11
## 6 f 6 2 1 17
df
## ID x y z a
## 1 a 1 1 6 -1
## 2 b 2 2 5 5
## 3 c 3 1 4 5
## 4 d 4 2 3 11
## 5 e 5 1 2 11
## 6 f 6 2 1 17
When we want to create a new column where each row depends on values of existing column, we can use ifelse().
df <- data.frame(ID=c("a","b","c","d","e","f"),
x=c(1,2,3,4,5,6),
y=c(1,2,1,2,1,2),
z=c(6,5,4,3,2,1))
df$a <- ifelse(df$x>3, 1,0)
df
## ID x y z a
## 1 a 1 1 6 0
## 2 b 2 2 5 0
## 3 c 3 1 4 0
## 4 d 4 2 3 1
## 5 e 5 1 2 1
## 6 f 6 2 1 1
2.6.11 Remove Duplicated Observation
To remove duplicate, the function is unique().
df <- data.frame(ID=c("a","b","a","b","a","b"),
x=c(1,2,3,4,1,2),
y=c(1,2,1,2,1,2))
newdf <- unique(df)
newdf
## ID x y
## 1 a 1 1
## 2 b 2 2
## 3 a 3 1
## 4 b 4 2
2.6.12 Collapse Data by Group
There are two ways to collapse data by applying function to group: (1) aggregate and (2) by.
The following calculate the mean of data by group.
df <- data.frame(ID=c("a","b","a","b","a","b"),
x=c(1,2,3,4,1,2),
y=c(1,2,1,2,1,2))
newdf <- aggregate(df$x, by=data.frame(df$ID), mean)
newdf
## df.ID x
## 1 a 1.666667
## 2 b 2.666667
The following is similar by use by()
df <- data.frame(ID=c("a","b","a","b","a","b"),
x=c(1,2,3,4,1,2),
y=c(1,2,1,2,1,2))
newdf <-by(df$x, df$ID, mean)
newdf
## df$ID: a
## [1] 1.666667
## --------------------------------------------------------
## df$ID: b
## [1] 2.666667