2.6 Data Management

We first cover data format conversion since sometimes the data is not in the right format. Second, we will learn how to load a built-in dataset to illustrate how to take a snapshot of the data, sorting the data, selecting observation, selecting variable, subseting the dataset, and merging dataframes. Then we will learn two useful packages: dplyr and reshape. Finally, we apply it to world bank data.

2.6.1 Loading built-in dataset

Load a standard dataset and store it as a dataframe

df <- data.frame(mtcars)

2.6.2 Peeping Data

We may use the head to checking the top rows and tail to check bottom rows.

x <- 1:5
y <- seq(5,1,-1)
z <- c(1,1,2,2,3)
df <- data.frame(x,y,z)
head(df, 3) #first 3 row
##   x y z
## 1 1 5 1
## 2 2 4 1
## 3 3 3 2
tail(df, 3) #last 3 row
##   x y z
## 3 3 3 2
## 4 4 2 2
## 5 5 1 3

2.6.3 Sorting Data

Sorting data by giving multiple criteria using order.

The following code first sort the data by y (in an ascending order) and then break tie using disp if they have the same z.

newdf <- df[order(df$y,df$z),]
newdf
##   x y z
## 5 5 1 3
## 4 4 2 2
## 3 3 3 2
## 2 2 4 1
## 1 1 5 1

2.6.4 Joining dataframes

To merge two dataframes horizontally by joining through an unique identifier, one may use merge().

df1<-data.frame(ID=c("a","b"), x=c(1,2))
df2<-data.frame(ID=c("a","b"), y=c(3,4))
df3 <-merge(df1,df2,by="ID")
df3
##   ID x y
## 1  a 1 3
## 2  b 2 4

If two dataframes are just to join horizontally without an unique identifier, then use cbind().

df1<-data.frame(ID=c("a","b"), x=c(1,2))
df2<-data.frame(ID=c("a","b"), y=c(3,4))
df3 <-cbind(df1,df2)
df3
##   ID x ID y
## 1  a 1  a 3
## 2  b 2  b 4

2.6.5 Stacking Dataframe

If dataframes are just to join vertically, then use rbind(). Note that rbind requires dataframes have the same columns names.

df1<-data.frame(ID=c("a","b"),
                x=c(1,2), y=c(1,2))
df2<-data.frame(ID=c("c","d"), 
                x=c(3,4), y=c(1,2))
df3<-data.frame(ID=c("e","f"),
                x=c(5,6), y=c(1,2))
df4 <-rbind(df1,df2,df3)
df4
##   ID x y
## 1  a 1 1
## 2  b 2 2
## 3  c 3 1
## 4  d 4 2
## 5  e 5 1
## 6  f 6 2

2.6.6 Selecting Variables

The following code selecting columns 1 to 3.

df <- data.frame(ID=c("a","b","c","d","e","f"), 
                 x=c(1,2,3,4,5,6), 
                 y=c(1,2,1,2,1,2),
                 z=c(6,5,4,3,2,1))
newdf <- df[,c(1:3)]
head(newdf,3)
##   ID x y
## 1  a 1 1
## 2  b 2 2
## 3  c 3 1

The following code drops columns 1 to 2.

df <- data.frame(ID=c("a","b","c","d","e","f"), 
                 x=c(1,2,3,4,5,6), 
                 y=c(1,2,1,2,1,2),
                 z=c(6,5,4,3,2,1))
newdf <- df[,-c(1:2)]
head(newdf,3)
##   y z
## 1 1 6
## 2 2 5
## 3 1 4

We can delete columns by name.

df <- data.frame(ID=c("a","b","c","d","e","f"), 
                 x=c(1,2,3,4,5,6), 
                 y=c(1,2,1,2,1,2),
                 z=c(6,5,4,3,2,1))
df$z<- NULL
head(df,3)
##   ID x y
## 1  a 1 1
## 2  b 2 2
## 3  c 3 1

2.6.7 Selecting Observation based on row number

The functions head and tail allows as to directly obtain rows from the top and bottom.

The following code selects the top 3 observations.

df <- data.frame(ID=c("a","b","c","d","e","f"), 
                 x=c(1,2,3,4,5,6), 
                 y=c(1,2,1,2,1,2),
                 z=c(6,5,4,3,2,1))
newdf<-head(df,3)
newdf
##   ID x y z
## 1  a 1 1 6
## 2  b 2 2 5
## 3  c 3 1 4

The following code drops the last 2 observations.

df <- data.frame(ID=c("a","b","c","d","e","f"), 
                 x=c(1,2,3,4,5,6), 
                 y=c(1,2,1,2,1,2),
                 z=c(6,5,4,3,2,1))
newdf<-head(df,-2)
newdf
##   ID x y z
## 1  a 1 1 6
## 2  b 2 2 5
## 3  c 3 1 4
## 4  d 4 2 3

The following code selects the bottom 3 observations.

df <- data.frame(ID=c("a","b","c","d","e","f"), 
                 x=c(1,2,3,4,5,6), 
                 y=c(1,2,1,2,1,2),
                 z=c(6,5,4,3,2,1))
newdf <- tail(df,3)
newdf
##   ID x y z
## 4  d 4 2 3
## 5  e 5 1 2
## 6  f 6 2 1

The following code drops the top 2 observations.

df <- data.frame(ID=c("a","b","c","d","e","f"), 
                 x=c(1,2,3,4,5,6), 
                 y=c(1,2,1,2,1,2),
                 z=c(6,5,4,3,2,1))
newdf <- tail(df,-2)
newdf
##   ID x y z
## 3  c 3 1 4
## 4  d 4 2 3
## 5  e 5 1 2
## 6  f 6 2 1

The following code selects observations from row 2 to row 4.

df <- data.frame(ID=c("a","b","c","d","e","f"), 
                 x=c(1,2,3,4,5,6), 
                 y=c(1,2,1,2,1,2),
                 z=c(6,5,4,3,2,1))
newdf <- df[c(2:4),]
head(newdf,3)
##   ID x y z
## 2  b 2 2 5
## 3  c 3 1 4
## 4  d 4 2 3

The following code excludes observations from row 2 to row 4.

df <- data.frame(ID=c("a","b","c","d","e","f"), 
                 x=c(1,2,3,4,5,6), 
                 y=c(1,2,1,2,1,2),
                 z=c(6,5,4,3,2,1))
newdf <- df[-c(2:4),]
head(newdf,3)
##   ID x y z
## 1  a 1 1 6
## 5  e 5 1 2
## 6  f 6 2 1

2.6.8 Selecting Observation based on condition

To select observations based on conditions, we may use which’’. The following code selects observations such that y=1.

df <- data.frame(ID=c("a","b","c","d","e","f"), 
                 x=c(1,2,3,4,5,6), 
                 y=c(1,2,1,2,1,2),
                 z=c(6,5,4,3,2,1))
newdf <- df[which(df$y==1), ]
newdf
##   ID x y z
## 1  a 1 1 6
## 3  c 3 1 4
## 5  e 5 1 2

The following code chooses observation that y=1 and x>1.

df <- data.frame(ID=c("a","b","c","d","e","f"), 
                 x=c(1,2,3,4,5,6), 
                 y=c(1,2,1,2,1,2),
                 z=c(6,5,4,3,2,1))
newdf <- df[which(df$y==1 & df$x>1), ]
newdf
##   ID x y z
## 3  c 3 1 4
## 5  e 5 1 2

2.6.9 Selecting Observation and Columns

To select observations based on conditions restricting to some columns, we use subset. To choose which column to include, we use select’’.

The following code selects columns of y and z when y=1 and x>1.

df <- data.frame(ID=c("a","b","c","d","e","f"), 
                 x=c(1,2,3,4,5,6), 
                 y=c(1,2,1,2,1,2),
                 z=c(6,5,4,3,2,1))
newdf <- subset(df, y== 2 & x>1, select= c(y,z))
newdf
##   y z
## 2 2 5
## 4 2 3
## 6 2 1

2.6.10 Create New Column

If a new column is simple transformation of existing column, then we can just write the expression directly because R is vectorized.

df <- data.frame(ID=c("a","b","c","d","e","f"), 
                 x=c(1,2,3,4,5,6), 
                 y=c(1,2,1,2,1,2),
                 z=c(6,5,4,3,2,1))
df$a <- 2*df$x + 3*df$y -df$z
df
##   ID x y z  a
## 1  a 1 1 6 -1
## 2  b 2 2 5  5
## 3  c 3 1 4  5
## 4  d 4 2 3 11
## 5  e 5 1 2 11
## 6  f 6 2 1 17
df
##   ID x y z  a
## 1  a 1 1 6 -1
## 2  b 2 2 5  5
## 3  c 3 1 4  5
## 4  d 4 2 3 11
## 5  e 5 1 2 11
## 6  f 6 2 1 17

When we want to create a new column where each row depends on values of existing column, we can use ifelse().

df <- data.frame(ID=c("a","b","c","d","e","f"), 
                 x=c(1,2,3,4,5,6), 
                 y=c(1,2,1,2,1,2),
                 z=c(6,5,4,3,2,1))
df$a <- ifelse(df$x>3, 1,0)
df
##   ID x y z a
## 1  a 1 1 6 0
## 2  b 2 2 5 0
## 3  c 3 1 4 0
## 4  d 4 2 3 1
## 5  e 5 1 2 1
## 6  f 6 2 1 1

2.6.11 Remove Duplicated Observation

To remove duplicate, the function is unique().

df <- data.frame(ID=c("a","b","a","b","a","b"), 
                 x=c(1,2,3,4,1,2), 
                 y=c(1,2,1,2,1,2))
newdf <- unique(df)
newdf
##   ID x y
## 1  a 1 1
## 2  b 2 2
## 3  a 3 1
## 4  b 4 2

2.6.12 Collapse Data by Group

There are two ways to collapse data by applying function to group: (1) aggregate and (2) by.

The following calculate the mean of data by group.

df <- data.frame(ID=c("a","b","a","b","a","b"), 
                 x=c(1,2,3,4,1,2), 
                 y=c(1,2,1,2,1,2))
newdf <- aggregate(df$x, by=data.frame(df$ID), mean)
newdf
##   df.ID        x
## 1     a 1.666667
## 2     b 2.666667

The following is similar by use by()

df <- data.frame(ID=c("a","b","a","b","a","b"), 
                 x=c(1,2,3,4,1,2), 
                 y=c(1,2,1,2,1,2))
newdf <-by(df$x, df$ID, mean)
newdf
## df$ID: a
## [1] 1.666667
## -------------------------------------------------------- 
## df$ID: b
## [1] 2.666667