2.7 Data Management

We first cover data format conversion, since data often does not arrive in the right format. Second, we load a built-in dataset and use small example dataframes to illustrate how to take a snapshot of the data, sort it, select observations, select variables, subset the dataset, and merge dataframes. Then we introduce two useful packages: dplyr and reshape. Finally, we apply these tools to World Bank data.

2.7.1 Loading built-in dataset

Load a standard built-in dataset (here mtcars) and store it as a dataframe:

df <- data.frame(mtcars)

2.7.2 Peeping Data

We may use head() to check the top rows and tail() to check the bottom rows.

Here is the example data:

x <- 1:5
y <- seq(5,1,-1)
z <- c(1,1,2,2,3)
df <- data.frame(x,y,z)

Here we look at the first three rows:

head(df, 3)

##   x y z
## 1 1 5 1
## 2 2 4 1
## 3 3 3 2

Here we look at the last three rows:

tail(df, 3)

##   x y z
## 3 3 3 2
## 4 4 2 2
## 5 5 1 3

2.7.3 Sorting Data

We can sort data on multiple criteria using order().

The following code first sorts the data by y (in ascending order) and then breaks ties using z when rows have the same y.

newdf <- df[order(df$y,df$z),]
newdf

##   x y z
## 5 5 1 3
## 4 4 2 2
## 3 3 3 2
## 2 2 4 1
## 1 1 5 1
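The example above sorts in ascending order only. As a minimal sketch on the same example data, a numeric column can be sorted in descending order by negating it inside order(). Here we sort by z in descending order and break ties by x in ascending order.

```r
# Same example data as above
df <- data.frame(x = 1:5, y = seq(5, 1, -1), z = c(1, 1, 2, 2, 3))

# Negating a numeric column reverses its sort direction:
# z descending first, then x ascending to break ties.
newdf <- df[order(-df$z, df$x), ]
newdf

##   x y z
## 5 5 1 3
## 3 3 3 2
## 4 4 2 2
## 1 1 5 1
## 2 2 4 1
```

Negation only works for numeric columns; for a single sort key, order() also accepts decreasing = TRUE.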

2.7.4 Joining dataframes

To merge two dataframes horizontally by matching rows on a unique identifier, one may use merge().

df1<-data.frame(ID=c("a","b"), x=c(1,2))
df2<-data.frame(ID=c("b","a"), y=c(3,4))
df3 <-merge(df1,df2,by="ID")
df3

##   ID x y
## 1  a 1 4
## 2  b 2 3
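By default merge() keeps only the identifiers found in both dataframes. The sketch below uses hypothetical data in which some IDs appear in only one dataframe, to show how the all.x and all arguments of merge() keep unmatched rows and fill the missing values with NA.

```r
# Hypothetical data: "c" appears only in df1, "d" only in df2
df1 <- data.frame(ID = c("a", "b", "c"), x = c(1, 2, 3))
df2 <- data.frame(ID = c("b", "a", "d"), y = c(3, 4, 5))

inner <- merge(df1, df2, by = "ID")                # IDs in both: a, b
left  <- merge(df1, df2, by = "ID", all.x = TRUE)  # all IDs of df1: a, b, c
full  <- merge(df1, df2, by = "ID", all = TRUE)    # all IDs: a, b, c, d

nrow(inner)  # 2
nrow(left)   # 3, with y = NA for ID "c"
nrow(full)   # 4
```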

If two dataframes are simply to be placed side by side without matching on a unique identifier, then use cbind(). Rows are combined by position, so the IDs in the output below do not line up.

df1<-data.frame(ID=c("a","b"), x=c(1,2))
df2<-data.frame(ID=c("b","a"), y=c(3,4))
df3 <-cbind(df1,df2)
df3

##   ID x ID y
## 1  a 1  b 3
## 2  b 2  a 4

2.7.5 Stacking Dataframe

If dataframes are simply to be joined vertically, then use rbind(). Note that rbind() requires the dataframes to have the same column names.

df1<-data.frame(ID=c("a","b"),
                x=c(1,2), y=c(1,2))
df2<-data.frame(ID=c("c","d"), 
                x=c(3,4), y=c(1,2))
df3<-data.frame(ID=c("e","f"),
                x=c(5,6), y=c(1,2))
df4 <-rbind(df1,df2,df3)
df4

##   ID x y
## 1  a 1 1
## 2  b 2 2
## 3  c 3 1
## 4  d 4 2
## 5  e 5 1
## 6  f 6 2
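To make the column-name requirement concrete, here is a small sketch with hypothetical dataframes. For dataframes, rbind() matches columns by name rather than by position, so a different column order is fine, while a different column name raises an error (caught here with tryCatch() so the script keeps running).

```r
df1 <- data.frame(ID = c("a", "b"), x = c(1, 2))
df2 <- data.frame(x = c(3, 4), ID = c("c", "d"))  # same names, different order

# Columns are aligned by name, so x and ID end up in the right place.
stacked <- rbind(df1, df2)
stacked

df3 <- data.frame(ID = c("e", "f"), w = c(5, 6))  # column w instead of x

# Mismatched names make rbind() fail; we record that an error occurred.
result <- tryCatch(rbind(df1, df3), error = function(e) "error")
result
```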

2.7.6 Selecting Columns (Variables)

The following code selects columns 1 to 3.

df <- data.frame(ID=c("a","b","c","d","e","f"), 
                 x=c(1,2,3,4,5,6), 
                 y=c(1,2,1,2,1,2),
                 z=c(6,5,4,3,2,1))
newdf <- df[,c(1:3)]
head(newdf,3)

##   ID x y
## 1  a 1 1
## 2  b 2 2
## 3  c 3 1

The following code drops columns 1 to 2.

df <- data.frame(ID=c("a","b","c","d","e","f"), 
                 x=c(1,2,3,4,5,6), 
                 y=c(1,2,1,2,1,2),
                 z=c(6,5,4,3,2,1))
newdf <- df[,-c(1:2)]
head(newdf,3)

##   y z
## 1 1 6
## 2 2 5
## 3 1 4

We can delete columns by name.

df <- data.frame(ID=c("a","b","c","d","e","f"), 
                 x=c(1,2,3,4,5,6), 
                 y=c(1,2,1,2,1,2),
                 z=c(6,5,4,3,2,1))
df$z<- NULL
head(df,3)

##   ID x y
## 1  a 1 1
## 2  b 2 2
## 3  c 3 1
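Columns can also be selected or dropped by name instead of by position, which keeps working if the column order changes. The following is a minimal sketch on a shortened version of the example data; the names keep, drop_xy, and one_col are only illustrative.

```r
df <- data.frame(ID = c("a", "b", "c"),
                 x = c(1, 2, 3),
                 y = c(1, 2, 1),
                 z = c(6, 5, 4))

# Keep columns by name
keep <- df[, c("ID", "z")]

# Drop columns by name: keep every column whose name is not listed
drop_xy <- df[, !(names(df) %in% c("x", "y"))]

# Selecting a single column returns a vector unless drop = FALSE is given
one_col <- df[, "x", drop = FALSE]
```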

2.7.7 Selecting Rows (observation) based on row number

The functions head() and tail() allow us to directly obtain rows from the top and bottom.

The following code selects the top 3 observations.

df <- data.frame(ID=c("a","b","c","d","e","f"), 
                 x=c(1,2,3,4,5,6), 
                 y=c(1,2,1,2,1,2),
                 z=c(6,5,4,3,2,1))
newdf<-head(df,3)
newdf

##   ID x y z
## 1  a 1 1 6
## 2  b 2 2 5
## 3  c 3 1 4

The following code drops the last 2 observations.

df <- data.frame(ID=c("a","b","c","d","e","f"), 
                 x=c(1,2,3,4,5,6), 
                 y=c(1,2,1,2,1,2),
                 z=c(6,5,4,3,2,1))
newdf<-head(df,-2)
newdf

##   ID x y z
## 1  a 1 1 6
## 2  b 2 2 5
## 3  c 3 1 4
## 4  d 4 2 3

The following code selects the bottom 3 observations.

df <- data.frame(ID=c("a","b","c","d","e","f"), 
                 x=c(1,2,3,4,5,6), 
                 y=c(1,2,1,2,1,2),
                 z=c(6,5,4,3,2,1))
newdf <- tail(df,3)
newdf

##   ID x y z
## 4  d 4 2 3
## 5  e 5 1 2
## 6  f 6 2 1

The following code drops the top 2 observations.

df <- data.frame(ID=c("a","b","c","d","e","f"), 
                 x=c(1,2,3,4,5,6), 
                 y=c(1,2,1,2,1,2),
                 z=c(6,5,4,3,2,1))
newdf <- tail(df,-2)
newdf

##   ID x y z
## 3  c 3 1 4
## 4  d 4 2 3
## 5  e 5 1 2
## 6  f 6 2 1

The following code selects observations from row 2 to row 4.

df <- data.frame(ID=c("a","b","c","d","e","f"), 
                 x=c(1,2,3,4,5,6), 
                 y=c(1,2,1,2,1,2),
                 z=c(6,5,4,3,2,1))
newdf <- df[c(2:4),]
head(newdf,3)

##   ID x y z
## 2  b 2 2 5
## 3  c 3 1 4
## 4  d 4 2 3

The following code excludes observations from row 2 to row 4.

df <- data.frame(ID=c("a","b","c","d","e","f"), 
                 x=c(1,2,3,4,5,6), 
                 y=c(1,2,1,2,1,2),
                 z=c(6,5,4,3,2,1))
newdf <- df[-c(2:4),]
head(newdf,3)

##   ID x y z
## 1  a 1 1 6
## 5  e 5 1 2
## 6  f 6 2 1

2.7.8 Selecting Rows (observation) based on condition

To select observations based on conditions, we may use which(). The following code selects the observations with $y=1$.

df <- data.frame(ID=c("a","b","c","d","e","f"), 
                 x=c(1,2,3,4,5,6), 
                 y=c(1,2,1,2,1,2),
                 z=c(6,5,4,3,2,1))
newdf <- df[which(df$y==1), ]
newdf

##   ID x y z
## 1  a 1 1 6
## 3  c 3 1 4
## 5  e 5 1 2

Here, which(df$y==1) returns the vector of row numbers for which $y=1$.

The following code selects the observations with $y=1$ and $x>1$.

df <- data.frame(ID=c("a","b","c","d","e","f"), 
                 x=c(1,2,3,4,5,6), 
                 y=c(1,2,1,2,1,2),
                 z=c(6,5,4,3,2,1))
newdf <- df[which(df$y==1 & df$x>1), ]
newdf

##   ID x y z
## 3  c 3 1 4
## 5  e 5 1 2

Here, which(df$y==1 & df$x>1) returns the vector of row numbers for which both $y=1$ and $x>1$.
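One reason to wrap the condition in which() is the treatment of missing values. The sketch below uses hypothetical data with an NA in y: indexing with the logical vector directly produces an extra row of NAs, while which() silently drops the rows where the condition is NA.

```r
# Hypothetical data with a missing value in y
df <- data.frame(ID = c("a", "b", "c", "d"),
                 x = c(1, 2, 3, 4),
                 y = c(1, NA, 1, 2))

# which() returns only the positions where the condition is TRUE (1 and 3)
with_which <- df[which(df$y == 1), ]

# A logical index containing NA yields an all-NA row for that position
with_logical <- df[df$y == 1, ]

nrow(with_which)    # 2
nrow(with_logical)  # 3
```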

2.7.9 Selecting rows (observations) and columns (variables)

To select observations that satisfy a condition while keeping only some columns, we use subset(). Its select argument specifies which columns to include.

The following code selects columns y and z for the observations with $y=2$ and $x>1$.

df <- data.frame(ID=c("a","b","c","d","e","f"), 
                 x=c(1,2,3,4,5,6), 
                 y=c(1,2,1,2,1,2),
                 z=c(6,5,4,3,2,1))
newdf <- subset(df, y== 2 & x>1, select= c(y,z))
newdf

##   y z
## 2 2 5
## 4 2 3
## 6 2 1
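For comparison, the same selection can be written with bracket indexing, which combines the row condition from the previous section with column selection by name. This is a sketch on the same example data; the names by_subset and by_index are only illustrative.

```r
df <- data.frame(ID = c("a", "b", "c", "d", "e", "f"),
                 x = c(1, 2, 3, 4, 5, 6),
                 y = c(1, 2, 1, 2, 1, 2),
                 z = c(6, 5, 4, 3, 2, 1))

# subset() evaluates the condition and select inside the dataframe
by_subset <- subset(df, y == 2 & x > 1, select = c(y, z))

# Bracket indexing needs explicit df$ prefixes and quoted column names
by_index <- df[df$y == 2 & df$x > 1, c("y", "z")]
by_index
```

subset() is convenient for interactive work; inside functions, explicit bracket indexing tends to be the safer choice because it does not rely on how column names are looked up.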

2.7.10 Create New Column

If a new column is a simple transformation of existing columns, then we can write the expression directly because R is vectorized.

df <- data.frame(ID=c("a","b","c","d","e","f"), 
                 x=c(1,2,3,4,5,6), 
                 y=c(1,2,1,2,1,2),
                 z=c(6,5,4,3,2,1))
df$a <- 2*df$x + 3*df$y -df$z
df

##   ID x y z  a
## 1  a 1 1 6 -1
## 2  b 2 2 5  5
## 3  c 3 1 4  5
## 4  d 4 2 3 11
## 5  e 5 1 2 11
## 6  f 6 2 1 17

When we want to create a new column whose value in each row depends on a condition on an existing column, we can use ifelse().

df <- data.frame(ID=c("a","b","c","d","e","f"), 
                 x=c(1,2,3,4,5,6), 
                 y=c(1,2,1,2,1,2),
                 z=c(6,5,4,3,2,1))
df$a <- ifelse(df$x>3, 1,0)
df

##   ID x y z a
## 1  a 1 1 6 0
## 2  b 2 2 5 0
## 3  c 3 1 4 0
## 4  d 4 2 3 1
## 5  e 5 1 2 1
## 6  f 6 2 1 1
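When there are more than two cases, ifelse() calls can be nested. The following sketch, with hypothetical cutoffs and a hypothetical column name size, labels each row as "low", "mid", or "high" according to x.

```r
df <- data.frame(ID = c("a", "b", "c", "d", "e", "f"),
                 x = c(1, 2, 3, 4, 5, 6))

# The inner ifelse() is only used for rows where x <= 2 is FALSE
df$size <- ifelse(df$x <= 2, "low",
                  ifelse(df$x <= 4, "mid", "high"))
df

##   ID x size
## 1  a 1  low
## 2  b 2  low
## 3  c 3  mid
## 4  d 4  mid
## 5  e 5 high
## 6  f 6 high
```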

2.7.11 Remove Duplicated Observation

To remove duplicated rows, use the function unique().

df <- data.frame(ID=c("a","b","a","b","a","b"), 
                 x=c(1,2,3,4,1,2), 
                 y=c(1,2,1,2,1,2))
newdf <- unique(df)
newdf

##   ID x y
## 1  a 1 1
## 2  b 2 2
## 3  a 3 1
## 4  b 4 2
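unique() treats two rows as duplicates only when they agree in every column. As a sketch on the same data, duplicated() returns the underlying logical flags, and applying it to a single column keeps the first row for each value of that column. The names dup and first_per_id are only illustrative.

```r
df <- data.frame(ID = c("a", "b", "a", "b", "a", "b"),
                 x = c(1, 2, 3, 4, 1, 2),
                 y = c(1, 2, 1, 2, 1, 2))

# TRUE marks a row that repeats an earlier row in all columns
dup <- duplicated(df)
dup
## [1] FALSE FALSE FALSE FALSE  TRUE  TRUE

# Keep only the first row for each ID, ignoring the other columns
first_per_id <- df[!duplicated(df$ID), ]
first_per_id

##   ID x y
## 1  a 1 1
## 2  b 2 2
```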

2.7.12 Collapse Data by Group

There are two ways to collapse data by applying a function to each group: (1) aggregate() and (2) by().

The following code calculates the mean of x by group.

df <- data.frame(ID=c("a","b","a","b","a","b"), 
                 x=c(1,2,3,4,1,2), 
                 y=c(1,2,1,2,1,2))
newdf <- aggregate(df$x, by=data.frame(df$ID), mean)
newdf

##   df.ID        x
## 1     a 1.666667
## 2     b 2.666667
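aggregate() also has a formula interface, which gives cleaner column names in the result (ID instead of df.ID). The following is a sketch on the same data; cbind() on the left-hand side of the formula is used to summarize several columns at once.

```r
df <- data.frame(ID = c("a", "b", "a", "b", "a", "b"),
                 x = c(1, 2, 3, 4, 1, 2),
                 y = c(1, 2, 1, 2, 1, 2))

# Mean of x for each ID; the result has columns ID and x
newdf <- aggregate(x ~ ID, data = df, FUN = mean)
newdf

##   ID        x
## 1  a 1.666667
## 2  b 2.666667

# Means of x and y for each ID in one call
both <- aggregate(cbind(x, y) ~ ID, data = df, FUN = mean)
```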

The following is similar but uses by().

df <- data.frame(ID=c("a","b","a","b","a","b"), 
                 x=c(1,2,3,4,1,2), 
                 y=c(1,2,1,2,1,2))
newdf <-by(df$x, df$ID, mean)
newdf

## df$ID: a
## [1] 1.666667
## -------------------------------------------------------- 
## df$ID: b
## [1] 2.666667

However, the result is an object of class by rather than a dataframe.

We can use cbind() to convert it into a column vector:

cdf <-cbind(newdf)
cdf

##      newdf
## a 1.666667
## b 2.666667

Or we can use rbind() to convert it into a row vector:

rdf <-rbind(newdf)
rdf

##              a        b
## newdf 1.666667 2.666667
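If a regular dataframe of group means is wanted, one option is tapply(), which returns the means as a named array that is easy to unpack. This is a sketch of an alternative, not part of the by() approach above; the names m, groupdf, and mean_x are only illustrative.

```r
df <- data.frame(ID = c("a", "b", "a", "b", "a", "b"),
                 x = c(1, 2, 3, 4, 1, 2),
                 y = c(1, 2, 1, 2, 1, 2))

# Group means of x, named by the values of ID
m <- tapply(df$x, df$ID, mean)

# Unpack the names and values into a two-column dataframe
groupdf <- data.frame(ID = names(m), mean_x = as.vector(m))
groupdf
```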