2.7 Data Management

We first cover data format conversion since sometimes the data is not in the right format. Second, we will learn how to load a built-in dataset to illustrate how to take a snapshot of the data, sorting the data, selecting observation, selecting variable, subsetting the dataset, and merging dataframes. Then we will learn two useful packages: dplyr and reshape. Finally, we apply it to world bank data.

2.7.1 Loading built-in dataset

Load a standard dataset and store it as a dataframe

df <- data.frame(mtcars)

2.7.2 Peeping Data

We may use the head() to checking the top rows and tail() to check bottom rows.

Here is the example data:

x <- 1:5
y <- seq(5,1,-1)
z <- c(1,1,2,2,3)
df <- data.frame(x,y,z)

Here we look at the first three row:

head(df, 3)

##   x y z
## 1 1 5 1
## 2 2 4 1
## 3 3 3 2

Here we look at the last three row:

tail(df, 3)

##   x y z
## 3 3 3 2
## 4 4 2 2
## 5 5 1 3

2.7.3 Sorting Data

Sorting data by giving multiple criteria using order.

The following code first sort the data by y (in an ascending order) and then break tie using z if they have the same y.

newdf <- df[order(df$y,df$z),]
newdf

##   x y z
## 5 5 1 3
## 4 4 2 2
## 3 3 3 2
## 2 2 4 1
## 1 1 5 1

2.7.4 Joining dataframes

To merge two dataframes horizontally by joining through an unique identifier, one may use merge().

df1<-data.frame(ID=c("a","b"), x=c(1,2))
df2<-data.frame(ID=c("b","a"), y=c(3,4))
df3 <-merge(df1,df2,by="ID")
df3

##   ID x y
## 1  a 1 4
## 2  b 2 3

If two dataframes are just to join horizontally without an unique identifier, then use cbind().

df1<-data.frame(ID=c("a","b"), x=c(1,2))
df2<-data.frame(ID=c("b","a"), y=c(3,4))
df3 <-cbind(df1,df2)
df3

##   ID x ID y
## 1  a 1  b 3
## 2  b 2  a 4

2.7.5 Stacking Dataframe

If dataframes are just to join vertically, then use rbind(). Note that rbind requires dataframes have the same columns names.

df1<-data.frame(ID=c("a","b"),
                x=c(1,2), y=c(1,2))
df2<-data.frame(ID=c("c","d"), 
                x=c(3,4), y=c(1,2))
df3<-data.frame(ID=c("e","f"),
                x=c(5,6), y=c(1,2))
df4 <-rbind(df1,df2,df3)
df4

##   ID x y
## 1  a 1 1
## 2  b 2 2
## 3  c 3 1
## 4  d 4 2
## 5  e 5 1
## 6  f 6 2

2.7.6 Selecting Columns (Variables)

The following code selecting columns 1 to 3.

df <- data.frame(ID=c("a","b","c","d","e","f"), 
                 x=c(1,2,3,4,5,6), 
                 y=c(1,2,1,2,1,2),
                 z=c(6,5,4,3,2,1))
newdf <- df[,c(1:3)]
head(newdf,3)

##   ID x y
## 1  a 1 1
## 2  b 2 2
## 3  c 3 1

The following code drops columns 1 to 2.

df <- data.frame(ID=c("a","b","c","d","e","f"), 
                 x=c(1,2,3,4,5,6), 
                 y=c(1,2,1,2,1,2),
                 z=c(6,5,4,3,2,1))
newdf <- df[,-c(1:2)]
head(newdf,3)

##   y z
## 1 1 6
## 2 2 5
## 3 1 4

We can delete columns by name.

df <- data.frame(ID=c("a","b","c","d","e","f"), 
                 x=c(1,2,3,4,5,6), 
                 y=c(1,2,1,2,1,2),
                 z=c(6,5,4,3,2,1))
df$z<- NULL
head(df,3)

##   ID x y
## 1  a 1 1
## 2  b 2 2
## 3  c 3 1

2.7.7 Selecting Rows (observation) based on row number

The functions head() and tail() allow as to directly obtain rows from the top and bottom.

The following code selects the top 3 observations.

df <- data.frame(ID=c("a","b","c","d","e","f"), 
                 x=c(1,2,3,4,5,6), 
                 y=c(1,2,1,2,1,2),
                 z=c(6,5,4,3,2,1))
newdf<-head(df,3)
newdf

##   ID x y z
## 1  a 1 1 6
## 2  b 2 2 5
## 3  c 3 1 4

The following code drops the last 2 observations.

df <- data.frame(ID=c("a","b","c","d","e","f"), 
                 x=c(1,2,3,4,5,6), 
                 y=c(1,2,1,2,1,2),
                 z=c(6,5,4,3,2,1))
newdf<-head(df,-2)
newdf

##   ID x y z
## 1  a 1 1 6
## 2  b 2 2 5
## 3  c 3 1 4
## 4  d 4 2 3

The following code selects the bottom 3 observations.

df <- data.frame(ID=c("a","b","c","d","e","f"), 
                 x=c(1,2,3,4,5,6), 
                 y=c(1,2,1,2,1,2),
                 z=c(6,5,4,3,2,1))
newdf <- tail(df,3)
newdf

##   ID x y z
## 4  d 4 2 3
## 5  e 5 1 2
## 6  f 6 2 1

The following code drops the top 2 observations.

df <- data.frame(ID=c("a","b","c","d","e","f"), 
                 x=c(1,2,3,4,5,6), 
                 y=c(1,2,1,2,1,2),
                 z=c(6,5,4,3,2,1))
newdf <- tail(df,-2)
newdf

##   ID x y z
## 3  c 3 1 4
## 4  d 4 2 3
## 5  e 5 1 2
## 6  f 6 2 1

The following code selects observations from row 2 to row 4.

df <- data.frame(ID=c("a","b","c","d","e","f"), 
                 x=c(1,2,3,4,5,6), 
                 y=c(1,2,1,2,1,2),
                 z=c(6,5,4,3,2,1))
newdf <- df[c(2:4),]
head(newdf,3)

##   ID x y z
## 2  b 2 2 5
## 3  c 3 1 4
## 4  d 4 2 3

The following code excludes observations from row 2 to row 4.

df <- data.frame(ID=c("a","b","c","d","e","f"), 
                 x=c(1,2,3,4,5,6), 
                 y=c(1,2,1,2,1,2),
                 z=c(6,5,4,3,2,1))
newdf <- df[-c(2:4),]
head(newdf,3)

##   ID x y z
## 1  a 1 1 6
## 5  e 5 1 2
## 6  f 6 2 1

2.7.8 Selecting Rows (observation) based on condition

To select observations based on conditions, we may use which. The following code selects observations such that $y=1$.

df <- data.frame(ID=c("a","b","c","d","e","f"), 
                 x=c(1,2,3,4,5,6), 
                 y=c(1,2,1,2,1,2),
                 z=c(6,5,4,3,2,1))
newdf <- df[which(df$y==1), ]
newdf

##   ID x y z
## 1  a 1 1 6
## 3  c 3 1 4
## 5  e 5 1 2

Here, which(df$y==1) returns a vector of row numbers such that $y=1$.

The following code chooses observation that $y=1$ and $x>1$.

df <- data.frame(ID=c("a","b","c","d","e","f"), 
                 x=c(1,2,3,4,5,6), 
                 y=c(1,2,1,2,1,2),
                 z=c(6,5,4,3,2,1))
newdf <- df[which(df$y==1 & df$x>1), ]
newdf

##   ID x y z
## 3  c 3 1 4
## 5  e 5 1 2

Here, which(df$y==1 & df$x>1) returns a vector of row numbers such that both $y=1$ and $x>1$.

2.7.9 Selecting rows (observations) and columns (variables)

To select observations based on conditions restricting to some columns, we use subset. To choose which column to include, we use ``select’’.

The following code selects columns of y and z when $y=1$ and $x>1$.

df <- data.frame(ID=c("a","b","c","d","e","f"), 
                 x=c(1,2,3,4,5,6), 
                 y=c(1,2,1,2,1,2),
                 z=c(6,5,4,3,2,1))
newdf <- subset(df, y== 2 & x>1, select= c(y,z))
newdf

##   y z
## 2 2 5
## 4 2 3
## 6 2 1

2.7.10 Create New Column

If a new column is simple transformation of existing column, then we can just write the expression directly because R is vectorized.

df <- data.frame(ID=c("a","b","c","d","e","f"), 
                 x=c(1,2,3,4,5,6), 
                 y=c(1,2,1,2,1,2),
                 z=c(6,5,4,3,2,1))
df$a <- 2*df$x + 3*df$y -df$z
df

##   ID x y z  a
## 1  a 1 1 6 -1
## 2  b 2 2 5  5
## 3  c 3 1 4  5
## 4  d 4 2 3 11
## 5  e 5 1 2 11
## 6  f 6 2 1 17

df

##   ID x y z  a
## 1  a 1 1 6 -1
## 2  b 2 2 5  5
## 3  c 3 1 4  5
## 4  d 4 2 3 11
## 5  e 5 1 2 11
## 6  f 6 2 1 17

When we want to create a new column where each row depends on values of existing column, we can use ifelse().

df <- data.frame(ID=c("a","b","c","d","e","f"), 
                 x=c(1,2,3,4,5,6), 
                 y=c(1,2,1,2,1,2),
                 z=c(6,5,4,3,2,1))
df$a <- ifelse(df$x>3, 1,0)
df

##   ID x y z a
## 1  a 1 1 6 0
## 2  b 2 2 5 0
## 3  c 3 1 4 0
## 4  d 4 2 3 1
## 5  e 5 1 2 1
## 6  f 6 2 1 1

2.7.11 Remove Duplicated Observation

To remove duplicate, the function is unique().

df <- data.frame(ID=c("a","b","a","b","a","b"), 
                 x=c(1,2,3,4,1,2), 
                 y=c(1,2,1,2,1,2))
newdf <- unique(df)
newdf

##   ID x y
## 1  a 1 1
## 2  b 2 2
## 3  a 3 1
## 4  b 4 2

2.7.12 Collapse Data by Group

There are two ways to collapse data by applying function to group: (1) aggregate and (2) by.

The following calculate the mean of data by group.

df <- data.frame(ID=c("a","b","a","b","a","b"), 
                 x=c(1,2,3,4,1,2), 
                 y=c(1,2,1,2,1,2))
newdf <- aggregate(df$x, by=data.frame(df$ID), mean)
newdf

##   df.ID        x
## 1     a 1.666667
## 2     b 2.666667

The following is similar by use by(). H

df <- data.frame(ID=c("a","b","a","b","a","b"), 
                 x=c(1,2,3,4,1,2), 
                 y=c(1,2,1,2,1,2))
newdf <-by(df$x, df$ID, mean)
newdf

## df$ID: a
## [1] 1.666667
## ------------------------------------------------------------ 
## df$ID: b
## [1] 2.666667

However, it essentially gives us a list object.

We can use cbind() to convert into a column vector:

cdf <-cbind(newdf)
cdf

##      newdf
## a 1.666667
## b 2.666667

Or we can use rbind() to convert into a row vector:

rdf <-rbind(newdf)
rdf

##              a        b
## newdf 1.666667 2.666667