## 2.7 Data Management

We first cover data format conversion since sometimes the data is not in the right format. Second, we will learn how to load a built-in dataset to illustrate how to take a snapshot of the data, sorting the data, selecting observation, selecting variable, subsetting the dataset, and merging dataframes. Then we will learn two useful packages: dplyr and reshape. Finally, we apply it to world bank data.

Load a standard dataset and store it as a dataframe

df <- data.frame(mtcars)

### 2.7.2 Peeping Data

We may use the head() to checking the top rows and tail() to check bottom rows.

Here is the example data:

x <- 1:5
y <- seq(5,1,-1)
z <- c(1,1,2,2,3)
df <- data.frame(x,y,z)

Here we look at the first three row:

head(df, 3) 
##   x y z
## 1 1 5 1
## 2 2 4 1
## 3 3 3 2

Here we look at the last three row:

tail(df, 3)
##   x y z
## 3 3 3 2
## 4 4 2 2
## 5 5 1 3

### 2.7.3 Sorting Data

Sorting data by giving multiple criteria using order.

The following code first sort the data by y (in an ascending order) and then break tie using z if they have the same y.

newdf <- df[order(df$y,df$z),]
newdf
##   x y z
## 5 5 1 3
## 4 4 2 2
## 3 3 3 2
## 2 2 4 1
## 1 1 5 1

### 2.7.4 Joining dataframes

To merge two dataframes horizontally by joining through an unique identifier, one may use merge().

df1<-data.frame(ID=c("a","b"), x=c(1,2))
df2<-data.frame(ID=c("b","a"), y=c(3,4))
df3 <-merge(df1,df2,by="ID")
df3
##   ID x y
## 1  a 1 4
## 2  b 2 3

If two dataframes are just to join horizontally without an unique identifier, then use cbind().

df1<-data.frame(ID=c("a","b"), x=c(1,2))
df2<-data.frame(ID=c("b","a"), y=c(3,4))
df3 <-cbind(df1,df2)
df3
##   ID x ID y
## 1  a 1  b 3
## 2  b 2  a 4

### 2.7.5 Stacking Dataframe

If dataframes are just to join vertically, then use rbind(). Note that rbind requires dataframes have the same columns names.

df1<-data.frame(ID=c("a","b"),
x=c(1,2), y=c(1,2))
df2<-data.frame(ID=c("c","d"),
x=c(3,4), y=c(1,2))
df3<-data.frame(ID=c("e","f"),
x=c(5,6), y=c(1,2))
df4 <-rbind(df1,df2,df3)
df4
##   ID x y
## 1  a 1 1
## 2  b 2 2
## 3  c 3 1
## 4  d 4 2
## 5  e 5 1
## 6  f 6 2

### 2.7.6 Selecting Columns (Variables)

The following code selecting columns 1 to 3.

df <- data.frame(ID=c("a","b","c","d","e","f"),
x=c(1,2,3,4,5,6),
y=c(1,2,1,2,1,2),
z=c(6,5,4,3,2,1))
newdf <- df[,c(1:3)]
head(newdf,3)
##   ID x y
## 1  a 1 1
## 2  b 2 2
## 3  c 3 1

The following code drops columns 1 to 2.

df <- data.frame(ID=c("a","b","c","d","e","f"),
x=c(1,2,3,4,5,6),
y=c(1,2,1,2,1,2),
z=c(6,5,4,3,2,1))
newdf <- df[,-c(1:2)]
head(newdf,3)
##   y z
## 1 1 6
## 2 2 5
## 3 1 4

We can delete columns by name.

df <- data.frame(ID=c("a","b","c","d","e","f"),
x=c(1,2,3,4,5,6),
y=c(1,2,1,2,1,2),
z=c(6,5,4,3,2,1))
df$z<- NULL head(df,3) ## ID x y ## 1 a 1 1 ## 2 b 2 2 ## 3 c 3 1 ### 2.7.7 Selecting Rows (observation) based on row number The functions head() and tail() allow as to directly obtain rows from the top and bottom. The following code selects the top 3 observations. df <- data.frame(ID=c("a","b","c","d","e","f"), x=c(1,2,3,4,5,6), y=c(1,2,1,2,1,2), z=c(6,5,4,3,2,1)) newdf<-head(df,3) newdf ## ID x y z ## 1 a 1 1 6 ## 2 b 2 2 5 ## 3 c 3 1 4 The following code drops the last 2 observations. df <- data.frame(ID=c("a","b","c","d","e","f"), x=c(1,2,3,4,5,6), y=c(1,2,1,2,1,2), z=c(6,5,4,3,2,1)) newdf<-head(df,-2) newdf ## ID x y z ## 1 a 1 1 6 ## 2 b 2 2 5 ## 3 c 3 1 4 ## 4 d 4 2 3 The following code selects the bottom 3 observations. df <- data.frame(ID=c("a","b","c","d","e","f"), x=c(1,2,3,4,5,6), y=c(1,2,1,2,1,2), z=c(6,5,4,3,2,1)) newdf <- tail(df,3) newdf ## ID x y z ## 4 d 4 2 3 ## 5 e 5 1 2 ## 6 f 6 2 1 The following code drops the top 2 observations. df <- data.frame(ID=c("a","b","c","d","e","f"), x=c(1,2,3,4,5,6), y=c(1,2,1,2,1,2), z=c(6,5,4,3,2,1)) newdf <- tail(df,-2) newdf ## ID x y z ## 3 c 3 1 4 ## 4 d 4 2 3 ## 5 e 5 1 2 ## 6 f 6 2 1 The following code selects observations from row 2 to row 4. df <- data.frame(ID=c("a","b","c","d","e","f"), x=c(1,2,3,4,5,6), y=c(1,2,1,2,1,2), z=c(6,5,4,3,2,1)) newdf <- df[c(2:4),] head(newdf,3) ## ID x y z ## 2 b 2 2 5 ## 3 c 3 1 4 ## 4 d 4 2 3 The following code excludes observations from row 2 to row 4. df <- data.frame(ID=c("a","b","c","d","e","f"), x=c(1,2,3,4,5,6), y=c(1,2,1,2,1,2), z=c(6,5,4,3,2,1)) newdf <- df[-c(2:4),] head(newdf,3) ## ID x y z ## 1 a 1 1 6 ## 5 e 5 1 2 ## 6 f 6 2 1 ### 2.7.8 Selecting Rows (observation) based on condition To select observations based on conditions, we may use which. The following code selects observations such that $$y=1$$. df <- data.frame(ID=c("a","b","c","d","e","f"), x=c(1,2,3,4,5,6), y=c(1,2,1,2,1,2), z=c(6,5,4,3,2,1)) newdf <- df[which(df$y==1), ]
newdf
##   ID x y z
## 1  a 1 1 6
## 3  c 3 1 4
## 5  e 5 1 2

Here, which(df$y==1) returns a vector of row numbers such that $$y=1$$. The following code chooses observation that $$y=1$$ and $$x>1$$. df <- data.frame(ID=c("a","b","c","d","e","f"), x=c(1,2,3,4,5,6), y=c(1,2,1,2,1,2), z=c(6,5,4,3,2,1)) newdf <- df[which(df$y==1 & df$x>1), ] newdf ## ID x y z ## 3 c 3 1 4 ## 5 e 5 1 2 Here, which(df$y==1 & df$x>1) returns a vector of row numbers such that both $$y=1$$ and $$x>1$$. ### 2.7.9 Selecting rows (observations) and columns (variables) To select observations based on conditions restricting to some columns, we use subset. To choose which column to include, we use selectâ€™â€™. The following code selects columns of y and z when $$y=1$$ and $$x>1$$. df <- data.frame(ID=c("a","b","c","d","e","f"), x=c(1,2,3,4,5,6), y=c(1,2,1,2,1,2), z=c(6,5,4,3,2,1)) newdf <- subset(df, y== 2 & x>1, select= c(y,z)) newdf ## y z ## 2 2 5 ## 4 2 3 ## 6 2 1 ### 2.7.10 Create New Column If a new column is simple transformation of existing column, then we can just write the expression directly because R is vectorized. df <- data.frame(ID=c("a","b","c","d","e","f"), x=c(1,2,3,4,5,6), y=c(1,2,1,2,1,2), z=c(6,5,4,3,2,1)) df$a <- 2*df$x + 3*df$y -df$z df ## ID x y z a ## 1 a 1 1 6 -1 ## 2 b 2 2 5 5 ## 3 c 3 1 4 5 ## 4 d 4 2 3 11 ## 5 e 5 1 2 11 ## 6 f 6 2 1 17 df ## ID x y z a ## 1 a 1 1 6 -1 ## 2 b 2 2 5 5 ## 3 c 3 1 4 5 ## 4 d 4 2 3 11 ## 5 e 5 1 2 11 ## 6 f 6 2 1 17 When we want to create a new column where each row depends on values of existing column, we can use ifelse(). df <- data.frame(ID=c("a","b","c","d","e","f"), x=c(1,2,3,4,5,6), y=c(1,2,1,2,1,2), z=c(6,5,4,3,2,1)) df$a <- ifelse(df$x>3, 1,0) df ## ID x y z a ## 1 a 1 1 6 0 ## 2 b 2 2 5 0 ## 3 c 3 1 4 0 ## 4 d 4 2 3 1 ## 5 e 5 1 2 1 ## 6 f 6 2 1 1 ### 2.7.11 Remove Duplicated Observation To remove duplicate, the function is unique(). df <- data.frame(ID=c("a","b","a","b","a","b"), x=c(1,2,3,4,1,2), y=c(1,2,1,2,1,2)) newdf <- unique(df) newdf ## ID x y ## 1 a 1 1 ## 2 b 2 2 ## 3 a 3 1 ## 4 b 4 2 ### 2.7.12 Collapse Data by Group There are two ways to collapse data by applying function to group: (1) aggregate and (2) by. The following calculate the mean of data by group. df <- data.frame(ID=c("a","b","a","b","a","b"), x=c(1,2,3,4,1,2), y=c(1,2,1,2,1,2)) newdf <- aggregate(df$x, by=data.frame(df$ID), mean) newdf ## df.ID x ## 1 a 1.666667 ## 2 b 2.666667 The following is similar by use by(). H df <- data.frame(ID=c("a","b","a","b","a","b"), x=c(1,2,3,4,1,2), y=c(1,2,1,2,1,2)) newdf <-by(df$x, df$ID, mean) newdf ## df$ID: a
## [1] 1.666667
## ------------------------------------------------------------
## df\$ID: b
## [1] 2.666667

However, it essentially gives us a list object.

We can use cbind() to convert into a column vector:

cdf <-cbind(newdf)
cdf
##      newdf
## a 1.666667
## b 2.666667

Or we can use rbind() to convert into a row vector:

rdf <-rbind(newdf)
rdf
##              a        b
## newdf 1.666667 2.666667